This article introduces GTE-ModernColBERT-v1, a new model built on the ColBERT architecture that integrates the ModernBERT foundation.

Researchers from LightOn AI have presented GTE-ModernColBERT-v1, a model that builds upon the ColBERT architecture and integrates the ModernBERT foundation developed by Alibaba-NLP. Their aim was to distill knowledge from a base model and optimize it on the MS MARCO dataset, hoping to overcome limitations related to context length and semantic preservation. The model was trained using 300-token document inputs but demonstrated the ability to handle inputs as large as 8192 tokens, making it suitable for indexing and retrieving longer documents with minimal information loss. This work was deployed through PyLate, a library that simplifies the indexing and querying of documents using dense vector models. The model performs token-level semantic matching using the MaxSim operator, which evaluates similarity between individual token embeddings rather than compressing them into a single vector.
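To make the MaxSim operator concrete, the short sketch below scores a query against a document from their token embeddings: for every query token it keeps the similarity of its best-matching document token and sums those maxima. This is an illustration of the scoring rule only, using randomly generated 128-dimensional embeddings rather than LightOn's implementation.

```python
import numpy as np

def maxsim_score(query_embeddings: np.ndarray, doc_embeddings: np.ndarray) -> float:
    """Late-interaction MaxSim: for each query token, take its best-matching
    document token similarity, then sum those maxima over all query tokens."""
    # (n_query_tokens, n_doc_tokens) matrix of token-to-token similarities
    similarity = query_embeddings @ doc_embeddings.T
    return float(similarity.max(axis=1).sum())

# Toy example with random 128-dimensional token embeddings (illustrative only)
rng = np.random.default_rng(0)
query = rng.standard_normal((8, 128))       # e.g. 8 query tokens
document = rng.standard_normal((300, 128))  # e.g. 300 document tokens
# L2-normalise so the dot product acts as cosine similarity
query /= np.linalg.norm(query, axis=1, keepdims=True)
document /= np.linalg.norm(document, axis=1, keepdims=True)
print(maxsim_score(query, document))
```

Because each query token is matched independently, fine-grained signals (a rare term, a number, a name) are not averaged away as they would be in a single-vector representation.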
GTE-ModernColBERT-v1 transforms text into 128-dimensional dense vectors and uses the MaxSim function for computing semantic similarity between query and document tokens, preserving granular context and enabling more refined retrieval. It integrates with PyLate’s Voyager indexing system, which manages large-scale embeddings using an efficient HNSW (Hierarchical Navigable Small World) index. Once documents are embedded and stored, users can retrieve the top-k most relevant documents using the ColBERT retriever. This process supports full pipeline indexing and lightweight reranking for first-stage retrieval systems. PyLate offers flexibility in modifying document length during inference, allowing users to handle texts much longer than the model was originally trained on, an advantage rarely seen in standard embedding models.
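The snippet below is a condensed sketch of that indexing-and-retrieval workflow, following the usage pattern documented by PyLate; exact argument names (such as `index_folder`, `index_name`, or the optional `document_length` setting) may differ across PyLate versions, and the model identifier `lightonai/GTE-ModernColBERT-v1` is assumed here.

```python
# Requires: pip install pylate
from pylate import indexes, models, retrieve

# Load GTE-ModernColBERT-v1 as a PyLate ColBERT model.
model = models.ColBERT(
    model_name_or_path="lightonai/GTE-ModernColBERT-v1",
    # document_length=8192,  # if supported by your PyLate version: raise the
    #                        # document token budget beyond the 300-token training length
)

# Voyager is PyLate's HNSW-backed index for token-level embeddings.
index = indexes.Voyager(
    index_folder="pylate-index",
    index_name="demo",
    override=True,
)

documents = [
    "ColBERT performs late interaction over token embeddings.",
    "HNSW indexes enable approximate nearest-neighbour search at scale.",
]
documents_embeddings = model.encode(
    documents,
    is_query=False,  # document-side encoding
)
index.add_documents(
    documents_ids=["doc-0", "doc-1"],
    documents_embeddings=documents_embeddings,
)

# First-stage retrieval with the ColBERT retriever over the Voyager index.
retriever = retrieve.ColBERT(index=index)
queries_embeddings = model.encode(
    ["what is late interaction?"],
    is_query=True,   # query-side encoding
)
results = retriever.retrieve(queries_embeddings=queries_embeddings, k=2)
print(results)  # top-k document ids with MaxSim scores for each query
```

The same encoder and MaxSim scoring can also be reused to rerank candidates produced by a cheaper first-stage retriever, which is the lightweight reranking setup mentioned above.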
On the NanoClimate dataset, the model achieved a MaxSim Accuracy@1 of 0.360, Accuracy@5 of 0.780, and Accuracy@10 of 0.860. Precision and recall scores were consistent, with MaxSim Recall@3 reaching 0.289 and Precision@3 at 0.233. These scores highlight the model’s capability to retrieve accurate results even in longer-context retrieval scenarios.
When evaluated on the BEIR benchmark, GTE-ModernColBERT outperformed previous models, including ColBERT-small. It scored 54.89 on the FiQA2018 dataset, 48.51 on NFCorpus, and 83.59 on the TREC-COVID task, and its average performance across these tasks was significantly higher than that of baseline ColBERT variants. Notably, on the LongEmbed benchmark, the model achieved a mean score of 88.39 and scored 78.82 on LEMB Narrative QA Retrieval, surpassing other leading models such as voyage-multilingual-2 (79.17) and bge-m3 (58.73), an advantage of almost 10 points on long-context tasks. These results suggest that the model generalizes robustly and handles long-context documents effectively. It is also highly adaptable to different retrieval pipelines, supporting both indexing and reranking implementations, making it a versatile solution for scalable semantic search.
This research provides a meaningful contribution to long-document semantic retrieval. By combining the strengths of token-level matching with a scalable architecture, GTE-ModernColBERT-v1 addresses several bottlenecks that current models face. It introduces a reliable method for processing and retrieving semantically rich information from extended contexts, significantly improving precision and recall in longer-context retrieval scenarios.