This article introduces GTE-ModernColBERT-v1, a new model built on the ColBERT architecture that integrates the ModernBERT foundation.

Researchers from LightOn AI have presented GTE-ModernColBERT-v1, a model that builds upon the ColBERT architecture and integrates the ModernBERT foundation developed by Alibaba-NLP. Their aim was to distill knowledge from a base model and optimize it on the MS MARCO dataset, hoping to overcome limitations related to context length and semantic preservation. The model was trained using 300-token document inputs but demonstrated the ability to handle inputs as large as 8192 tokens, making it suitable for indexing and retrieving longer documents with minimal information loss. This work was deployed through PyLate, a library that simplifies the indexing and querying of documents using dense vector models. The model performs token-level semantic matching using the MaxSim operator, which evaluates similarity between individual token embeddings rather than compressing them into a single vector.
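The MaxSim operator described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the model's actual implementation: it scores a query against a document by taking, for each query-token embedding, the maximum cosine similarity over all document-token embeddings, then summing those maxima. The toy vectors are 2-dimensional for readability; GTE-ModernColBERT-v1 uses 128-dimensional token embeddings.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def maxsim(query_embeddings, doc_embeddings):
    """ColBERT-style MaxSim: for each query-token embedding, take the
    maximum similarity against all document-token embeddings, then sum."""
    return sum(
        max(cosine(q, d) for d in doc_embeddings)
        for q in query_embeddings
    )

# Toy 2-dimensional token embeddings (the real model uses 128 dimensions).
query = [[1.0, 0.0], [0.0, 1.0]]
doc = [[1.0, 0.0], [0.7, 0.7]]

print(round(maxsim(query, doc), 3))  # one max per query token, summed
```

Because each query token is matched independently against the whole document, a single relevant passage in a long document can still contribute a strong score, which is what makes this operator well suited to long-context retrieval.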
GTE-ModernColBERT-v1 transforms text into 128-dimensional dense vectors and uses the MaxSim function for computing semantic similarity between query and document tokens, preserving granular context and enabling more refined retrieval. It integrates with PyLate’s Voyager indexing system, which manages large-scale embeddings using an efficient HNSW (Hierarchical Navigable Small World) index. Once documents are embedded and stored, users can retrieve the top-k most relevant documents using the ColBERT retriever. This process supports full pipeline indexing and lightweight reranking for first-stage retrieval systems. PyLate offers flexibility in modifying document length during inference, allowing users to handle texts much longer than the model was originally trained on, an advantage rarely seen in standard embedding models.
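The retrieval flow described above can be sketched as a toy in-memory pipeline. To keep it self-contained, this sketch uses a brute-force scan over a small dictionary index rather than PyLate's actual API; in a real deployment, the exhaustive loop below is replaced by an approximate-nearest-neighbor structure such as the HNSW-based Voyager index. The function name `retrieve_top_k` and the toy data are illustrative, not part of any library.

```python
def maxsim(query_embeddings, doc_embeddings):
    # Sum over query tokens of the max dot product against document tokens.
    # Embeddings here are assumed L2-normalised, so dot product == cosine.
    return sum(
        max(sum(qi * di for qi, di in zip(q, d)) for d in doc_embeddings)
        for q in query_embeddings
    )

def retrieve_top_k(query_embeddings, indexed_docs, k=2):
    """Score every indexed document with MaxSim and return the k best.
    A production system swaps this exhaustive scan for an ANN index (HNSW)."""
    scored = [
        (doc_id, maxsim(query_embeddings, doc_emb))
        for doc_id, doc_emb in indexed_docs.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]

# Toy index: document id -> list of (already normalised) token embeddings.
index = {
    "doc_a": [[1.0, 0.0], [0.0, 1.0]],
    "doc_b": [[0.0, 1.0]],
    "doc_c": [[-1.0, 0.0]],
}
query = [[1.0, 0.0]]

print(retrieve_top_k(query, index, k=2))  # best-matching ids first
```

The same scoring function serves both first-stage retrieval (scan the index) and reranking (rescore a candidate list from a cheaper retriever), which is why the model slots into either role in a pipeline.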
On the NanoClimate dataset, the model achieved a MaxSim Accuracy@1 of 0.360, Accuracy@5 of 0.780, and Accuracy@10 of 0.860. Precision and recall scores were consistent, with MaxSim Recall@3 reaching 0.289 and Precision@3 at 0.233. These scores highlight the model’s capability to retrieve accurate results even in longer-context retrieval scenarios.
When evaluated on the BEIR benchmark, GTE-ModernColBERT outperformed previous models, including ColBERT-small. It scored 54.89 on the FiQA2018 dataset, 48.51 on NFCorpus, and 83.59 on the TREC-COVID task, and its average performance across these tasks was significantly higher than that of baseline ColBERT variants. Notably, on the LongEmbed benchmark, the model achieved a mean score of 88.39 and 78.82 on LEMB Narrative QA Retrieval, surpassing other leading models such as voyage-multilingual-2 (79.17) and bge-m3 (58.73). These results suggest that the model generalizes robustly and handles long-context documents effectively, outperforming many contemporaries by almost 10 points on long-context tasks. It is also highly adaptable to different retrieval pipelines, supporting both indexing and reranking implementations, making it a versatile solution for scalable semantic search.
This research makes a meaningful contribution to long-document semantic retrieval. By combining the strengths of token-level matching with a scalable architecture, GTE-ModernColBERT-v1 addresses several bottlenecks that current models face. It introduces a reliable method for processing and retrieving semantically rich information from extended contexts, significantly improving precision and recall in longer-context retrieval scenarios.