This article introduces Multi-Token Attention (MTA), an advanced attention mechanism that conditions attention weights on multiple query and key vectors simultaneously.
Large Language Models (LLMs) have benefited significantly from attention mechanisms, which enable the effective retrieval of contextual information. However, traditional attention methods rely primarily on single-token attention, where each attention weight is computed from a single query-key vector pair.
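For contrast, a minimal sketch of this single-token computation in PyTorch is shown below; the tensor shapes and the function name are illustrative rather than taken from the paper. Every attention weight comes from exactly one query-key dot product:

```python
# Minimal sketch of standard single-token attention (illustrative shapes and names).
import torch
import torch.nn.functional as F

def single_token_attention(q, k, v):
    # q, k, v: (batch, heads, seq_len, head_dim)
    # Each logit depends on exactly one (query, key) pair.
    logits = torch.einsum("bhqd,bhkd->bhqk", q, k) / (q.shape[-1] ** 0.5)
    weights = F.softmax(logits, dim=-1)  # one weight per (query, key) pair
    return torch.einsum("bhqk,bhkd->bhqd", weights, v)
```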
This design inherently constrains the model's ability to discern contexts that require the integration of multiple token signals, ultimately limiting its effectiveness on complex linguistic dependencies. For instance, identifying sentences that simultaneously contain both "Alice" and "rabbit" poses a challenge because conventional attention mechanisms struggle to combine multiple separate attention signals efficiently without substantially increasing model complexity.
To address this limitation, researchers from Meta AI have introduced Multi-Token Attention (MTA), an advanced attention mechanism that simultaneously conditions attention weights on multiple query and key vectors. MTA integrates convolution operations over queries, keys, and attention heads, thus enhancing the precision and efficiency of contextual information retrieval.
The MTA framework consists of two convolutional components:
1) key-query convolution, which aggregates multiple token signals within individual attention heads, and
2) head mixing convolution, which facilitates information sharing among different attention heads. MTA is implemented using group normalization with depth-dependent scaling to stabilize gradient flow, further improving model training stability and efficacy.
At a technical level, MTA modifies standard attention calculations by incorporating a two-dimensional convolution operation on the attention logits before softmax normalization. This convolution allows adjacent queries and keys to influence attention scores mutually, enabling the attention mechanism to identify contextual relationships more precisely. Consequently, the model efficiently aggregates local token interactions without significantly increasing the number of parameters or the dimensionality of attention vectors.
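One way to picture this step is the rough PyTorch sketch below: a depthwise 2D convolution is applied over the (query, key) plane of the attention logits before softmax. The kernel sizes, the causal padding scheme, and the class name are assumptions made for illustration, not the paper's reference implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class KeyQueryConvAttention(nn.Module):
    """Sketch: depthwise 2D convolution over the attention logits before softmax."""

    def __init__(self, num_heads, q_kernel=7, k_kernel=11):
        super().__init__()
        self.q_kernel, self.k_kernel = q_kernel, k_kernel
        # One 2D kernel per head (groups=num_heads); kernel sizes are illustrative.
        self.conv = nn.Conv2d(num_heads, num_heads, (q_kernel, k_kernel),
                              groups=num_heads, bias=False)

    def forward(self, q, k, v, causal_mask):
        # q, k, v: (batch, heads, seq_len, head_dim); causal_mask: (seq_len, seq_len) bool
        logits = torch.einsum("bhqd,bhkd->bhqk", q, k) / (q.shape[-1] ** 0.5)
        # Zero out future keys before convolving so they cannot leak into the window.
        logits = logits.masked_fill(~causal_mask, 0.0)
        # Pad so the kernel only looks at past query rows (causal in the query axis)
        # and a symmetric neighbourhood along the key axis.
        pad_k = self.k_kernel // 2
        padded = F.pad(logits, (pad_k, pad_k, self.q_kernel - 1, 0))
        logits = self.conv(padded)  # neighbouring logits now influence each other
        # Re-apply the causal mask and normalise as usual.
        logits = logits.masked_fill(~causal_mask, float("-inf"))
        weights = F.softmax(logits, dim=-1)
        return torch.einsum("bhqk,bhkd->bhqd", weights, v)
```

A standard lower-triangular mask such as `torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))` can be passed as `causal_mask`. The essential point is that a logit at one (query, key) position is shaped by its neighbours, while the added parameter count stays tiny (one small kernel per head).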
MTA promotes effective knowledge transfer among attention heads, selectively amplifying relevant context signals while attenuating less pertinent information. These enhancements collectively yield a more robust attention mechanism capable of capturing complex multi-token interactions.
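A possible sketch of that head-mixing step, together with the group normalization and depth-dependent scaling mentioned earlier, is given below. The group size, the placement of the mixing after softmax, and the exact scaling rule are illustrative assumptions, and the class name is hypothetical.

```python
import torch
import torch.nn as nn

class HeadMixingOutput(nn.Module):
    """Sketch: mix attention maps within small groups of heads, then group-normalise
    the per-head outputs and apply a depth-dependent scale."""

    def __init__(self, num_heads, head_dim, layer_idx, group_size=4):
        super().__init__()
        assert num_heads % group_size == 0
        # 1x1 grouped convolution over the head (channel) axis: each head's attention
        # map becomes a learned combination of the maps in its group of heads.
        self.head_mix = nn.Conv2d(num_heads, num_heads, kernel_size=1,
                                  groups=num_heads // group_size, bias=False)
        # Group normalisation with one group per head.
        self.norm = nn.GroupNorm(num_groups=num_heads, num_channels=num_heads * head_dim)
        # One plausible reading of "depth-dependent scaling": shrink with layer depth.
        self.scale = 1.0 / (1.0 + layer_idx) ** 0.5

    def forward(self, attn_weights, v):
        # attn_weights: (batch, heads, seq, seq); v: (batch, heads, seq, head_dim)
        mixed = self.head_mix(attn_weights)                 # share signal across heads
        out = torch.einsum("bhqk,bhkd->bhqd", mixed, v)     # per-head outputs
        b, h, L, d = out.shape
        out = self.norm(out.permute(0, 1, 3, 2).reshape(b, h * d, L))
        return out.reshape(b, h, d, L).permute(0, 1, 3, 2) * self.scale
```

In a full block, a module like this would sit between the softmax over the (convolved) logits and the output projection, so each head's output draws on the attention patterns of its neighbouring heads.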
Empirical evaluations validate the efficacy of MTA across several natural language processing (NLP) benchmarks. In a structured motivating task explicitly designed to illustrate the shortcomings of single-token attention mechanisms, MTA demonstrated near-perfect performance, achieving an error rate of only 0.1% in tasks with 4 x 1024 token sequences. In contrast, standard Transformer models exhibited error rates greater than 50%.
Further large-scale experiments trained 880M-parameter models on 105 billion tokens with both MTA and baseline architectures. MTA achieved superior validation perplexity scores across diverse datasets such as arXiv, GitHub, and Wikipedia.
MTA outperformed standard Transformer models in tasks requiring extended context comprehension, such as the Needle-in-the-Haystack and BabiLong benchmarks. In the Needle-in-the-Haystack task with 4K token contexts containing multiple needles, MTA achieved accuracies ranging from 67% to 97.6%, surpassing standard models by substantial margins. These results highlight the potential of MTA for enabling LLMs to efficiently process very long-range dependencies.
In summary, Multi-Token Attention (MTA) presents a refined advancement in attention mechanisms by addressing fundamental limitations of traditional single-token attention. Leveraging convolutional operations to concurrently integrate multiple query-key interactions, MTA enhances the ability of language models to handle intricate contextual dependencies.
These methodological improvements facilitate more precise and efficient performance, particularly in scenarios involving complex token interactions and long-range contextual understanding. Through targeted modifications to standard attention mechanisms, MTA contributes meaningfully to the evolution of more sophisticated, accurate, and computationally efficient language models.