Large Language Models (LLMs) have recently come into the spotlight, yet comprehending their internal mechanisms remains a challenge. When examining individual attention heads in Transformer models, researchers have identified specific functionalities in some of them. For instance, induction heads discovered in the Pythia model predict tokens like ‘Potter’ following ‘Harry’ when the phrase has appeared earlier in the context, and ablation studies confirm these heads’ causal relationship to model behaviour. However, most attention heads distribute their focus across diverse contexts without clear functionality.
The challenge lies in interpreting these complex attention patterns, as inter-head collaboration occurs rather than isolated functionality. This phenomenon is similar to how neurons in the brain can encode multiple features in a low-dimensional space, leading to feature superposition. The research proposes an overcomplete sparse attention architecture, termed Low-Rank Sparse Attention (Lorsa), to decompose attention superposition in Multi-Head Self-Attention (MHSA) mechanisms, taking inspiration from Sparse Autoencoders (SAEs) that extract overcomplete sets of sparse, linearly comprehensible features from neural networks.
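For reference, the sketch below illustrates the SAE idea that Lorsa borrows: an overcomplete dictionary of features, of which only a sparse subset is active for any given residual-stream vector. The dimensions, the TopK sparsity rule, and the PyTorch framing are illustrative assumptions, not details taken from the paper.

```python
import torch
import torch.nn as nn

class TopKSAE(nn.Module):
    """Toy TopK sparse autoencoder: an overcomplete feature dictionary with
    only the K most active features kept per input (all sizes are assumed)."""

    def __init__(self, d_model: int = 768, n_features: int = 24576, k: int = 32):
        super().__init__()
        self.k = k
        self.encoder = nn.Linear(d_model, n_features)   # feature "read" directions
        self.decoder = nn.Linear(n_features, d_model)   # feature "write" directions

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        acts = torch.relu(self.encoder(x))               # candidate feature activations
        topk = torch.topk(acts, self.k, dim=-1)          # keep only the K most salient
        sparse = torch.zeros_like(acts).scatter_(-1, topk.indices, topk.values)
        return self.decoder(sparse)                      # reconstruct the activation

x = torch.randn(4, 768)                  # a batch of residual-stream vectors
sae = TopKSAE()
loss = torch.mean((sae(x) - x) ** 2)     # trained to reconstruct from sparse features
```

Lorsa applies the same overcomplete-and-sparse recipe, but to attention outputs rather than to residual-stream features.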
Attention superposition arises from the hypothesis that MHSA comprises multiple attention units in superposition, each attending between specific token pairs with interpretable read/write operations on the residual stream. This hypothesis suggests atomic attention units might be spread across multiple MHSA heads, while individual heads contain a few attention units.
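Stated informally, and with notation assumed here for illustration rather than taken from the paper, the hypothesis says that the MHSA output can be re-expressed as a much larger sum of atomic units, each reading one residual-stream direction and writing another:

```latex
% Informal restatement of the attention-superposition hypothesis.
% H is the number of MHSA heads, M >> H the number of hypothesized atomic units;
% a^h_{ij} and alpha^m_{ij} are attention weights, v_m / u_m are a unit's
% read / write directions. All symbols are assumptions for illustration.
\[
\mathrm{MHSA}(x)_i
  = \sum_{h=1}^{H} \sum_{j \le i} a^{h}_{ij} \, W_{OV}^{h} x_j
  \;\approx\; \sum_{m=1}^{M} \sum_{j \le i} \alpha^{m}_{ij} \,
      \big( v_m^{\top} x_j \big) \, u_m ,
  \qquad M \gg H .
\]
```

Under this reading, a cleanly interpretable head such as an induction head is the special case where one atomic unit happens to line up with a single MHSA head, while superposed units spread their read/write directions across several heads.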
Three key pieces of evidence support attention superposition. First, polysemantic heads respond to unrelated inputs: successor heads, for example, increment days and numbers while also exhibiting acronym and copying behaviours. Second, most attention heads lack clear interpretation patterns, with studies reporting failed interpretation attempts for over 90% of GPT-2 heads. Third, direct observations show attention output features being contributed collectively by multiple heads, with approximately 25% of learned attention units spanned by multiple MHSA heads.
This lack of interpretability is a major hurdle in attributing model behavior to specific internal circuits. The structure of attention superposition may hold the key to understanding this biological motif, as it raises the question of why certain attention units, like induction heads, are implemented by single MHSA heads while others exist in superposition.
To address this, Lorsa is trained to predict MHSA outputs by minimizing mean square error. It employs one-dimensional OV circuits that restrict read/write operations to specific residual stream features, aligning with the linear representation hypothesis. For Query and Key weights, Lorsa implements parameter sharing across each group of D_Lorsa^QK heads, maintaining parameter efficiency while preserving performance. This strategy makes Lorsa QK circuits similar to MHSA but with sparsity constraints on each OV dimension.
Lorsa employs orders of magnitude more heads than standard MHSA. For each position, Lorsa’s output aggregates only the top-K heads with the largest activation values, with the active head subset varying dynamically across token positions. This approach is similar to TopK-SAEs, selecting the most salient linear components. However, Lorsa’s head activations derive from attention patterns over previous tokens rather than from simple linear encoders with ReLU.
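A minimal sketch tying together the last two paragraphs is shown below: each head has a one-dimensional OV circuit (one read and one write direction on the residual stream), QK parameters are shared within groups of heads, only the top-K most active heads contribute at each position, and the module is trained to reconstruct MHSA outputs with mean squared error. All shapes, group sizes, the value of K, and the PyTorch framing are assumptions for illustration, not the authors’ implementation.

```python
import torch
import torch.nn as nn

class LorsaSketch(nn.Module):
    """Illustrative Lorsa-style layer: many heads, each with a one-dimensional
    OV circuit (one read and one write direction on the residual stream),
    QK parameters shared within groups of heads, and top-K head selection.
    All hyperparameters below are assumptions, not the paper's settings."""

    def __init__(self, d_model=768, n_heads=8192, d_qk=32, qk_group=64, k=64):
        super().__init__()
        assert n_heads % qk_group == 0
        self.n_heads, self.qk_group, self.k = n_heads, qk_group, k
        n_groups = n_heads // qk_group
        self.v_read = nn.Parameter(0.02 * torch.randn(n_heads, d_model))     # per-head read direction
        self.o_write = nn.Parameter(0.02 * torch.randn(n_heads, d_model))    # per-head write direction
        self.W_Q = nn.Parameter(0.02 * torch.randn(n_groups, d_model, d_qk)) # QK shared per group
        self.W_K = nn.Parameter(0.02 * torch.randn(n_groups, d_model, d_qk))

    def forward(self, x):
        # x: [seq, d_model] residual-stream inputs to the attention layer.
        seq = x.size(0)
        queries = torch.einsum('sd,gdk->gsk', x, self.W_Q)
        keys = torch.einsum('sd,gdk->gsk', x, self.W_K)
        scores = torch.einsum('gsk,gtk->gst', queries, keys) / (queries.size(-1) ** 0.5)
        causal = torch.tril(torch.ones(seq, seq, dtype=torch.bool))
        attn = torch.softmax(scores.masked_fill(~causal, float('-inf')), dim=-1)
        # One-dimensional OV: each head reads a single scalar from every source position.
        values = torch.einsum('sd,hd->hs', x, self.v_read).reshape(-1, self.qk_group, seq)
        z = torch.einsum('gst,ght->sgh', attn, values).reshape(seq, self.n_heads)
        # Keep only the top-K most active heads per position; the subset varies by token.
        topk = torch.topk(z, self.k, dim=-1)
        z_sparse = torch.zeros_like(z).scatter_(-1, topk.indices, topk.values)
        return z_sparse @ self.o_write      # each active head writes its own direction

# Training objective: reconstruct the MHSA output with mean squared error.
x = torch.randn(16, 768)            # random stand-in residual-stream inputs
mhsa_out = torch.randn(16, 768)     # random stand-in MHSA outputs to be predicted
lorsa = LorsaSketch()
loss = torch.mean((lorsa(x) - mhsa_out) ** 2)
loss.backward()
```

Keeping each OV circuit one-dimensional is what makes a head directly inspectable: its entire effect on the residual stream reduces to one read direction and one write direction.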
Lorsa’s interpretability assessment uses several key metrics to understand individual head functionality. Top activations help identify patterns by examining the 16 highest-activating tokens for each Lorsa head across 100 million samples from held-out data. The z pattern analysis decomposes activations linearly into token-wise contributions from preceding positions, revealing which previous tokens contribute to current activations. This approach parallels direct feature attribution analysis used for attention Sparse Autoencoders, but with simpler attribution involving just one one-dimensional OV circuit and a single QK circuit.
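Because a Lorsa head’s activation is an attention-weighted sum of a single scalar read from each preceding position, this token-wise decomposition is exact. The toy snippet below, using random stand-in tensors rather than a trained head, shows how an activation splits into per-token contributions.

```python
import torch

seq, d_model = 8, 768
x = torch.randn(seq, d_model)        # residual-stream vectors (random stand-ins)
v_read = torch.randn(d_model)        # the head's one-dimensional read (value) direction
scores = torch.randn(seq, seq)       # stand-in QK attention scores
causal = torch.tril(torch.ones(seq, seq, dtype=torch.bool))
attn = torch.softmax(scores.masked_fill(~causal, float('-inf')), dim=-1)

i = seq - 1                          # the current (query) position
contrib = attn[i] * (x @ v_read)     # contribution of each preceding token to z_i
z_i = contrib.sum()                  # the head's activation at position i
print(contrib)                       # shows which previous tokens drive the activation
```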
A visualisation dashboard provides comprehensive information about each Lorsa head. For example, a “you”-specific induction head shows several important patterns: it primarily reads from features indicating the current token is “you”/”your” through its weight vector, strongly activates a “say you” feature that amplifies the logit of “you,” and increases prediction probabilities for various “you” tokens. The QK attention pattern computation involves current token features at the query position and previous token features where the current token is “you,” with the previous token often being words like “with,” “thank,” or “do.” Interestingly, this particular Lorsa head is almost equally distributed between two MHSA heads (5.0 and 5.7), demonstrating how Lorsa successfully disentangles attention units that exist across multiple standard attention heads.
The research, conducted by the Shanghai Innovation Institute, OpenMOSS Team, and Fudan University, evaluated Lorsa on both Pythia-160M and Llama-3.1-8B models. Using an exploration interface and a visualization dashboard, they quantitatively assessed Lorsa’s interpretability through top activations and attribution patterns.
The results showed that Lorsa's monosemanticity compares favorably to Sparse Autoencoder features. In Pythia-160M, Lorsa successfully identified known attention mechanisms such as induction heads, name mover heads, successor heads, and attention sinks, which had previously been discovered by researchers using techniques like activation patching.