Low-Rank Sparse Attention (Lorsa) Disentangles Atomic Attention Units

2025/05/08 02:07

Large Language Models (LLMs) have attracted significant attention in recent years, but understanding their internal mechanisms remains challenging.

Large Language Models (LLMs) have recently come into the spotlight, yet comprehending their internal mechanisms remains a challenge. When examining individual attention heads in Transformer models, researchers have identified specific functionalities in some of them. For instance, induction heads discovered in the Pythia model predict tokens like ‘Potter’ following ‘Harry’ when the phrase has appeared earlier in context, and ablation studies confirm these heads’ causal relationship to model behaviour. However, most attention heads distribute focus across diverse contexts without clear functionality.

The challenge lies in interpreting these complex attention patterns, as inter-head collaboration occurs rather than isolated functionality. This phenomenon is similar to how neurons in the brain can encode multiple features in a low-dimensional space, leading to feature superposition. The research proposes an overcomplete sparse attention architecture, termed Low-Rank Sparse Attention (Lorsa), to decompose attention superposition in Multi-Head Self-Attention (MHSA) mechanisms, taking inspiration from Sparse Autoencoders (SAEs) that extract overcomplete sets of sparse, linearly comprehensible features from neural networks.

Attention superposition arises from the hypothesis that MHSA comprises multiple attention units in superposition, each attending between specific token pairs with interpretable read/write operations on the residual stream. This hypothesis suggests atomic attention units might be spread across multiple MHSA heads, while individual heads contain a few attention units.
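
To make the hypothesis concrete, here is a minimal numpy sketch of attention superposition: the MHSA output at each position is modelled as the sum of many atomic attention units, each of which attends over previous positions, reads a single residual-stream direction, and writes a single output direction. The unit count, dimensions, and random attention patterns are illustrative assumptions, not the paper's actual parameterisation.

```python
# Minimal numpy sketch of the attention-superposition hypothesis.
# All sizes are illustrative assumptions, not the paper's values.
import numpy as np

rng = np.random.default_rng(0)
T, d_model, n_units = 8, 64, 256           # tokens, residual dim, atomic units

x = rng.normal(size=(T, d_model))           # residual stream activations
read_dirs = rng.normal(size=(n_units, d_model))    # what each unit reads
write_dirs = rng.normal(size=(n_units, d_model))   # what each unit writes
# Per-unit causal attention patterns (each row normalised over previous positions).
attn = np.tril(rng.random(size=(n_units, T, T)))
attn /= attn.sum(axis=-1, keepdims=True)

def unit_output(u):
    """Output of one atomic unit: attend, read a scalar, write a direction."""
    read = x @ read_dirs[u]                 # (T,) scalar read at every position
    mixed = attn[u] @ read                  # (T,) attention-weighted read
    return np.outer(mixed, write_dirs[u])   # (T, d_model) rank-1 write

# Hypothesis: MHSA(x) is approximately the sum of all unit outputs,
# with each standard head hosting several units in superposition.
mhsa_approx = sum(unit_output(u) for u in range(n_units))
print(mhsa_approx.shape)                    # (8, 64)
```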

Three key pieces of evidence support attention superposition. First, polysemantic heads respond to unrelated inputs: for example, successor heads increment days and numbers while simultaneously exhibiting acronym and copying behaviours. Second, most attention heads lack clear interpretation patterns, with studies showing failed interpretation attempts for over 90% of GPT-2 heads. Third, direct observations show attention output features collectively contributed by multiple heads, with approximately 25% of learned attention units spanned by multiple MHSA heads.

This lack of interpretability is a major hurdle in attributing model behavior to specific internal circuits. The structure of attention superposition may hold the key to understanding this biological motif, as it raises the question of why certain attention units, like induction heads, are implemented by single MHSA heads while others exist in superposition.

To address this, Lorsa is trained to predict MHSA outputs by minimizing mean square error. It employs one-dimensional OV circuits that restrict read/write operations to specific residual stream features, aligning with the linear representation hypothesis. For Query and Key weights, Lorsa implements parameter sharing across every D_Lorsa QK head, maintaining parameter efficiency while preserving performance. This strategy makes Lorsa QK circuits similar to MHSA but with sparsity constraints on each OV dimension.
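
The following PyTorch sketch illustrates this setup under stated assumptions: each Lorsa head owns a one-dimensional OV circuit (one read vector and one write vector), groups of heads share a QK circuit, and the layer is trained to reproduce the frozen MHSA output with a mean-squared-error loss. The class name, shapes, and grouping scheme (qk_group) are hypothetical and only meant to convey the structure, not the authors' exact implementation.

```python
# Hedged sketch of a Lorsa-style layer trained against MHSA outputs with MSE.
import torch
import torch.nn as nn
import torch.nn.functional as F

class LorsaLayer(nn.Module):
    def __init__(self, d_model=64, n_heads=1024, d_qk=32, qk_group=64):
        super().__init__()
        assert n_heads % qk_group == 0
        self.n_heads, self.d_qk, self.qk_group = n_heads, d_qk, qk_group
        n_qk = n_heads // qk_group                       # heads sharing one QK circuit
        self.W_Q = nn.Parameter(torch.randn(n_qk, d_model, d_qk) / d_model**0.5)
        self.W_K = nn.Parameter(torch.randn(n_qk, d_model, d_qk) / d_model**0.5)
        self.v_read = nn.Parameter(torch.randn(n_heads, d_model) / d_model**0.5)   # 1-D OV read
        self.v_write = nn.Parameter(torch.randn(n_heads, d_model) / d_model**0.5)  # 1-D OV write

    def forward(self, x):                                # x: (T, d_model) residual stream
        T = x.shape[0]
        q = torch.einsum('td,gde->gte', x, self.W_Q)     # (n_qk, T, d_qk)
        k = torch.einsum('td,gde->gte', x, self.W_K)
        scores = q @ k.transpose(-1, -2) / self.d_qk**0.5
        mask = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)
        attn = scores.masked_fill(mask, float('-inf')).softmax(-1)   # causal patterns
        attn = attn.repeat_interleave(self.qk_group, dim=0)          # (n_heads, T, T)
        read = x @ self.v_read.T                          # (T, n_heads) scalar reads
        z = torch.einsum('hts,sh->th', attn, read)        # (T, n_heads) head activations
        return z @ self.v_write, z                        # reconstruction, activations

def lorsa_loss(lorsa, resid_in, mhsa_out):
    """Train Lorsa to predict the frozen MHSA output with mean squared error."""
    recon, _ = lorsa(resid_in)
    return F.mse_loss(recon, mhsa_out)
```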

Lorsa employs orders of magnitude more heads than standard MHSA. For each position, Lorsa’s output aggregates only the top-K heads with the largest activation values, with the active head subset varying dynamically across token positions. This approach is similar to TopK-SAEs, which select the most salient linear components. However, Lorsa’s head activations derive from attention patterns over previous tokens rather than simple linear encoders with ReLU.
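
A hedged sketch of that aggregation step, continuing the variable names from the previous sketch (assumptions, not the paper's code): given the per-position head activations, only the K largest are kept and the rest are zeroed before the write step, so each token position is reconstructed by its own sparse subset of heads.

```python
import torch

def topk_aggregate(z, v_write, k=64):
    """z: (T, n_heads) Lorsa head activations; v_write: (n_heads, d_model)."""
    idx = z.topk(k, dim=-1).indices                  # (T, k) per-position head ids
    sparse_z = torch.zeros_like(z)
    sparse_z.scatter_(-1, idx, z.gather(-1, idx))    # keep only the top-K activations
    return sparse_z @ v_write                        # (T, d_model) sparse reconstruction

# The active head subset generally differs from position to position.
z = torch.randn(8, 1024)                             # toy activations
out = topk_aggregate(z, torch.randn(1024, 64), k=64)
print(out.shape)                                     # torch.Size([8, 64])
```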

Lorsa’s interpretability assessment uses several key metrics to understand individual head functionality. Top activations help identify patterns by examining the 16 highest-activating tokens for each Lorsa head across 100 million samples from held-out data. The z pattern analysis decomposes activations linearly into token-wise contributions from preceding positions, revealing which previous tokens contribute to current activations. This approach parallels direct feature attribution analysis used for attention Sparse Autoencoders, but with simpler attribution involving just one one-dimensional OV circuit and a single QK circuit.
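
As a rough illustration of the z pattern idea (shapes and names are assumptions carried over from the sketches above): because a Lorsa head's activation is a sum over preceding positions of attention weight times a scalar read, it decomposes exactly into per-source-token contributions, which is what the analysis surfaces alongside the top-activating examples.

```python
import torch

def z_pattern(attn_row, reads):
    """attn_row: (T,) one head's attention from the current position to all
    positions; reads: (T,) that head's scalar read at each position.
    Returns per-source-token contributions that sum to the head's activation."""
    contributions = attn_row * reads          # token-wise contributions
    activation = contributions.sum()          # the head's activation at this position
    return contributions, activation

def top_activations(z_head, n=16):
    """z_head: (N,) one head's activations over held-out corpus positions;
    returns the indices of the n strongest examples for manual inspection."""
    return z_head.topk(n).indices
```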

A visualisation dashboard provides comprehensive information about each Lorsa head. For example, a “you”-specific induction head shows several important patterns: it primarily reads from features indicating the current token is “you”/”your” through its weight vector, strongly activates a “say you” feature that amplifies the logit of “you,” and increases prediction probabilities for various “you” tokens. The QK attention pattern computation involves current token features at the query position and previous token features where the current token is “you,” with the previous token often being words like “with,” “thank,” or “do.” Interestingly, this particular Lorsa head is almost equally distributed between two MHSA heads (5.0 and 5.7), demonstrating how Lorsa successfully disentangles attention units that exist across multiple standard attention heads.

The research, conducted by the Shanghai Innovation Institute, OpenMOSS Team, and Fudan University, evaluated Lorsa on both Pythia-160M and Llama-3.1-8B models. Using an exploration interface and a visualization dashboard, they quantitatively assessed Lorsa’s interpretability through top activations and attribution patterns.

The results showed that Lorsa's monosemanticity compares favorably with that of Sparse Autoencoder features. In Pythia-160M, Lorsa successfully identified known attention mechanisms such as induction heads, name mover heads, successor heads, and attention sinks, which researchers had previously discovered using techniques like activation patching.

Disclaimer: info@kdj.com

The information provided is not trading advice. kDJ.com assumes no responsibility for any investments made based on the information provided in this article. Cryptocurrencies are highly volatile; please research thoroughly and invest with caution!

If you believe that content used on this website infringes your copyright, please contact us immediately (info@kdj.com) and we will remove it promptly.
