Low-Rank Sparse Attention (Lorsa) Disentangles Atomic Attention Units

2025/05/08 02:07

Large Language Models (LLMs) have attracted considerable attention in recent years, but understanding their internal mechanisms remains difficult.

Large Language Models (LLMs) have recently come into the spotlight, yet their internal mechanisms remain hard to understand. When examining individual attention heads in Transformer models, researchers have identified specific functions in some heads. For instance, researchers have discovered induction heads in the Pythia model that predict ‘Potter’ after ‘Harry’ when the phrase has already appeared in the context, and ablation studies confirm that these heads causally contribute to model behaviour. However, most attention heads spread their attention across diverse contexts without any clear function.

These attention patterns are hard to interpret because heads collaborate with one another rather than each performing an isolated function. This resembles feature superposition in neural networks, where neurons encode multiple features within a low-dimensional space. To decompose this attention superposition in Multi-Head Self-Attention (MHSA), the research proposes an overcomplete sparse attention architecture called Low-Rank Sparse Attention (Lorsa). It takes inspiration from Sparse Autoencoders (SAEs), which extract overcomplete sets of sparse, linearly interpretable features from neural networks.

The attention superposition hypothesis holds that MHSA consists of multiple attention units in superposition, each attending between specific token pairs and performing interpretable read/write operations on the residual stream. Under this hypothesis, a single atomic attention unit may be spread across multiple MHSA heads, while an individual head may contain several attention units.
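
To make the hypothesis concrete, here is one way to write it down. The notation is our own assumption, not the article's: the residual stream at position j is X_j, and there are U attention units, each with its own attention pattern A^(u), a read direction w_V^(u) and a write direction w_O^(u).

```latex
\mathrm{MHSA}(X)_i \;\approx\; \sum_{u=1}^{U} \Big( \sum_{j \le i} A^{(u)}_{ij} \, X_j^{\top} \mathbf{w}^{(u)}_{V} \Big) \, \mathbf{w}^{(u)}_{O}
```

Read this way, superposition means U is much larger than the number of heads, and a unit's read/write pair need not sit inside any single head's OV circuit. Lorsa's heads, with their one-dimensional OV circuits, have this same per-unit form.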

Three key pieces of evidence support attention superposition. First, polysemantic heads respond to unrelated inputs; successor heads, for example, increment days and numbers while also exhibiting acronym and copying behaviours. Second, most attention heads lack clear interpretable patterns, with studies reporting failed interpretation attempts for over 90% of GPT-2 heads. Third, direct observations show attention output features that are jointly produced by multiple heads, with approximately 25% of learned attention units spread across multiple MHSA heads.

This lack of interpretability is a major hurdle to attributing model behavior to specific internal circuits. Understanding the structure of attention superposition may also explain why some attention units, such as induction heads, are implemented by a single MHSA head while others exist in superposition.

To address this, Lorsa is trained to predict MHSA outputs by minimizing the mean squared error. Each Lorsa head uses a one-dimensional OV circuit, which restricts its read and write operations to specific residual stream features, in line with the linear representation hypothesis. For the query and key weights, Lorsa shares parameters across every group of D_Lorsa^QK heads, keeping the parameter count low while preserving performance. As a result, Lorsa's QK circuits resemble those of MHSA, but with sparsity constraints on each OV dimension.
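
A minimal pure-Python sketch of the forward pass and loss described above, under stated assumptions: biases, layer norm and positional encodings are omitted, attention is a causal softmax with 1/sqrt(d_qk) scaling, and all names (`lorsa_activations`, `qk_group`, `w_V`, `w_O`) are hypothetical rather than taken from the authors' code. Heads with the same `qk_group` id share `W_Q`/`W_K` and therefore one attention pattern; each head has its own one-dimensional OV circuit, a read vector `w_V[h]` and a write vector `w_O[h]`.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def matvec(W, x):
    # W is a list of rows (out_dim x in_dim); returns W @ x as a list
    return [dot(row, x) for row in W]


def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def lorsa_activations(X, W_Q, W_K, w_V, qk_group):
    """Return z[i][h], the activation of head h at position i.

    X: list of n residual-stream vectors, each of length d
    W_Q, W_K: dicts mapping group id to a (d_qk x d) matrix shared by that group
    w_V: list of H read vectors (length d), one per head
    qk_group: list of H group ids; heads with equal ids share W_Q and W_K
    """
    n, H = len(X), len(w_V)
    # 1-D value of token j for head h: projection of X_j onto the head's read vector
    v = [[dot(x, w) for w in w_V] for x in X]
    z = [[0.0] * H for _ in range(n)]
    for g in set(qk_group):
        q = [matvec(W_Q[g], x) for x in X]
        k = [matvec(W_K[g], x) for x in X]
        scale = math.sqrt(len(q[0]))
        heads = [h for h in range(H) if qk_group[h] == g]
        for i in range(n):
            # causal pattern over positions 0..i, computed once and reused by the group
            A = softmax([dot(q[i], k[j]) / scale for j in range(i + 1)])
            for h in heads:
                z[i][h] = sum(A[j] * v[j][h] for j in range(i + 1))
    return z


def lorsa_output(z, w_O):
    """Dense output: each head writes z[i][h] times its write vector w_O[h]."""
    d = len(w_O[0])
    return [[sum(z_i[h] * w_O[h][c] for h in range(len(w_O))) for c in range(d)]
            for z_i in z]


def mse(pred, target):
    """Mean squared error between the Lorsa output and the target MHSA output."""
    sq = [(p - t) ** 2 for pr, tr in zip(pred, target) for p, t in zip(pr, tr)]
    return sum(sq) / len(sq)
```

In training, `mse(lorsa_output(z, w_O), mhsa_out)` would be minimized over all parameters, where `mhsa_out` is the recorded output of the original attention layer. The sketch computes only the forward pass and loss (no optimizer), and it sums every head densely rather than applying the top-K selection described next.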

Lorsa uses orders of magnitude more heads than standard MHSA. At each position, Lorsa’s output aggregates only the top-K heads with the largest activation values, so the set of active heads changes dynamically from one token position to the next. This is similar to TopK-SAEs, which select the most salient linear components. However, Lorsa’s head activations are computed from attention over previous tokens rather than by a simple linear encoder with a ReLU.
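
The per-position top-K step can be sketched as follows. This is an illustration with hypothetical names, assuming the head activations `z[i][h]` have already been computed from the attention patterns and that each head writes along a single direction `w_O[h]`.

```python
import heapq


def topk_aggregate(z, w_O, k):
    """Keep only the k most active heads at each position and sum their writes.

    z: z[i][h] is the activation of head h at position i
    w_O: w_O[h] is head h's one-dimensional write direction (length d)
    k: number of heads allowed to contribute at each position
    Returns (outputs, active_sets); active_sets[i] lists the heads kept at i.
    """
    d = len(w_O[0])
    outputs, active_sets = [], []
    for z_i in z:
        # indices of the k largest activations at this position
        active = heapq.nlargest(k, range(len(z_i)), key=z_i.__getitem__)
        out = [0.0] * d
        for h in active:
            for c in range(d):
                out[c] += z_i[h] * w_O[h][c]
        outputs.append(out)
        active_sets.append(sorted(active))
    return outputs, active_sets


z = [[0.1, 3.0, -2.0, 0.5],   # position 0
     [2.0, 0.0, 1.0, 4.0]]    # position 1
w_O = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 0.0]]
outputs, active_sets = topk_aggregate(z, w_O, k=2)
print(active_sets)  # [[1, 3], [0, 3]]: the active subset differs by position
print(outputs)      # [[1.0, 3.0], [10.0, 0.0]]
```

Only k of the H heads contribute at any position, which is what keeps each output a sparse combination of interpretable heads even when H is very large.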

Lorsa’s interpretability assessment uses several key metrics to understand what individual heads do. Top activations reveal patterns by examining the 16 highest-activating tokens for each Lorsa head across 100 million samples of held-out data. Z pattern analysis decomposes each activation linearly into token-wise contributions from preceding positions, revealing which previous tokens drive the current activation. This parallels the direct feature attribution used for attention Sparse Autoencoders, but the attribution is simpler because each Lorsa head involves only one one-dimensional OV circuit and a single QK circuit.
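
Because a Lorsa head’s activation is a weighted sum over previous tokens, its z pattern can be obtained by keeping the terms of that sum separate. A minimal sketch, assuming the attention row for the query position has already been computed; the function names are hypothetical and the toy numbers are made up for illustration.

```python
def z_pattern(attn_row, X, w_V):
    """Split one head's activation at query position i into per-token contributions.

    attn_row: attention weights A[i][0..i] from the head's QK circuit
    X: residual-stream vectors at positions 0..i
    w_V: the head's one-dimensional read vector
    The contribution of source token j is A[i][j] * (X_j . w_V); because the
    head is linear in its values, the contributions add up to the activation.
    """
    return [a * sum(x_c * w_c for x_c, w_c in zip(x, w_V))
            for a, x in zip(attn_row, X)]


def top_contributors(contributions, n=3):
    """Indices of the n source positions with the largest contributions."""
    order = sorted(range(len(contributions)),
                   key=contributions.__getitem__, reverse=True)
    return order[:n]


attn_row = [0.1, 0.7, 0.2]
X = [[1.0, 0.0], [2.0, 0.0], [0.0, 1.0]]
w_V = [1.0, 0.5]
contribs = z_pattern(attn_row, X, w_V)
print(contribs)                       # approximately [0.1, 1.4, 0.1]
print(sum(contribs))                  # the head activation, approximately 1.6
print(top_contributors(contribs, 1))  # [1]
```

With only one QK circuit and a one-dimensional OV circuit per head, this single list is the whole attribution; there is no need to sum contributions across several heads.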

A visualisation dashboard provides comprehensive information about each Lorsa head. For example, a “you”-specific induction head shows several important patterns: through its read vector, it primarily reads features indicating that the current token is “you”/“your”; it strongly activates a “say you” feature that amplifies the logit of “you”; and it increases the predicted probabilities of various “you” tokens. Its QK attention pattern is computed from current-token features at the query position and, at the key position, from features indicating that the current token is “you” together with previous-token features, the previous token often being a word like “with,” “thank,” or “do.” Interestingly, this Lorsa head is distributed almost equally between two MHSA heads (5.0 and 5.7), showing how Lorsa disentangles attention units that are spread across multiple standard attention heads.

The research, conducted by the Shanghai Innovation Institute, OpenMOSS Team, and Fudan University, evaluated Lorsa on both Pythia-160M and Llama-3.1-8B models. Using an exploration interface and a visualization dashboard, they quantitatively assessed Lorsa’s interpretability through top activations and attribution patterns.

The results showed that Lorsa's monosemanticity compares favorably with that of Sparse Autoencoder features. In Pythia-160M, Lorsa successfully identified known attention mechanisms such as induction heads, name mover heads, successor heads, and attention sinks, which researchers had previously discovered using techniques like activation patching.

Original source: marktechpost
