RWKV-X: A Linear-Time Long-Context Language Model

2025/05/06 02:09

LLMs built on the Transformer architecture face significant scaling challenges on long-context inputs because self-attention has quadratic complexity in sequence length. Linear-complexity alternatives address this cost: linear attention models, state space models such as Mamba, linear RNNs such as DeltaNet, and RWKV. However, these linear architectures struggle with long-context understanding. For instance, RWKV-7 (2.9B) achieves high accuracy on passkey retrieval up to 28K tokens, but its performance degrades rapidly beyond that point. Even continual pretraining on 128K-length data does not remove the limitation. The issue is not specific to RWKV; it extends to other architectures such as Mamba and is a fundamental challenge for this class of models.
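To make the scaling gap concrete, the sketch below counts unit operations for causal full attention against a model that performs one fixed-size state update per token. The function names and the unit-cost model are illustrative assumptions, not measurements from the paper.

```python
def full_attention_pairs(n_tokens: int) -> int:
    """Causal query-key pairs: token i attends to tokens 0..i (quadratic growth)."""
    return n_tokens * (n_tokens + 1) // 2


def linear_model_steps(n_tokens: int) -> int:
    """One fixed-size state update per token (linear growth)."""
    return n_tokens


# Compare the two counts at a few context lengths mentioned in the article.
for n in (4_096, 28_000, 128_000):
    ratio = full_attention_pairs(n) / linear_model_steps(n)
    print(f"{n:>7} tokens: attention needs about {ratio:,.0f}x more unit operations")
```

Under this simplified cost model the ratio itself grows linearly with context length, which is why the gap widens as inputs get longer.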

Linear-complexity language models are emerging as alternatives to Transformer-based architectures, which incur quadratic computational cost on long sequences. The RWKV series combines Transformer-style parallelizable training with an RNN-like recurrent state representation. It has evolved through several iterations, from the foundational RWKV-4 through RWKV-5 and RWKV-6 to RWKV-7. Hybrid language models, including Jamba, Zamba, and MiniMax, each combine attention with linear-complexity layers in their own way. Additionally, Native Sparse Attention (NSA) organizes tokens into temporal blocks and uses three distinct attention paths: compressed coarse-grained tokens, selectively retained fine-grained tokens, and sliding windows for local contextual information. Other sparse attention methods include SeerAttention and Mixture of Block Attention (MoBA).
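The idea of an RNN-like recurrent state can be illustrated with a generic linear-attention recurrence. This is a simplified sketch, not RWKV's actual update rule: the scalar `decay`, the outer-product update, and the names `step` and `run` are all assumptions chosen for clarity.

```python
from typing import List, Tuple

Vector = List[float]
Matrix = List[List[float]]


def step(state: Matrix, q: Vector, k: Vector, v: Vector,
         decay: float = 0.9) -> Tuple[Matrix, Vector]:
    """Advance a d x d state by one token and return (new_state, output)."""
    d = len(k)
    # Decay the old memory, then add the outer product of key and value.
    new_state = [[decay * state[i][j] + k[i] * v[j] for j in range(d)]
                 for i in range(d)]
    # Read from the memory with the query.
    out = [sum(q[i] * new_state[i][j] for i in range(d)) for j in range(d)]
    return new_state, out


def run(tokens: List[Tuple[Vector, Vector, Vector]], d: int) -> Tuple[Matrix, List[Vector]]:
    """Process a sequence token by token; the state never grows."""
    state = [[0.0] * d for _ in range(d)]
    outputs = []
    for q, k, v in tokens:
        state, out = step(state, q, k, v)
        outputs.append(out)
    return state, outputs


token = ([1.0, 0.0], [1.0, 0.0], [2.0, 3.0])
short_state, _ = run([token] * 3, d=2)
long_state, _ = run([token] * 1000, d=2)
print(len(short_state), len(long_state))  # same state shape for both lengths
```

Because the state has a fixed size, each decoding step costs the same regardless of how many tokens came before. The trade-off is that all earlier context must be compressed into that fixed-size state.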

Researchers from the Guangdong Laboratory of Artificial Intelligence and Digital Economy (SZ) in Shenzhen, Hohai University in Nanjing, Shenzhen University, and Qinghai University in Xining have proposed a novel hybrid architecture called RWKV-X. It combines RWKV’s efficiency for short-range modeling with a sparse attention mechanism designed to capture long-range context. Unlike previous hybrid approaches, RWKV-X achieves linear-time complexity during training and constant-time complexity during inference decoding. It reaches near-perfect accuracy on the 64K passkey retrieval benchmark after continual pretraining on 64K-token sequences. The model consistently outperforms previous RWKV-7 models on long-context benchmarks while maintaining strong performance on short-context tasks.
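For readers unfamiliar with passkey retrieval, the sketch below shows the general shape of such a test: a short secret is buried at a random position in long filler text, and the model is asked to repeat it. The filler sentence, prompt wording, and helper names here are hypothetical and are not taken from the benchmark the authors used.

```python
import random


def build_passkey_prompt(n_filler: int, passkey: str, seed: int = 0) -> str:
    """Hide one passkey sentence at a random position among filler lines."""
    rng = random.Random(seed)
    filler = "The grass is green. The sky is blue. The sun is yellow."
    lines = [filler] * n_filler
    position = rng.randrange(n_filler + 1)
    lines.insert(position, f"The pass key is {passkey}. Remember it.")
    lines.append("What is the pass key?")
    return "\n".join(lines)


def is_correct(model_answer: str, passkey: str) -> bool:
    """Score an answer as correct if it contains the exact passkey."""
    return passkey in model_answer


prompt = build_passkey_prompt(n_filler=1000, passkey="48213")
print(len(prompt.split()), "words in prompt")
```

Increasing `n_filler` stretches the context, so accuracy as a function of prompt length shows where a model stops retrieving the passkey reliably.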

The authors present a two-stage training method for efficient preheating and fine-tuning of RWKV-X. In the first stage, they use short sequences (4096 tokens) to preheat the model quickly. They then perform multi-stage pretraining with increasing sequence lengths so that the model learns to process longer sequences gradually. The approach is inspired by LLaMA Pro's zero-initialization technique, in which newly added parameters for expanded layers are initialized to zero. LLaMA Pro trains in a single stage, which may lead to instability; RWKV-X instead adds a preheating stage to ensure stability.
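The reason zero-initialization helps can be shown with a toy residual block. In this sketch only the output projection of the new block is set to zero, which is an assumption about where the zeros go; the block structure and the names `residual_block` and `make_zero_init_block` are illustrative and do not reproduce the RWKV-X or LLaMA Pro code.

```python
import random
from typing import List, Tuple

Vector = List[float]
Matrix = List[List[float]]


def residual_block(x: Vector, w_in: Matrix, w_out: Matrix) -> Vector:
    """Compute x + w_out @ relu(w_in @ x)."""
    hidden = [max(0.0, sum(w * xi for w, xi in zip(row, x))) for row in w_in]
    update = [sum(w * h for w, h in zip(row, hidden)) for row in w_out]
    return [xi + ui for xi, ui in zip(x, update)]


def make_zero_init_block(d: int, d_hidden: int, seed: int = 0) -> Tuple[Matrix, Matrix]:
    """Random input projection, all-zero output projection."""
    rng = random.Random(seed)
    w_in = [[rng.gauss(0.0, 0.02) for _ in range(d)] for _ in range(d_hidden)]
    w_out = [[0.0] * d_hidden for _ in range(d)]
    return w_in, w_out


x = [1.0, -2.0, 0.5]
w_in, w_out = make_zero_init_block(d=3, d_hidden=8)
# At initialization the new block is an identity map on the residual stream.
print(residual_block(x, w_in, w_out) == x)
```

Because the new block initially passes its input through unchanged, the expanded model starts from the pretrained model's behavior, and training can move the zeroed weights away from zero gradually.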

Short-context evaluation shows that RWKV-X remains competitive on standard benchmarks. The smaller variant, RWKV-X (0.22B), achieves an average score of 51.0, comparable to RWKV-7’s 51.8. At a larger scale, RWKV-X (3.6B) reaches 71.9, close to RWKV-7 (2.9B, 72.8) and Qwen2.5-3B (71.4) and ahead of LLaMA3.2-3B (69.7). These results indicate that RWKV-X works as a general-purpose LLM backbone without sacrificing performance on shorter contexts. Efficiency analysis also shows favorable scaling on long sequences: at 128K tokens, RWKV-X achieves a 1.37x speedup over Flash-Attention v3, and the advantage grows as context length increases.

In summary, the paper introduces RWKV-X, a hybrid language model that combines RWKV’s efficiency at modeling short-range dependencies with a novel sparse attention mechanism designed for long-range context modeling. While RWKV-X demonstrates strong performance and efficiency in long-context language modeling, several limitations remain. First, its sparse attention mechanism relies on top-k chunk selection, a heuristic that may overlook semantically relevant dependencies. Second, in the current implementation sparse attention decoding runs slower than vanilla RWKV, so further engineering work is needed to optimize performance.
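Top-k chunk selection can be sketched as follows. The scoring rule used here (dot product between the query and each chunk's mean key) is an assumption made for illustration, as are the function name and parameters; the article does not specify how RWKV-X scores chunks.

```python
from typing import List

Vector = List[float]


def select_top_k_chunks(query: Vector, keys: List[Vector],
                        chunk_size: int, k: int) -> List[int]:
    """Return the indices of the k chunks whose mean key best matches the query."""
    chunks = [keys[i:i + chunk_size] for i in range(0, len(keys), chunk_size)]
    scored = []
    for idx, chunk in enumerate(chunks):
        # Summarize the chunk by the average of its keys (assumed scoring rule).
        mean_key = [sum(key[j] for key in chunk) / len(chunk)
                    for j in range(len(query))]
        score = sum(q * m for q, m in zip(query, mean_key))
        scored.append((score, idx))
    top = sorted(scored, reverse=True)[:k]
    return sorted(idx for _, idx in top)


query = [1.0, 0.0]
keys = [[0.0, 1.0], [0.0, 1.0],    # chunk 0: unrelated to the query
        [5.0, 0.0], [5.0, 0.0],    # chunk 1: strongly aligned
        [1.0, 0.0], [1.0, 0.0],    # chunk 2: weakly aligned
        [-1.0, 0.0], [-1.0, 0.0]]  # chunk 3: opposed
print(select_top_k_chunks(query, keys, chunk_size=2, k=2))
```

Attention is then computed only over tokens in the selected chunks. The sketch also hints at the limitation noted above: a chunk summary can dilute a single relevant token among unrelated ones, so that chunk may miss the cut.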
