RWKV-X: A Linear-Time Long-Context Language Model

2025/05/06 02:09

LLMs built on Transformer architectures face significant scaling challenges when processing long-context inputs because of their quadratic complexity in sequence length. Linear Attention models, State Space Models such as Mamba, linear RNNs such as DeltaNet, and RWKV address this problem with costs that grow linearly in sequence length. However, these linear architectures struggle with long-context understanding. For instance, RWKV-7 (2.9B) achieves high accuracy on passkey retrieval up to 28K tokens but degrades rapidly beyond that point. Even with continual pretraining on 128K-length data, the long-context limitations persist. The issue extends beyond RWKV to other architectures such as Mamba, presenting a fundamental challenge for this class of models.
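
As a rough illustration of the scaling gap described above (a sketch with assumed cost constants, not figures from the paper), the snippet below compares how the cost of full self-attention grows with sequence length n against a recurrent or linear-attention update:

```python
# Illustrative cost model only (assumed constants, not from the paper):
# full self-attention does O(n^2 * d) work for the score/value products,
# while a recurrent or linear-attention update does O(n * d^2) work.

def attention_flops(n: int, d: int) -> int:
    return 2 * n * n * d          # QK^T plus attention-weighted V

def recurrent_flops(n: int, d: int) -> int:
    return 4 * n * d * d          # fixed-size state update per token

d = 2048
for n in (4_096, 32_768, 131_072):
    ratio = attention_flops(n, d) / recurrent_flops(n, d)
    print(f"n={n:>7}: attention cost is ~{ratio:.0f}x the recurrent cost")
```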

Linear-complexity language models are emerging as alternatives to Transformer-based architectures, which suffer from quadratic computational demands when processing long sequences. The RWKV model series combines Transformer-style parallelizability during training with an RNN-like recurrent state representation, and has evolved through multiple iterations, from the foundational RWKV-4 to RWKV-5, RWKV-6, and RWKV-7. Hybrid language models such as Jamba, Zamba, and MiniMax each combine attention and recurrent components in their own way. Separately, Native Sparse Attention (NSA) organizes tokens into temporal blocks with three distinct attention paths: compressed coarse-grained tokens, selectively retained fine-grained tokens, and sliding windows for local contextual information. Other sparse-attention approaches include SeerAttention and Mixture of Block Attention (MoBA).
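
To make NSA's three paths concrete, here is a minimal sketch (the block size, top-k value, window size, and random block scores are illustrative assumptions, not NSA's actual implementation) of which keys a single query position can see: per-block compressed summaries, a few fully retained top-scoring blocks, and a local sliding window.

```python
import numpy as np

# Illustrative sketch of NSA-style routing for one query position.
# Block size, top_k, window, and the random block scores are assumptions.

def visible_keys(q_pos: int, block: int = 64, top_k: int = 2, window: int = 128):
    n_blocks = (q_pos + 1 + block - 1) // block        # causal blocks only
    rng = np.random.default_rng(0)
    block_scores = rng.random(n_blocks)                # stand-in for q · block-summary scores

    compressed = [f"summary of block {b}" for b in range(n_blocks)]
    selected = [f"all tokens of block {b}" for b in np.argsort(block_scores)[-top_k:]]
    local = [f"tokens {max(0, q_pos - window + 1)}..{q_pos}"]
    return compressed, selected, local

comp, sel, loc = visible_keys(q_pos=1000)
print(len(comp), "compressed block summaries;", sel, ";", loc)
```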

Researchers from the Guangdong Laboratory of Artificial Intelligence and Digital Economy (SZ) in Shenzhen, Hohai University in Nanjing, Shenzhen University, and Qinghai University in Xining have proposed a novel hybrid architecture called RWKV-X that combines RWKV's efficiency for short-range modeling with a sparse attention mechanism designed to capture long-range context. Unlike previous hybrid approaches, RWKV-X achieves linear-time complexity during training and constant-time complexity during inference decoding. It shows near-perfect accuracy on the 64K passkey retrieval benchmark when continually pretrained on 64K-token sequences. The model consistently outperforms previous RWKV-7 models on long-context benchmarks while maintaining strong performance on short-context tasks.
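
The paper's exact layer layout is not reproduced here; the following is only a structural sketch of the general idea of interleaving recurrent RWKV-style blocks with occasional long-range attention blocks. `RWKVBlock`, `SparseAttentionBlock`, and the interleaving ratio are placeholders, and a dense `nn.MultiheadAttention` stands in for the sparse mechanism purely for brevity.

```python
import torch
import torch.nn as nn

class RWKVBlock(nn.Module):
    """Placeholder for an RWKV time-mix/channel-mix block."""
    def __init__(self, dim: int):
        super().__init__()
        self.mix = nn.Linear(dim, dim)

    def forward(self, x):
        return x + torch.tanh(self.mix(x))

class SparseAttentionBlock(nn.Module):
    """Placeholder long-range block; a real one would attend only to selected chunks."""
    def __init__(self, dim: int):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)

    def forward(self, x):
        out, _ = self.attn(x, x, x, need_weights=False)
        return x + out

class HybridModel(nn.Module):
    """Interleaves recurrent blocks with an attention block every few layers."""
    def __init__(self, dim: int = 256, n_layers: int = 12, attn_every: int = 4):
        super().__init__()
        self.layers = nn.ModuleList(
            SparseAttentionBlock(dim) if (i + 1) % attn_every == 0 else RWKVBlock(dim)
            for i in range(n_layers)
        )

    def forward(self, x):
        for layer in self.layers:
            x = layer(x)
        return x

x = torch.randn(1, 1024, 256)           # (batch, sequence, hidden)
print(HybridModel()(x).shape)           # torch.Size([1, 1024, 256])
```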

The authors present an efficient two-stage method for warming up and fine-tuning RWKV-X. In the first stage, short sequences (4,096 tokens) are used to warm the model up quickly. Subsequently, multi-stage pretraining with increasing sequence lengths lets the model gradually learn to process longer inputs. The approach is inspired by LLaMA Pro's zero-initialization technique, in which the newly added parameters of expanded layers are initialized to zero. In contrast to LLaMA Pro's single-stage training, which may lead to instability, RWKV-X adopts a two-stage approach with a dedicated warm-up stage to ensure stability.
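
A minimal sketch of the two ideas in this paragraph follows; the block class, the assumption that each new block exposes an output projection that can be zeroed, and the intermediate schedule lengths are illustrative placeholders rather than the authors' code.

```python
import torch.nn as nn

class NewSparseBlock(nn.Module):
    """Placeholder for a newly inserted long-range attention block (assumed structure)."""
    def __init__(self, dim: int):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.out_proj = nn.Linear(dim, dim)

    def forward(self, x):
        out, _ = self.attn(x, x, x, need_weights=False)
        return x + self.out_proj(out)    # residual branch

def zero_init(block: NewSparseBlock) -> NewSparseBlock:
    # LLaMA Pro-style zero initialization: zeroing the output projection makes
    # the new block's residual contribution start at exactly zero, so the
    # expanded model initially behaves like the original base model.
    nn.init.zeros_(block.out_proj.weight)
    nn.init.zeros_(block.out_proj.bias)
    return block

block = zero_init(NewSparseBlock(dim=256))

# Stage 1 warms up on short sequences; later stages grow the context length
# (intermediate lengths below are assumed; the text only gives 4,096 and 64K).
length_schedule = [4_096, 16_384, 32_768, 65_536]
for stage, seq_len in enumerate(length_schedule, start=1):
    print(f"stage {stage}: continual pretraining at sequence length {seq_len}")
    # train(model, seq_len=seq_len)      # training loop omitted
```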

Short-context evaluation shows that RWKV-X maintains competitive performance across standard benchmarks. The smaller variant, RWKV-X (0.22B), achieves an average score of 51.0, comparable to RWKV-7's 51.8. At a larger scale, RWKV-X (3.6B) reaches 71.9, closely matching RWKV-7 (2.9B, 72.8) and Qwen2.5-3B (71.4) while surpassing LLaMA3.2-3B (69.7). These results confirm RWKV-X's effectiveness as a general-purpose LLM backbone that does not sacrifice performance on shorter contexts. Efficiency analysis further demonstrates RWKV-X's superior scaling on long sequences: at 128K tokens it achieves a 1.37× speedup over Flash-Attention v3, and the advantage widens as context length increases.

In this paper, the researchers introduce RWKV-X, a hybrid language model that combines RWKV's efficiency at modeling short-range dependencies with a novel sparse attention mechanism designed specifically for long-range context modeling. While RWKV-X demonstrates strong performance and efficiency in long-context language modeling, several limitations remain. First, its sparse attention mechanism relies on top-k chunk selection, a heuristic that may overlook semantically relevant dependencies. Second, in the current implementation, sparse-attention decoding runs slower than vanilla RWKV, indicating that further engineering work is needed to optimize performance.
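
For concreteness, here is a minimal NumPy sketch of the kind of top-k chunk-selection heuristic the first limitation refers to; the chunk size, the scoring rule (query against each chunk's mean key), and all names are illustrative assumptions, not the model's actual code.

```python
import numpy as np

# Illustrative top-k chunk selection (assumed chunk size and mean-key scoring).

def select_chunks(query: np.ndarray, keys: np.ndarray, chunk: int = 64, k: int = 4):
    n, d = keys.shape
    n_chunks = n // chunk
    summaries = keys[: n_chunks * chunk].reshape(n_chunks, chunk, d).mean(axis=1)
    scores = summaries @ query                       # one relevance score per chunk
    kept = sorted(np.argsort(scores)[-k:].tolist())  # indices of the k best chunks
    return kept, scores

rng = np.random.default_rng(0)
query = rng.standard_normal(128)
keys = rng.standard_normal((8_192, 128))
kept, _ = select_chunks(query, keys)
print("chunks kept for full attention:", kept)
```

Because a chunk's mean key can dilute a single highly relevant token, a heuristic of this kind can miss semantically important dependencies, which is exactly the limitation noted above.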

Check out the Paper. Also, don’t forget to follow us on Twitter.

Here’s a brief overview of what we’re building at Marktechpost:

ML News Community - r/machinelearningnews (92k+ members)

Newsletter - airesearchinsights.com/ (30k+ subscribers)

miniCON AI Events - minicon.marktechpost.com

AI Reports & Magazines - magazine.marktechpost.com

AI Dev & Research News - marktechpost.
