RWKV-X: Linear-Time Long-Context Language Model

May 06, 2025 at 02:09 am

LLMs built on Transformer architectures face significant scaling challenges due to their quadratic complexity in sequence length when processing long-context inputs. Linear attention models, state space models such as Mamba, and linear RNNs such as DeltaNet and RWKV address this cost by scaling linearly with sequence length. However, these linear architectures struggle with long-context understanding. For instance, RWKV-7 (2.9B) achieves high accuracy on passkey retrieval up to 28K tokens, but its performance degrades rapidly beyond that point. Even continual pretraining on 128K-length data does not remove the limitation. The issue extends beyond RWKV to other architectures such as Mamba, which makes it a fundamental challenge for this class of models.
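To make the scaling gap concrete, here is a rough back-of-the-envelope sketch. It only counts operations: query-key pairs for full causal self-attention versus one state update per token for a recurrent model. The function names are made up for illustration, and the counts ignore hidden size, constant factors, and hardware effects, so they are not measurements of any real model.

```python
def attention_pair_count(n_tokens: int) -> int:
    """Query-key pairs scored by full causal self-attention over n_tokens."""
    # Token i attends to itself and all earlier tokens: 1 + 2 + ... + n.
    return n_tokens * (n_tokens + 1) // 2


def recurrent_step_count(n_tokens: int) -> int:
    """State updates made by a recurrent, linear-time model over n_tokens."""
    return n_tokens


# Illustrative context lengths only; not taken from any benchmark run.
for n in (4_096, 28_000, 128_000):
    pairs = attention_pair_count(n)
    steps = recurrent_step_count(n)
    print(f"{n:>7,} tokens: {pairs:>13,} pairs vs {steps:>7,} steps")
```

Doubling the context roughly quadruples the pair count while only doubling the number of recurrent steps, which is the gap the linear architectures are trying to exploit.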

Linear-complexity language models are emerging as alternatives to Transformer-based architectures, whose computational demands grow quadratically when processing long sequences. The RWKV model series combines Transformer-style parallelizable training with an RNN-like recurrent state representation. RWKV has evolved through multiple iterations, starting with the foundational RWKV-4 and progressing to RWKV-5, RWKV-6, and RWKV-7. Hybrid language models, including Jamba, Zamba, and MiniMax, each combine attention layers with linear-complexity layers in their own way. Additionally, Native Sparse Attention (NSA) organizes tokens into temporal blocks with three distinct attention paths: compressed coarse-grained tokens, selectively retained fine-grained tokens, and sliding windows for local contextual information. Other sparse attention methods include SeerAttention and Mixture of Block Attention (MoBA).

Researchers from the Guangdong Laboratory of Artificial Intelligence and Digital Economy (SZ), Shenzhen; Hohai University, Nanjing; Shenzhen University; and Qinghai University, Xining, have proposed a novel hybrid architecture called RWKV-X that combines RWKV’s efficiency for short-range modeling with a sparse attention mechanism designed to capture long-range context. Unlike previous hybrid approaches, RWKV-X achieves linear-time complexity during training and constant-time complexity per token during inference decoding. When continually pretrained on 64K-token sequences, it shows near-perfect accuracy on the 64K passkey retrieval benchmark. The model consistently outperforms previous RWKV-7 models on long-context benchmarks while maintaining strong performance on short-context tasks.

The authors present a two-stage training method for RWKV-X: a preheating stage followed by long-context pretraining. In the first stage, they use short sequences (4096 tokens) to preheat the model quickly. In the second, they pretrain over several steps with increasing sequence lengths, so the model learns to process longer sequences gradually. The method is inspired by LLaMA Pro's zero-initialization technique, in which the newly added parameters of expanded layers are initialized to zero. LLaMA Pro trains in a single stage, which may lead to instability; RWKV-X instead adds the preheating stage to keep training stable.
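The zero-initialization idea can be illustrated with a toy residual model. This is a minimal sketch, not the RWKV-X or LLaMA Pro code: the block structure (`x + w_out @ tanh(x)`), the helper names, and the weights are all assumptions chosen for illustration. It shows why a newly added block whose output projection is all zeros leaves the pretrained model's outputs unchanged at the start of training.

```python
import math
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def residual_block(x: Vector, w_out: Matrix) -> Vector:
    """Toy residual block (hypothetical): y = x + w_out @ tanh(x)."""
    hidden = [math.tanh(v) for v in x]
    update = [sum(w * h for w, h in zip(row, hidden)) for row in w_out]
    return [xi + ui for xi, ui in zip(x, update)]


def run_model(x: Vector, blocks: List[Matrix]) -> Vector:
    """Apply each block in order; each block is represented by its w_out."""
    for w_out in blocks:
        x = residual_block(x, w_out)
    return x


def zero_block(dim: int) -> Matrix:
    """A newly added block whose output projection starts at zero."""
    return [[0.0] * dim for _ in range(dim)]


# Made-up "pretrained" weights, then the same model with one expanded block.
pretrained = [[[0.5, -0.2], [0.1, 0.3]]]
expanded = pretrained + [zero_block(2)]

x = [1.0, -2.0]
# The zero-initialized block contributes nothing, so the outputs match.
assert run_model(x, expanded) == run_model(x, pretrained)
```

Because the expanded model starts out computing the same function as the pretrained one, training can begin from the pretrained behavior rather than from a randomly perturbed model.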

Short-context evaluation shows that RWKV-X remains competitive across standard benchmarks. The smaller variant, RWKV-X (0.22B), achieves an average score of 51.0, comparable to RWKV-7’s 51.8. At a larger scale, RWKV-X (3.6B) reaches 71.9, closely matching RWKV-7 (2.9B, 72.8) and Qwen2.5-3B (71.4), while surpassing LLaMA3.2-3B (69.7). These results indicate that RWKV-X works as a general-purpose LLM backbone without sacrificing performance on shorter contexts. Efficiency analysis also shows favorable scaling on long sequences: at 128K tokens, RWKV-X achieves a 1.37x speedup over Flash-Attention v3, and the advantage grows as context length increases.

In this paper, the researchers introduce RWKV-X, a hybrid language model that combines RWKV’s efficiency for modeling short-range dependencies with a novel sparse attention mechanism designed specifically for long-range context modeling. While RWKV-X demonstrates strong performance and efficiency in long-context language modeling, several limitations remain. First, its sparse attention mechanism relies on top-k chunk selection, a heuristic that may overlook semantically relevant dependencies. Second, in the current implementation, sparse attention decoding runs slower than vanilla RWKV, so further engineering work is needed to optimize performance.
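Top-k chunk selection, and the way it can miss relevant tokens, can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the RWKV-X implementation: the article does not say how chunks are scored, so scoring each chunk by the dot product of the query with the chunk's mean key is an assumption, as are the function names. Causal masking and batching are left out.

```python
import math
from typing import List, Tuple

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def topk_chunk_attention(
    query: Vector, keys: List[Vector], values: List[Vector],
    chunk_size: int, k: int,
) -> Tuple[Vector, List[int]]:
    """Attend only to tokens in the k highest-scoring chunks (hypothetical sketch)."""
    chunks = [list(range(start, min(start + chunk_size, len(keys))))
              for start in range(0, len(keys), chunk_size)]

    def chunk_score(token_ids: List[int]) -> float:
        # Assumed heuristic: score a chunk by query . mean(keys in chunk).
        mean_key = [sum(keys[i][d] for i in token_ids) / len(token_ids)
                    for d in range(len(query))]
        return dot(query, mean_key)

    ranked = sorted(range(len(chunks)),
                    key=lambda c: chunk_score(chunks[c]), reverse=True)
    selected = sorted(ranked[:k])
    token_ids = [i for c in selected for i in chunks[c]]

    # Standard scaled softmax attention, restricted to the selected tokens.
    scale = 1.0 / math.sqrt(len(query))
    logits = [dot(query, keys[i]) * scale for i in token_ids]
    peak = max(logits)
    weights = [math.exp(logit - peak) for logit in logits]
    total = sum(weights)
    output = [sum(w * values[i][d] for w, i in zip(weights, token_ids)) / total
              for d in range(len(values[0]))]
    return output, selected


query = [1.0, 0.0]
keys = [[0.0, 1.0], [0.0, 1.0], [5.0, 0.0], [5.0, 0.0]]
values = [[1.0], [1.0], [9.0], [9.0]]
output, selected = topk_chunk_attention(query, keys, values, chunk_size=2, k=1)
print(selected, output)
```

The heuristic's blind spot shows up with keys such as `[[4, 0], [-4, 0], [1, 0], [1, 0]]` and the same query: the first chunk's mean key cancels to zero, so with `k=1` the second chunk is selected even though the single best-matching token sits in the first.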
