RWKV-X: Linear-Time Long-Context Language Model

May 06, 2025 at 02:09 am

LLMs built on Transformer architectures face significant scaling challenges on long-context inputs, because the cost of self-attention grows quadratically with sequence length. Linear Attention models, State Space Models like Mamba, Linear RNNs like DeltaNet, and RWKV address this cost by scaling linearly with sequence length. However, these linear architectures struggle with long-context understanding. For instance, RWKV-7 (2.9B) achieves high accuracy on passkey retrieval up to 28K tokens but degrades rapidly beyond that point. Even with continual pretraining on 128K-length data, the long-context limitations persist. The issue is not specific to RWKV: it extends to other architectures such as Mamba, which makes it a fundamental challenge for this class of models.
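To make the scaling gap concrete, here is a rough counting sketch. It is a simplification, not a benchmark: it assumes causal full attention scores every (query, earlier key) pair, assumes a linear recurrent model performs one state update per token, and ignores hidden size and constant factors. The function names are made up for this illustration.

```python
def attention_pair_count(seq_len: int) -> int:
    """Pairs scored by causal full attention: token i attends to tokens 0..i."""
    return seq_len * (seq_len + 1) // 2


def recurrent_step_count(seq_len: int) -> int:
    """State updates made by a linear recurrent model: one per token."""
    return seq_len


if __name__ == "__main__":
    # Context lengths chosen to echo the ones discussed in the article.
    for n in (4_096, 28_672, 131_072):
        ratio = attention_pair_count(n) / recurrent_step_count(n)
        print(f"{n:>7} tokens: {ratio:,.1f} attention pairs per recurrent update")
```

Doubling the sequence length roughly quadruples the attention count while only doubling the recurrent count, which is the gap the linear architectures are designed to close.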

Linear-complexity language models are emerging as alternatives to Transformer-based architectures, which suffer from quadratic computational demands when processing long sequences. The RWKV model series combines Transformer-style parallelizability during training with an RNN-like recurrent state representation. RWKV has evolved through multiple iterations, starting with the foundational RWKV-4 and progressing to RWKV-5, RWKV-6, and RWKV-7. Hybrid language models, including Jamba, Zamba, and MiniMax, each mix attention layers with linear-complexity layers in their own way. Additionally, Native Sparse Attention (NSA) organizes tokens into temporal blocks with three distinct attention paths: compressed coarse-grained tokens, selectively retained fine-grained tokens, and sliding windows for local contextual information. Other sparse attention methods include SeerAttention and Mixture of Block Attention (MoBA).

Researchers from Guangdong Laboratory of Artificial Intelligence and Digital Economy (SZ), Shenzhen; Hohai University, Nanjing; Shenzhen University; and Qinghai University, Xining, have proposed a novel hybrid architecture called RWKV-X that combines RWKV’s efficiency for short-range modeling with a sparse attention mechanism designed to capture long-range context. Unlike previous hybrid approaches, RWKV-X achieves linear-time complexity during training and constant-time complexity per token during inference decoding. When continually pretrained on 64K-token sequences, it shows near-perfect accuracy on the 64K passkey retrieval benchmark. The model consistently outperforms previous RWKV-7 models on long-context benchmarks while maintaining strong performance on short-context tasks.

The authors present a two-stage training method for efficient preheating and fine-tuning of RWKV-X. In the first stage, they use short sequences (4,096 tokens) to preheat the model quickly. In the second stage, they perform multi-stage pretraining with progressively longer sequences, so that the model learns to process longer inputs gradually. This approach is inspired by LLaMA Pro's zero-initialization technique, where newly added parameters for expanded layers are initialized to zero. In contrast to LLaMA Pro's single-stage training, which may lead to instability, RWKV-X adds the preheating stage to keep training stable.
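The point of zero initialization is that a newly added block starts out as an identity mapping, so expanding the model does not change its outputs before training begins. The sketch below illustrates this with a toy residual block in plain Python. The `ExpandedBlock` class, its ReLU nonlinearity, and the choice to zero only the output projection are assumptions made for illustration, not the layer layout used by RWKV-X or LLaMA Pro.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


class ExpandedBlock:
    """Toy residual block: y = x + W_out @ relu(W_in @ x).

    Hypothetical stand-in for a newly inserted layer.
    """

    def __init__(self, dim, w_in):
        self.w_in = w_in
        # Zero-initialized output projection: the residual branch adds nothing
        # at the start of training, so the block behaves as an identity.
        self.w_out = [[0.0] * dim for _ in range(dim)]

    def __call__(self, x):
        hidden = [max(0.0, h) for h in matvec(self.w_in, x)]
        update = matvec(self.w_out, hidden)
        return [xi + ui for xi, ui in zip(x, update)]


if __name__ == "__main__":
    block = ExpandedBlock(dim=2, w_in=[[1.0, 2.0], [3.0, -1.0]])
    print(block([0.5, -1.0]))  # identical to the input while w_out is all zeros
```

Once training moves `w_out` away from zero, the block starts contributing to the output; until then the expanded model reproduces the original one exactly.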

Short-context evaluation shows that RWKV-X maintains competitive performance across standard benchmarks. The smaller variant, RWKV-X (0.22B), achieves an average score of 51.0, comparable to RWKV-7’s 51.8. At a larger scale, RWKV-X (3.6B) reaches 71.9, closely matching RWKV-7 (2.9B, 72.8) and Qwen2.5-3B (71.4), while surpassing LLaMA3.2-3B (69.7). These results confirm RWKV-X’s effectiveness as a general-purpose LLM backbone that does not sacrifice performance on shorter contexts. Moreover, efficiency analysis demonstrates RWKV-X’s superior scaling characteristics for long sequences. At 128K tokens, RWKV-X achieves a 1.37x speedup over Flash-Attention v3, and this advantage grows as context length increases.

In this paper, the researchers introduce RWKV-X, a hybrid language model that combines RWKV’s efficiency for modeling short-range dependencies with a novel sparse attention mechanism designed specifically for long-range context modeling. While RWKV-X demonstrates strong performance and efficiency in long-context language modeling, several limitations remain. First, its sparse attention mechanism relies on top-k chunk selection, a heuristic that may overlook semantically relevant dependencies. Second, in the current implementation, sparse attention decoding runs slower than vanilla RWKV, which indicates that further engineering work is needed to optimize performance.
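A minimal sketch of top-k chunk selection may help show where the first limitation comes from. It assumes each chunk is scored by the dot product between the query and the chunk's mean key, which is an illustrative choice and not necessarily the scoring rule used in RWKV-X. The function names and parameters are hypothetical.

```python
import math


def select_top_k_chunks(query, keys, chunk_size, k):
    """Return indices of the k chunks whose mean key best matches the query."""
    scores = []
    for start in range(0, len(keys), chunk_size):
        chunk = keys[start:start + chunk_size]
        mean_key = [sum(col) / len(chunk) for col in zip(*chunk)]
        scores.append(sum(q * m for q, m in zip(query, mean_key)))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


def sparse_attention(query, keys, values, chunk_size, k):
    """Softmax attention restricted to the tokens inside the selected chunks."""
    chosen = select_top_k_chunks(query, keys, chunk_size, k)
    positions = [
        p
        for c in chosen
        for p in range(c * chunk_size, min((c + 1) * chunk_size, len(keys)))
    ]
    scale = math.sqrt(len(query))
    logits = [sum(q * x for q, x in zip(query, keys[p])) / scale for p in positions]
    peak = max(logits)  # subtract the maximum for numerical stability
    weights = [math.exp(logit - peak) for logit in logits]
    total = sum(weights)
    dim = len(values[0])
    return [
        sum(w * values[p][d] for w, p in zip(weights, positions)) / total
        for d in range(dim)
    ]


if __name__ == "__main__":
    keys = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0], [-1.0, 0.0], [-1.0, 0.0]]
    values = [[1.0], [1.0], [5.0], [5.0], [9.0], [9.0]]
    print(select_top_k_chunks([1.0, 0.0], keys, chunk_size=2, k=2))
    print(sparse_attention([1.0, 0.0], keys, values, chunk_size=2, k=1))
```

Because whole chunks are kept or dropped based on a summary score, a single relevant token inside a chunk with a low average score is never attended to, which is the kind of overlooked dependency the authors describe.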
