Cryptocurrency News

RWKV-X: A Linear-Time Long-Context Language Model

2025/05/06 02:09

LLMs built on Transformer architectures face significant scaling challenges due to their quadratic complexity in sequence length when processing long-context inputs. Linear Attention models, State Space Models like Mamba, Linear RNNs like DeltaNet, and RWKV solve this problem. However, these linear architectures struggle with long-context understanding. For instance, RWKV-7 (2.9B) achieves high accuracy on passkey retrieval up to 28K tokens but experiences rapid performance degradation beyond this point. Even with continual pretraining using 128K-length data, long-context limitations persist. This issue extends beyond RWKV to other architectures like Mamba, presenting a fundamental challenge for this class of models.
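
To make the scaling argument concrete, the back-of-envelope sketch below compares the rough multiply-add count of full self-attention, which grows with the square of the sequence length, against a per-token recurrent update whose cost grows only linearly. The cost formulas and the model width of 2048 are illustrative assumptions, not figures from the paper.

```python
# Back-of-envelope comparison: full self-attention vs. a per-token recurrent update.
# The cost formulas and the width d_model = 2048 are illustrative assumptions.

def self_attention_cost(seq_len: int, d_model: int) -> float:
    """QK^T and the attention-weighted sum over V each cost ~seq_len^2 * d_model."""
    return 2.0 * seq_len ** 2 * d_model

def recurrent_scan_cost(seq_len: int, d_model: int) -> float:
    """A fixed-size state update costs a constant amount of work per token."""
    return float(seq_len) * d_model ** 2  # rough: one d_model x d_model update per token

for n in (4_096, 28_000, 128_000):
    ratio = self_attention_cost(n, 2048) / recurrent_scan_cost(n, 2048)
    print(f"seq_len={n:>7}: attention / recurrence cost ratio ≈ {ratio:,.1f}x")
```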

Linear complexity language models are emerging as alternatives to transformer-based architectures, which suffer from quadratic computational demands when processing long sequences. The RWKV model series combines Transformer parallelizability during training with RNN-like recurrent state representation. RWKV has evolved through multiple iterations, starting with the foundational RWKV-4 and progressing to RWKV-5, RWKV-6, and RWKV-7. Hybrid language models such as Jamba, Zamba, and MiniMax each combine full attention with recurrent, state-space, or linear-attention components in their own way. Additionally, Native Sparse Attention (NSA) organizes tokens into temporal blocks with three distinct attention paths: compressed coarse-grained tokens, selectively retained fine-grained tokens, and sliding windows for local contextual information. Other sparse attention approaches include SeerAttention and Mixture of Block Attention (MoBA).
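
The paragraph above contrasts attention with RNN-like recurrent state representation; the sketch below shows the general shape of such a recurrence: a fixed-size state that is decayed and updated once per token, so memory stays constant and each step costs the same amount of work. This is a generic linear-attention-style recurrence for illustration, not RWKV-7's actual time-mixing rule.

```python
import torch

def linear_recurrent_scan(q, k, v, decay=0.97):
    """
    Generic linear-attention-style recurrence (illustrative, not RWKV-7's exact rule).
    The running state is a fixed d x d matrix, so memory does not grow with the
    sequence length and every step does the same amount of work.

    q, k, v: tensors of shape (seq_len, d).
    Returns per-token outputs of shape (seq_len, d).
    """
    seq_len, d = k.shape
    state = torch.zeros(d, d)                              # constant-size recurrent state
    outputs = []
    for t in range(seq_len):
        state = decay * state + torch.outer(k[t], v[t])    # decay old context, add new
        outputs.append(q[t] @ state)                       # read the state with the query
    return torch.stack(outputs)

out = linear_recurrent_scan(torch.randn(16, 8), torch.randn(16, 8), torch.randn(16, 8))
print(out.shape)  # torch.Size([16, 8])
```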

Researchers from Guangdong Laboratory of Artificial Intelligence and Digital Economy (SZ), Shenzhen, Hohai University, Nanjing, Shenzhen University, and Qinghai University, Xining, have proposed a novel hybrid architecture called RWKV-X that combines RWKV’s efficiency for short-range modeling with a sparse attention mechanism designed to capture long-range context. Unlike previous hybrid approaches, RWKV-X achieves linear-time complexity during training and constant-time complexity during inference decoding. It shows near-perfect accuracy on the 64K passkey retrieval benchmark when continually pretrained on 64K-token sequences. The model consistently outperforms previous RWKV-7 models on long-context benchmarks while maintaining strong performance on short-context tasks.
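
The hybrid design described above interleaves RWKV-style blocks for cheap short-range modeling with sparse-attention blocks for long-range context. The PyTorch sketch below shows one plausible way to wire such a stack; the block internals are placeholders and the one-attention-block-in-four ratio is an assumption, not the configuration reported in the paper.

```python
import torch
import torch.nn as nn

class RecurrentBlock(nn.Module):
    """Stand-in for an RWKV-style time-mixing block (placeholder: a gated residual MLP)."""
    def __init__(self, d):
        super().__init__()
        self.norm = nn.LayerNorm(d)
        self.proj = nn.Linear(d, d)
    def forward(self, x):
        return x + torch.sigmoid(self.proj(self.norm(x))) * x

class SparseAttentionBlock(nn.Module):
    """Stand-in for a long-range sparse-attention block (placeholder: full attention)."""
    def __init__(self, d, n_heads=4):
        super().__init__()
        self.norm = nn.LayerNorm(d)
        self.attn = nn.MultiheadAttention(d, n_heads, batch_first=True)
    def forward(self, x):
        h = self.norm(x)
        out, _ = self.attn(h, h, h, need_weights=False)
        return x + out

class HybridStack(nn.Module):
    """Interleave cheap recurrent blocks with occasional long-range attention blocks.
    The 1-in-4 attention ratio is an illustrative assumption."""
    def __init__(self, d=64, depth=8, attn_every=4):
        super().__init__()
        self.blocks = nn.ModuleList(
            SparseAttentionBlock(d) if (i + 1) % attn_every == 0 else RecurrentBlock(d)
            for i in range(depth)
        )
    def forward(self, x):
        for blk in self.blocks:
            x = blk(x)
        return x

x = torch.randn(2, 32, 64)          # (batch, seq_len, d_model)
print(HybridStack()(x).shape)       # torch.Size([2, 32, 64])
```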

The authors present a two-stage training method for efficiently warming up and then fine-tuning RWKV-X. In the first stage, they use short sequences (4096 tokens) to warm up the model quickly. Subsequently, they perform multi-stage pretraining with increasing sequence lengths to enable the model to process longer sequences gradually. This approach is inspired by LLaMA Pro's zero-initialization technique, where newly added parameters for expanded layers are initialized to zero. In contrast to LLaMA Pro's single-stage training, which may lead to instability, RWKV-X adopts a two-stage approach with a warm-up stage to ensure stability.
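
A minimal sketch of the two ingredients described above: zero-initializing the output projection of newly added blocks, so the expanded model initially behaves like the base model in the spirit of LLaMA Pro, and a staged schedule of increasing sequence lengths. The 4096-token warm-up length and the 64K final length come from the article; the intermediate stage length is an assumption for illustration.

```python
import torch.nn as nn

def zero_init_expanded_layer(linear: nn.Linear) -> nn.Linear:
    """Zero the output projection of a newly added block so the expanded model
    initially computes the same function as the base model (LLaMA Pro-style)."""
    nn.init.zeros_(linear.weight)
    if linear.bias is not None:
        nn.init.zeros_(linear.bias)
    return linear

new_block_out_proj = zero_init_expanded_layer(nn.Linear(2048, 2048))

# Staged pretraining schedule with increasing sequence lengths. The 4096-token
# warm-up and the 64K final stage follow the article; the 16K stage is assumed.
sequence_length_schedule = [
    ("warm-up",      4_096),
    ("long-context", 16_384),
    ("long-context", 65_536),
]
for stage, seq_len in sequence_length_schedule:
    print(f"{stage:>12}: pretrain on sequences of {seq_len} tokens")
```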

Short-context evaluation reveals that RWKV-X maintains competitive performance across standard benchmarks. The smaller variant, RWKV-X (0.22B), achieves an average score of 51.0, comparable to RWKV-7’s 51.8. At a larger scale, RWKV-X (3.6B) reaches 71.9, closely matching RWKV-7 (2.9B, 72.8) and Qwen2.5-3B (71.4), while surpassing LLaMA3.2-3B (69.7). These results confirm RWKV-X’s effectiveness as a general-purpose LLM backbone without sacrificing performance on shorter contexts. Moreover, efficiency analysis demonstrates RWKV-X’s superior scaling characteristics for long sequences. At 128K tokens, RWKV-X achieves a 1.37× speedup over Flash-Attention v3, with this advantage expanding as context length increases.

In this paper, researchers introduced RWKV-X, which emerges as a hybrid language model that successfully combines RWKV’s efficiency for modeling short-range dependencies with a novel sparse attention mechanism designed specifically for long-range context modeling. While RWKV-X demonstrates strong performance and efficiency in long-context language modeling, several limitations remain. First, its sparse attention mechanism, which relies on top-k chunk selection, employs a heuristic approach that may overlook semantically relevant dependencies. Second, the current implementation shows sparse attention decoding running slower than vanilla RWKV, indicating that further engineering efforts are needed to optimize performance.
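
To illustrate the heuristic that the limitation above refers to, the sketch below implements one simple form of top-k chunk selection: the cached keys are split into fixed-size chunks, each chunk is scored by the dot product between the current query and the chunk's mean key, and softmax attention runs only over the tokens of the k best-scoring chunks. The chunk size, the scoring rule, and the function name are assumptions for illustration, not RWKV-X's exact mechanism.

```python
import torch
import torch.nn.functional as F

def topk_chunk_attention(q, k, v, chunk_size=16, top_k=2):
    """
    Sparse attention via heuristic top-k chunk selection (illustrative sketch).

    q: (d,) query for the current token; k, v: (seq_len, d) cached keys/values.
    Chunks are scored by the dot product between q and each chunk's mean key,
    the top_k chunks are kept, and softmax attention runs only over their tokens.
    """
    seq_len, d = k.shape
    n_chunks = seq_len // chunk_size
    k_chunks = k[: n_chunks * chunk_size].reshape(n_chunks, chunk_size, d)

    chunk_scores = k_chunks.mean(dim=1) @ q                        # (n_chunks,)
    keep = torch.topk(chunk_scores, k=min(top_k, n_chunks)).indices

    token_idx = (keep[:, None] * chunk_size + torch.arange(chunk_size)).reshape(-1)
    k_sel, v_sel = k[token_idx], v[token_idx]

    weights = F.softmax(k_sel @ q / d ** 0.5, dim=0)               # attention over kept tokens
    return weights @ v_sel

out = topk_chunk_attention(torch.randn(32), torch.randn(256, 32), torch.randn(256, 32))
print(out.shape)  # torch.Size([32])
```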

Check out the Paper. Also, don’t forget to follow us on Twitter.

Here’s a brief overview of what we’re building at Marktechpost:

ML News Community - r/machinelearningnews (92k+ members)

Newsletter - airesearchinsights.com/ (30k+ subscribers)

miniCON AI Events - minicon.marktechpost.com

AI Reports & Magazines - magazine.marktechpost.com

AI Dev & Research News - marktechpost.

Disclaimer: info@kdj.com

The information provided is not trading advice. kdj.com assumes no liability for any investments made based on the information provided in this article. Cryptocurrencies are highly volatile, so please do thorough research and invest with caution!

If you believe that content used on this website infringes your copyright, please contact us immediately (info@kdj.com) and we will remove it promptly.
