LLMs, Tokenizers, and Models: A Byte-Level Revolution?

Exploring the latest trends in LLMs, tokenizers, and models, with a focus on the innovative Byte Latent Transformer (BLT) and its implications for the future of AI.
The world of LLMs is constantly evolving. This article surveys the latest trends in LLMs, tokenizers, and models, focusing on the shortcomings of tokenization and the rise of byte-level models, and offers a look at where the field may head next.
The Tokenization Bottleneck
Modern LLMs rely heavily on tokenization, the process that converts text into numerical tokens the model can operate on. This process isn't without flaws. As Pagnoni et al. (2024) point out, tokenization can strip away crucial sub-word semantics, introducing both inefficiencies and vulnerabilities. Typos, domain-specific jargon, and low-resource languages can all trip up tokenizers, ultimately degrading the model's performance.
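To make the failure mode concrete, here is a minimal sketch: a hypothetical greedy longest-match subword tokenizer with a made-up vocabulary (not any production tokenizer), showing how a single typo fragments a word into many more tokens.

```python
# Toy greedy longest-match subword tokenizer with an invented vocabulary,
# illustrating how one typo shatters a word into many tokens.
VOCAB = {"transform", "trans", "form", "er", "t", "r", "a", "n", "s",
         "f", "o", "m", "e", "x"}

def tokenize(text: str) -> list[str]:
    """Greedily match the longest vocabulary entry at each position."""
    tokens, i = [], 0
    while i < len(text):
        for j in range(len(text), i, -1):  # try the longest match first
            if text[i:j] in VOCAB:
                tokens.append(text[i:j])
                i = j
                break
        else:
            raise ValueError(f"no token covers {text[i]!r}")
    return tokens

print(tokenize("transformer"))   # clean word: just 2 tokens
print(tokenize("transfxrmer"))   # one typo: fragments into single characters
```

The clean word compresses into two subwords, while the misspelled variant degrades into a long run of near-character tokens, which is exactly the kind of brittleness the paragraph above describes.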
The Rise of Byte-Level Models: BLT to the Rescue
Enter the Byte Latent Transformer (BLT), a radical new approach that bypasses tokenization altogether. Developed by Meta AI, BLT models language from raw bytes, the most fundamental representation of digital text. This allows the LLM to learn language from the ground up, preserving sub-word semantics and potentially leading to more robust and versatile models.
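The "most fundamental representation" point is easy to see in code. The sketch below (plain Python, not BLT itself) shows that every string, in any language, maps losslessly onto a fixed vocabulary of just 256 byte values:

```python
# A byte-level model's "vocabulary" is fixed at the 256 possible byte values:
# any text, in any language, typos included, maps onto it losslessly.
def to_bytes(text: str) -> list[int]:
    """UTF-8 encode a string into the raw byte values a byte-level model sees."""
    return list(text.encode("utf-8"))

for sample in ["cat", "café", "日本語"]:
    print(sample, "->", to_bytes(sample))

# UTF-8 round-trips exactly, so there is no such thing as an
# out-of-vocabulary input at the byte level.
assert bytes(to_bytes("日本語")).decode("utf-8") == "日本語"
```

Unlike a learned subword vocabulary, this representation needs no training data to define and never encounters an unknown symbol.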
How BLT Works: A Two-Tiered System
BLT employs a clever two-tiered system to handle the computational challenges of processing raw bytes. The Local Encoder compresses easy-to-predict byte segments into latent "patches," significantly shortening the sequence length. The Latent Global Transformer then focuses its computational resources on the more complex linguistic regions. Finally, the Local Decoder decodes the predicted patch vector back into a sequence of raw bytes.
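The patching idea can be sketched in a few lines. This is a deliberately crude stand-in: where BLT uses a small byte-level language model to estimate next-byte entropy, the toy below uses bigram counts over a tiny corpus, and the `threshold` value is invented. The principle is the same: open a new patch wherever the next byte is hard to predict, so easy runs stay grouped and the global model sees a shorter sequence.

```python
import math
from collections import Counter

def next_byte_entropy(prev: int, corpus: bytes) -> float:
    """Crude stand-in for BLT's small byte LM: Shannon entropy of the
    byte that follows `prev`, estimated from bigram counts."""
    followers = Counter(corpus[i + 1] for i in range(len(corpus) - 1)
                        if corpus[i] == prev)
    total = sum(followers.values())
    if total == 0:
        return 8.0  # unseen context: treat as maximally uncertain
    return -sum((c / total) * math.log2(c / total)
                for c in followers.values())

def to_patches(data: bytes, corpus: bytes, threshold: float = 1.0) -> list[bytes]:
    """Start a new patch whenever the next byte is hard to predict;
    predictable runs stay grouped, shortening the sequence."""
    patches, current = [], bytearray(data[:1])
    for i in range(1, len(data)):
        if next_byte_entropy(data[i - 1], corpus) > threshold:
            patches.append(bytes(current))
            current = bytearray()
        current.append(data[i])
    patches.append(bytes(current))
    return patches

corpus = b"the cat sat on the mat. the cat ran after the rat."
print(to_patches(b"the cat", corpus))
```

Concatenating the patches always reconstructs the original bytes, which mirrors the lossless encode/decode loop between BLT's Local Encoder and Local Decoder.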
BLT: A Game Changer?
The BLT architecture offers several potential advantages over traditional token-based models:
- Comparable Scaling: BLT can match the scaling behavior of state-of-the-art token-based architectures like LLaMA 3.
- Dynamic Compute Allocation: BLT dynamically allocates computation based on input complexity, focusing resources where they are needed most.
- Subword Awareness: By processing raw bytes, BLT gains access to the internal structure of words, improving performance on tasks involving fine-grained edits and noisy text.
- Improved Performance on Low-Resource Languages: BLT treats all languages equally from the start, leading to better results in machine translation for languages with limited data.
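The subword-awareness point above can be illustrated with a toy contrast (the token IDs below are invented for the example, not drawn from any real vocabulary): morphological relatives that a tokenizer may map to unrelated integers share an explicit prefix at the byte level.

```python
# Hypothetical tokenizer output: "run" and "running" get unrelated IDs,
# so their shared stem is invisible to a token-based model.
toy_token_ids = {"run": 4821, "running": 977}

def common_prefix_len(a: bytes, b: bytes) -> int:
    """Length of the shared leading byte run between two byte strings."""
    n = 0
    while n < min(len(a), len(b)) and a[n] == b[n]:
        n += 1
    return n

# At the byte level the stem "run" is explicit: the first 3 bytes match.
print(common_prefix_len("run".encode("utf-8"), "running".encode("utf-8")))
```

This shared structure is what lets a byte-level model generalize across inflections, typos, and fine-grained edits that a token vocabulary hides.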
The Future of LLMs: Beyond Tokenization?
The BLT represents a significant step forward in LLM research, challenging the long-standing reliance on tokenization. While tokenizers have become deeply ingrained in the AI ecosystem, the potential benefits of byte-level modeling are hard to ignore.
While Ozak AI is unrelated to tokenization, it is an example of an AI project with real-world market utility. In the coming year, its use case and continued AI adoption could very well make it the smartest and loudest token in the space.
Final Thoughts
Whether BLT or other byte-level approaches become the norm remains to be seen. But one thing is clear: the future of LLMs is likely to involve a move beyond the superficial wrappers we call "languages" and towards a deeper understanding of the raw data itself. Now, if you'll excuse me, I'm going to go ponder the mysteries of bytes and tokens while listening to some bee-themed jazz. It's the buzz!