LLMs, Tokenizers, and Models: A Byte-Level Revolution?

2025/06/25 03:17

Exploring the latest trends in LLMs, tokenizers, and models, with a focus on the innovative Byte Latent Transformer (BLT) and its implications for the future of AI.

The world of LLMs is constantly evolving. This article summarizes the latest trends in LLMs, tokenizers, and models, focusing on the challenges of tokenization and the rise of byte-level alternatives, and closes with some thoughts on where the field may go next.

The Tokenization Bottleneck

Modern LLMs rely heavily on tokenization, a process that converts text into numerical tokens that the model can understand. However, this process isn't without its flaws. As Pagnoni et al. (2024) point out, tokenization can strip away crucial sub-word semantics, leading to inefficiencies and vulnerabilities. Typos, domain-specific language, and low-resource languages can all cause problems for tokenizers, ultimately impacting the model's performance.

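To make the failure mode concrete, here is a minimal sketch using OpenAI's open-source tiktoken library (our choice of tokenizer for illustration; the article does not name one), showing how a single typo can shatter a word into unfamiliar fragments:

```python
# pip install tiktoken -- OpenAI's open-source BPE tokenizer, used here
# purely as a representative example of subword tokenization.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

for text in ["transformer", "trnasformer"]:  # correct spelling vs. a typo
    ids = enc.encode(text)
    pieces = [enc.decode([i]) for i in ids]
    print(f"{text!r} -> {len(ids)} token(s): {pieces}")
```

The misspelled word typically splits into more, smaller pieces than the clean one, so from the model's point of view the two inputs barely resemble each other.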

The Rise of Byte-Level Models: BLT to the Rescue

Enter the Byte Latent Transformer (BLT), a radical new approach that bypasses tokenization altogether. Developed by Meta AI, BLT models language from raw bytes, the most fundamental representation of digital text. This allows the LLM to learn language from the ground up, preserving sub-word semantics and potentially leading to more robust and versatile models.

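By contrast, the byte-level view needs no learned vocabulary at all. The snippet below (plain Python, nothing BLT-specific) shows what modeling from raw bytes starts from: every string reduces to a sequence of integers drawn from a fixed alphabet of 256.

```python
# UTF-8 bytes: the "most fundamental representation of digital text".
text = "naïve café"
byte_ids = list(text.encode("utf-8"))
print(byte_ids)       # [110, 97, 195, 175, 118, 101, 32, 99, 97, 102, 195, 169]
print(len(byte_ids))  # 12 bytes for 10 characters: 'ï' and 'é' take two each
```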

How BLT Works: A Two-Tiered System

BLT employs a clever two-tiered system to handle the computational challenges of processing raw bytes. The Local Encoder compresses easy-to-predict byte segments into latent "patches," significantly shortening the sequence length. The Latent Global Transformer then focuses its computational resources on the more complex linguistic regions. Finally, the Local Decoder decodes the predicted patch vector back into a sequence of raw bytes.

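The BLT paper drives this compression with the entropy of a small byte-level language model: predictable stretches (low entropy) are merged into long patches, while surprising bytes open new ones. The following is our illustrative reconstruction of that patching rule, not Meta's code; next_byte_probs stands in for the small byte LM, and the threshold value is arbitrary.

```python
import math
from typing import Callable, List

def entropy_patches(byte_seq: bytes,
                    next_byte_probs: Callable[[bytes], List[float]],
                    threshold: float = 2.0) -> List[bytes]:
    """Group bytes into patches, opening a new patch whenever the
    small byte LM finds the next byte hard to predict (high entropy)."""
    patches: List[bytes] = []
    current = bytearray()
    for i in range(len(byte_seq)):
        probs = next_byte_probs(byte_seq[:i])  # distribution over the 256 possible next bytes
        h = -sum(p * math.log2(p) for p in probs if p > 0.0)
        if current and h > threshold:          # surprising byte: close the current patch
            patches.append(bytes(current))
            current = bytearray()
        current.append(byte_seq[i])
    if current:
        patches.append(bytes(current))
    return patches

# Dummy stand-in model: uniform over 256 bytes -> 8 bits of entropy everywhere,
# so with threshold 2.0 every byte ends up in its own patch.
uniform = lambda prefix: [1.0 / 256] * 256
print(entropy_patches(b"hello", uniform))  # [b'h', b'e', b'l', b'l', b'o']
```

A real byte LM would instead produce long patches inside predictable words and short ones at word boundaries or rare spans, which is exactly where the Latent Global Transformer's compute is then spent.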

BLT: A Game Changer?

The BLT architecture offers several potential advantages over traditional token-based models:

  • Comparable Scaling: BLT can match the scaling behavior of state-of-the-art token-based architectures like LLaMA 3.
  • Dynamic Compute Allocation: BLT dynamically allocates computation based on input complexity, focusing resources where they are needed most.
  • Subword Awareness: By processing raw bytes, BLT gains access to the internal structure of words, improving performance on tasks involving fine-grained edits and noisy text (see the toy example after this list).
  • Improved Performance on Low-Resource Languages: BLT treats all languages equally from the start, leading to better results in machine translation for languages with limited data.
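
As a toy illustration of the subword-awareness point (our example, not the article's): counting the letters in a word, a task token-based models are notoriously shaky at, becomes trivial once the input is just bytes.

```python
# A fine-grained inspection task that is trivial at the byte level: a
# token-based model never "sees" individual letters; raw bytes expose them.
word = b"strawberry"
print(word.count(b"r"))  # 3
```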

The Future of LLMs: Beyond Tokenization?

The BLT represents a significant step forward in LLM research, challenging the long-standing reliance on tokenization. While tokenizers have become deeply ingrained in the AI ecosystem, the potential benefits of byte-level modeling are hard to ignore.

While Ozak AI is unrelated to tokenization, it is an example of an AI project with real-world market utility. In the coming year it could very well be the smartest and loudest token, thanks to its use case and continued AI adoption.

Final Thoughts

Whether BLT or other byte-level approaches become the norm remains to be seen. But one thing is clear: the future of LLMs is likely to involve a move beyond the superficial wrappers we call "languages" and towards a deeper understanding of the raw data itself. Now, if you'll excuse me, I'm going to go ponder the mysteries of bytes and tokens while listening to some bee-themed jazz. It's the buzz!

Original source: towardsdatascience
