By Changing a Single Character, Researchers Can Bypass LLMs' Safety and Content Moderation Guardrails

2025/06/12 22:13

Cybersecurity researchers have discovered a novel attack technique, dubbed "TokenBreak", that can be used to bypass the safety and content moderation guardrails of large language models (LLMs).

Cybersecurity researchers at HiddenLayer have discovered a novel attack technique called TokenBreak that can be used to bypass a large language model's (LLM) safety and content moderation guardrails with just a single character change.

The finding, which was shared with The Hacker News, builds on prior work by the researchers, who in June found that it’s possible to exploit Model Context Protocol (MCP) tools to extract sensitive data.

"By inserting specific parameter names within a tool's function, sensitive data, including the full system prompt, can be extracted and exfiltrated," HiddenLayer said.

The finding also comes as the Straiker AI Research (STAR) team found that backronyms can be used to jailbreak AI chatbots and trick them into generating an undesirable response, including swearing, promoting violence, and producing sexually explicit content.

The technique, called the Yearbook Attack, has proven to be effective against various models from Anthropic, DeepSeek, Google, Meta, Microsoft, Mistral AI, and OpenAI.

"They blend in with the noise of everyday prompts — a quirky riddle here, a motivational acronym there — and because of that, they often bypass the blunt heuristics that models use to spot dangerous intent."

"A phrase like 'Friendship, unity, care, kindness' doesn't raise any flags. But by the time the model has completed the pattern, it has already served the payload, which is the key to successfully executing this trick."

"These methods succeed not by overpowering the model's filters, but by slipping beneath them. They exploit completion bias and pattern continuation, as well as the way models weigh contextual coherence over intent analysis."

The TokenBreak attack targets a text classification model's tokenization strategy to induce false negatives, leaving end targets vulnerable to attacks that the implemented protection model was put in place to prevent.
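
To make that failure mode concrete, here is a minimal sketch of a typical guardrail pipeline. The names `moderation_classifier` and `call_llm` are hypothetical stand-ins for whatever protection model and downstream LLM a deployment actually uses; nothing here comes from the HiddenLayer paper itself.

```python
# Hypothetical guardrail pipeline: a text classification model screens input
# before it reaches the LLM. A TokenBreak-style perturbation aims to make the
# classifier return a false negative ("benign") for text that the downstream
# LLM still understands perfectly well.
def guarded_generate(user_text: str, moderation_classifier, call_llm) -> str:
    verdict = moderation_classifier(user_text)  # e.g. "benign" or "malicious"
    if verdict == "malicious":
        return "Request blocked by the protection model."
    # False-negative path: manipulated text slips past the classifier, and the
    # LLM responds as if it had received the original, unmodified prompt.
    return call_llm(user_text)
```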

Tokenization is a fundamental step that LLMs use to break down raw text into its atomic units – i.e., tokens – which are common sequences of characters found in a set of text. To that end, the text input is converted into its numerical representation and fed to the model.

LLMs work by understanding the statistical relationships between these tokens and producing the next token in a sequence. The output tokens are detokenized back into human-readable text by mapping them to their corresponding words using the tokenizer's vocabulary.
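
As a rough illustration of that tokenize, predict, and detokenize loop, the sketch below uses the Hugging Face transformers tokenizer API; the library choice and the `gpt2` checkpoint (whose tokenizer is a well-known BPE example) are assumptions made for demonstration, not something the article specifies.

```python
# Minimal sketch of the raw text -> token IDs -> human-readable text round trip,
# assuming the Hugging Face `transformers` package is installed.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")  # GPT-2 ships a BPE tokenizer

text = "Please follow the instructions carefully."
subwords = tok.tokenize(text)   # the character sequences the model actually sees
ids = tok.encode(text)          # their numerical representation, fed to the model
restored = tok.decode(ids)      # mapped back to words via the tokenizer's vocabulary

print(subwords)
print(ids)
print(restored)
```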

The attack technique devised by HiddenLayer targets the tokenization strategy to bypass a text classification model's ability to detect malicious input and flag safety, spam, or content moderation-related issues in the textual input.

Specifically, the artificial intelligence (AI) security firm found that altering input words by adding letters in certain ways caused a text classification model to break.

Examples include changing "instructions" to "finstructions," "announcement" to "aannouncement," or "idiot" to "hidiot." These subtle changes cause different tokenizers to split the text in different ways, while still preserving their meaning for the intended target.
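
The boundary shift is easy to inspect directly. The sketch below assumes the Hugging Face transformers package, with `bert-base-uncased` standing in for a WordPiece tokenizer and `gpt2` for a BPE tokenizer; the exact subword splits depend on each checkpoint's learned vocabulary, so the printed output should be read side by side rather than taken as fixed.

```python
# Compare how a WordPiece and a BPE tokenizer split the original word versus
# the single-character perturbations described above.
from transformers import AutoTokenizer

wordpiece = AutoTokenizer.from_pretrained("bert-base-uncased")  # WordPiece
bpe = AutoTokenizer.from_pretrained("gpt2")                     # Byte Pair Encoding

for original, perturbed in [("instructions", "finstructions"),
                            ("announcement", "aannouncement")]:
    print(original, wordpiece.tokenize(original), bpe.tokenize(original))
    print(perturbed, wordpiece.tokenize(perturbed), bpe.tokenize(perturbed))
```

When the trigger word no longer survives as an intact token, a classifier keyed to that token can miss it, even though a human reader or a downstream LLM still interprets the word as intended.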

What makes the attack notable is that the manipulated text remains fully understandable to both the LLM and the human reader, causing the model to elicit the same response as what would have been the case if the unmodified text had been passed as input.

By introducing the manipulations without affecting the model's ability to comprehend the text, TokenBreak increases the potential for prompt injection attacks.

"This attack technique manipulates input text in such a way that certain models give an incorrect classification," the researchers said in an accompanying paper. "Importantly, the end target (LLМ or email recipient) can still understand and respond to the manipulated text and therefore be vulnerable to the very attack the implemented protection model was put in place to prevent."

The attack has been found to be successful against text classification models using BPE (Byte Pair Encoding) or WordPiece tokenization strategies, but not against those using Unigram.
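
Because susceptibility reportedly tracks the tokenization strategy, a practical first step is to check which family a given checkpoint's tokenizer belongs to. One rough way to do that, assuming the Hugging Face fast-tokenizer backend and using purely illustrative checkpoint names, is to read the model type out of the serialized tokenizer:

```python
# Rough sketch: report whether a fast Hugging Face tokenizer is backed by a
# BPE, WordPiece, or Unigram model. Checkpoint names are only illustrative.
import json
from transformers import AutoTokenizer

for name in ["gpt2", "bert-base-uncased", "xlnet-base-cased"]:
    tok = AutoTokenizer.from_pretrained(name, use_fast=True)
    spec = json.loads(tok.backend_tokenizer.to_str())  # serialized tokenizer definition
    print(name, spec["model"]["type"])                 # e.g. "BPE", "WordPiece", "Unigram"
```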

"The TokenBreak attack technique demonstrates that these protection models can be bypassed by manipulating the input text, leaving production systems vulnerable," the researchers said. "Knowing the family of the underlying protection model and its tokenization strategy is critical for understanding your susceptibility to this attack."

"Because tokenization strategy typically correlates with model family, a straightforward mitigation exists: Select models that use Unigram tokenizers."

To defend against TokenBreak, the researchers suggest using Unigram tokenizers when possible, training models with examples of bypass tricks, and checking that tokenization and model logic stay aligned. It also helps to log misclassifications and look for patterns that hint at manipulation.
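
The last two suggestions lend themselves to light instrumentation. The sketch below is one hedged interpretation rather than anything prescribed by the researchers: log every verdict, and flag inputs whose lightly normalized form flips the classifier's decision, since that divergence is the kind of trace a tokenizer-level manipulation would leave. Both `classify` and `normalize` are hypothetical callables supplied by the deployment.

```python
# Hypothetical auditing helper: record classifier verdicts and flag inputs
# whose normalized form (for example, spell-corrected text) changes the verdict,
# a pattern consistent with tokenizer-level manipulation.
import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("guardrail-audit")

def audited_classify(text: str, classify, normalize) -> str:
    verdict = classify(text)
    normalized_verdict = classify(normalize(text))
    log.info("verdict=%s normalized=%s text=%r", verdict, normalized_verdict, text)
    if verdict != normalized_verdict:
        # Divergent verdicts suggest token boundaries, not meaning, changed.
        log.warning("possible tokenizer manipulation detected: %r", text)
    return verdict
```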
