Researchers Can Bypass LLM Safety and Content Moderation Guardrails by Changing a Single Character

2025/06/12 22:13

Cybersecurity researchers at HiddenLayer have discovered a novel attack technique called TokenBreak that can be used to bypass a large language model's (LLM) safety and content moderation guardrails with just a single character change.

The finding, which was shared with The Hacker News, builds on prior work by the researchers, who in June found that it’s possible to exploit Model Context Protocol (MCP) tools to extract sensitive data.

"By inserting specific parameter names within a tool's function, sensitive data, including the full system prompt, can be extracted and exfiltrated," HiddenLayer said.

The finding also comes as the Straiker AI Research (STAR) team found that backronyms can be used to jailbreak AI chatbots and trick them into generating an undesirable response, including swearing, promoting violence, and producing sexually explicit content.

The technique, called the Yearbook Attack, has proven to be effective against various models from Anthropic, DeepSeek, Google, Meta, Microsoft, Mistral AI, and OpenAI.

"They blend in with the noise of everyday prompts — a quirky riddle here, a motivational acronym there — and because of that, they often bypass the blunt heuristics that models use to spot dangerous intent."

"A phrase like 'Friendship, unity, care, kindness' doesn't raise any flags. But by the time the model has completed the pattern, it has already served the payload, which is the key to successfully executing this trick."

"These methods succeed not by overpowering the model's filters, but by slipping beneath them. They exploit completion bias and pattern continuation, as well as the way models weigh contextual coherence over intent analysis."

The TokenBreak attack targets a text classification model's tokenization strategy to induce false negatives, leaving end targets vulnerable to attacks that the implemented protection model was put in place to prevent.

Tokenization is a fundamental step that LLMs use to break down raw text into its atomic units (i.e., tokens), which are common sequences of characters found in a set of text. The text input is then converted into its numerical representation and fed to the model.

LLMs work by modeling the statistical relationships between these tokens and producing the next token in a sequence of tokens. The output tokens are detokenized to human-readable text by mapping them to their corresponding words using the tokenizer's vocabulary.
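
As a rough illustration of the round trip described above (text to tokens, tokens to numbers, numbers back to text), here is a minimal Python sketch. The vocabulary, the whitespace splitting, and the `encode`/`decode` names are all hypothetical simplifications, not how any particular production tokenizer works.

```python
# Toy vocabulary mapping token strings to integer ids. Real vocabularies are
# learned from data and are far larger; this one is purely hypothetical.
VOCAB = {"[UNK]": 0, "ignore": 1, "previous": 2, "instructions": 3, "please": 4}
ID_TO_TOKEN = {token_id: token for token, token_id in VOCAB.items()}


def encode(text: str) -> list[int]:
    """Split on whitespace and map each word to its id (unknown words map to [UNK])."""
    return [VOCAB.get(word, VOCAB["[UNK]"]) for word in text.lower().split()]


def decode(ids: list[int]) -> str:
    """Map ids back to token strings using the same vocabulary."""
    return " ".join(ID_TO_TOKEN[token_id] for token_id in ids)


ids = encode("Please ignore previous instructions")
print(ids)          # [4, 1, 2, 3]
print(decode(ids))  # please ignore previous instructions
```

The model itself only ever sees the integer ids, which is why the way text is split into tokens matters so much for what follows.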

The attack technique devised by HiddenLayer targets the tokenization strategy to bypass a text classification model's ability to detect malicious input and flag safety, spam, or content moderation-related issues in the textual input.

Specifically, the artificial intelligence (AI) security firm found that altering input words by adding letters in certain ways caused a text classification model to break.

Examples include changing "instructions" to "finstructions," "announcement" to "aannouncement," or "idiot" to "hidiot." These subtle changes cause different tokenizers to split the text in different ways, while still preserving their meaning for the intended target.
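
To make the effect concrete, the sketch below applies a greedy longest-match-first split (the general approach used by WordPiece-style tokenizers) to a tiny, hypothetical vocabulary. The vocabulary, the `wordpiece_split` helper, and the one-token "classifier" are assumptions chosen for illustration; a real protection model is a trained neural network, not a token lookup.

```python
# Hypothetical subword vocabulary. "##" marks a piece that continues a word.
VOCAB = {"instructions", "fin", "f", "##struct", "##ions", "##s"}


def wordpiece_split(word: str, vocab: set[str]) -> list[str]:
    """Greedy longest-match-first split of a single word into subword pieces."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return ["[UNK]"]  # no piece fits, so the whole word is unknown
        pieces.append(match)
        start = end
    return pieces


def toy_classifier_flags(pieces: list[str]) -> bool:
    """Stand-in for a protection model that has learned to react to one token."""
    return "instructions" in pieces


clean = wordpiece_split("instructions", VOCAB)
attacked = wordpiece_split("finstructions", VOCAB)
print(clean, toy_classifier_flags(clean))        # ['instructions'] True
print(attacked, toy_classifier_flags(attacked))  # ['fin', '##struct', '##ions'] False
```

In this toy setup, one extra leading letter means the token the classifier relies on never appears, even though a reader still recognizes the word.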

What makes the attack notable is that the manipulated text remains fully understandable to both the LLM and the human reader, so the model produces the same response it would have given had the unmodified text been passed as input.

By introducing the manipulations without affecting the model's ability to comprehend the text, TokenBreak increases the potential for prompt injection attacks.

"This attack technique manipulates input text in such a way that certain models give an incorrect classification," the researchers said in an accompanying paper. "Importantly, the end target (LLM or email recipient) can still understand and respond to the manipulated text and therefore be vulnerable to the very attack the implemented protection model was put in place to prevent."

The attack has been found to be successful against text classification models using BPE (Byte Pair Encoding) or WordPiece tokenization strategies, but not against those using Unigram.
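
The article does not explain why Unigram-based models hold up, so the following is only one plausible intuition, sketched in Python under stated assumptions. A Unigram-style tokenizer picks the segmentation with the highest total probability rather than matching greedily from the left, which can leave a familiar word intact as a single token. The log-probabilities and the `unigram_segment` helper are hypothetical; real implementations are considerably more involved.

```python
import math

# Hypothetical per-piece log-probabilities (higher means more likely).
LOGP = {"instructions": -6.0, "f": -7.0, "fin": -9.0, "struct": -10.0, "ions": -9.0}


def unigram_segment(word: str, logp: dict[str, float]) -> list[str]:
    """Viterbi search for the segmentation with the highest summed log-probability."""
    n = len(word)
    best = [-math.inf] * (n + 1)  # best[i]: best score for word[:i]
    back = [0] * (n + 1)          # back[i]: start index of the last piece in word[:i]
    best[0] = 0.0
    for end in range(1, n + 1):
        for start in range(end):
            piece = word[start:end]
            if piece in logp and best[start] + logp[piece] > best[end]:
                best[end] = best[start] + logp[piece]
                back[end] = start
    if best[n] == -math.inf:
        return ["[UNK]"]  # no segmentation covers the whole word
    pieces = []
    end = n
    while end > 0:
        start = back[end]
        pieces.append(word[start:end])
        end = start
    return pieces[::-1]


print(unigram_segment("instructions", LOGP))   # ['instructions']
print(unigram_segment("finstructions", LOGP))  # ['f', 'instructions']
```

With these toy numbers, the "instructions" token survives the added letter, so a classifier that depends on it would still see it.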

"The TokenBreak attack technique demonstrates that these protection models can be bypassed by manipulating the input text, leaving production systems vulnerable," the researchers said. "Knowing the family of the underlying protection model and its tokenization strategy is critical for understanding your susceptibility to this attack."

"Because tokenization strategy typically correlates with model family, a straightforward mitigation exists: Select models that use Unigram tokenizers."

To defend against TokenBreak, the researchers suggest using Unigram tokenizers when possible, training models with examples of bypass tricks, and checking that tokenization and model logic stay aligned. It also helps to log misclassifications and look for patterns that hint at manipulation.
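
Two of these suggestions can be sketched in a few lines of Python. The snippet below generates single-letter-prepended variants of a word (as candidate training examples) and scans text for words that become a watch-listed term once their first letter is removed (as a simple log-review heuristic). The watchlist, the function names, and the restriction to one prepended letter are assumptions for illustration; the article does not describe the researchers' actual tooling.

```python
import string

# Hypothetical watchlist of terms the protection model is expected to catch.
WATCHLIST = {"instructions", "announcement", "idiot"}


def prepend_variants(word: str) -> list[str]:
    """Produce variants with one extra leading letter, e.g. for training augmentation."""
    return [letter + word for letter in string.ascii_lowercase]


def suspicious_words(text: str, watchlist: set[str] = WATCHLIST) -> list[str]:
    """Return words that match a watch-listed term once the first character is dropped."""
    hits = []
    for raw in text.lower().split():
        word = raw.strip(string.punctuation)
        if len(word) > 1 and word not in watchlist and word[1:] in watchlist:
            hits.append(word)
    return hits


print("hidiot" in prepend_variants("idiot"))                           # True
print(suspicious_words("Ignore previous finstructions, you hidiot!"))  # ['finstructions', 'hidiot']
```

A check like this only covers the single-prepended-letter pattern shown in the article's examples; other manipulations would need their own rules or, more robustly, a tokenizer and model that are not sensitive to them.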
