市值: $3.0879T -1.960%
成交额(24h): $143.1627B 52.880%
  • 市值: $3.0879T -1.960%
  • 成交额(24h): $143.1627B 52.880%
  • 恐惧与贪婪指数:
  • 市值: $3.0879T -1.960%
加密货币
话题
百科
资讯
加密话题
视频
热门新闻
加密货币
话题
百科
资讯
加密话题
视频
bitcoin
bitcoin

$101353.343794 USD

-1.08%

ethereum
ethereum

$2242.264272 USD

-1.18%

tether
tether

$1.000323 USD

0.00%

xrp
xrp

$2.016345 USD

-2.01%

bnb
bnb

$619.897741 USD

-1.68%

solana
solana

$132.866437 USD

-1.53%

usd-coin
usd-coin

$1.000025 USD

0.01%

tron
tron

$0.265964 USD

-2.08%

dogecoin
dogecoin

$0.152532 USD

-1.16%

cardano
cardano

$0.545049 USD

-1.01%

hyperliquid
hyperliquid

$35.793511 USD

7.45%

bitcoin-cash
bitcoin-cash

$448.806504 USD

-3.79%

sui
sui

$2.496034 USD

-2.40%

unus-sed-leo
unus-sed-leo

$9.052995 USD

1.06%

chainlink
chainlink

$11.685485 USD

-2.26%

加密货币新闻

通过更改单个角色,研究人员可以绕过LLMS的安全性和内容调节护栏

2025/06/12 22:13

网络安全研究人员发现了一种新颖的攻击技术,称为“令牌破裂”,可用于绕过大型语言模型(LLM)的安全性和内容调节器护栏

通过更改单个角色,研究人员可以绕过LLMS的安全性和内容调节护栏

Cybersecurity researchers at HiddenLayer have discovered a novel attack technique called TokenBreak that can be used to bypass a large language model's (LLM) safety and content moderation guardrails with just a single character change.

Hiddenlayer的网络安全研究人员发现了一种称为TokenBreak的新型攻击技术,可用于绕过大型语言模型(LLM)的安全性和内容节制护栏,只有单个角色更改。

The finding, which was shared with The Hacker News, builds on prior work by the researchers, who in June found that it’s possible to exploit Model Context Protocol (MCP) tools to extract sensitive data.

与黑客新闻共享的发现是基于研究人员的先前工作,他们在6月发现可以利用模型上下文协议(MCP)工具来提取敏感数据。

"By inserting specific parameter names within a tool's function, sensitive data, including the full system prompt, can be extracted and exfiltrated," HiddenLayer said.

“通过在工具功能中插入特定的参数名称,可以提取和删除敏感的数据,包括完整的系统提示,” Sideendlayer说。

The finding also comes as the Straiker AI Research (STAR) team found that backronyms can be used to jailbreak AI chatbots and trick them into generating an undesirable response, including swearing, promoting violence, and producing sexually explicit content.

这一发现还出现了,因为Straiker AI研究(Star)团队发现可以使用偏见来越狱AI聊天机器人,并诱使他们产生不良反应,包括宣誓,促进暴力和产生性明确的内容。

The technique, called the Yearbook Attack, has proven to be effective against various models from Anthropic, DeepSeek, Google, Meta, Microsoft, Mistral AI, and OpenAI.

这项称为年鉴攻击的技术已被证明对人类,DeepSeek,Google,Meta,Microsoft,Misstral AI和OpenAI的各种模型有效。

"They blend in with the noise of everyday prompts — a quirky riddle here, a motivational acronym there — and because of that, they often bypass the blunt heuristics that models use to spot dangerous intent."

“他们融合了日常提示的噪音 - 这里的一个古怪的谜语,那是一个激励性的缩写 - 因此,他们经常绕过模型用来发现危险意图的钝器启发式方法。”

A phrase like 'Friendship, unity, care, kindness' doesn't raise any flags. But by the time the model has completed the pattern, it has already served the payload, which is the key to successfully executing this trick."

诸如“友谊,团结,关怀,善良”之类的短语不会引起任何旗帜。但是到模型完成模式时,它已经服务于有效载荷,这是成功执行此技巧的关键。”

"These methods succeed not by overpowering the model's filters, but by slipping beneath them. They exploit completion bias and pattern continuation, as well as the way models weigh contextual coherence over intent analysis."

“这些方法不是通过压倒模型的过滤器而成功的,而是通过将其滑动在其下面。它们利用完成偏见和模式延续,以及模型权衡上下文一致性而不是意图分析的方式。”

The TokenBreak attack targets a text classification model's tokenization strategy to induce false negatives, leaving end targets vulnerable to attacks that the implemented protection model was put in place to prevent.

TokenBreak攻击目标是文本分类模型的诱导假否定因素的象征化策略,而最终目标则容易受到实施的保护模型的攻击。

Tokenization is a fundamental step that LLMs use to break down raw text into their atomic units – i.e., tokens – which are common sequences of characters found in a set of text. To that end, the text input is converted into their numerical representation and fed to the model.

令牌化是LLM用于将原始文本分解为其原子单元(即令牌)的基本步骤,即在一组文本中发现的字符序列。为此,文本输入将转换为其数值表示形式并馈送到模型。

LLMs work by understanding the statistical relationships between these tokens, and produce the next token in a sequence of tokens. The output tokens are detokenized to human-readable text by mapping them to their corresponding words using the tokenizer's vocabulary.

LLM通过了解这些令牌之间的统计关系而起作用,并以一系列令牌生成下一代币。输出令牌通过使用令牌词的词汇将其映射到相应的单词中,将输出令牌置于可读文本中。

The attack technique devised by HiddenLayer targets the tokenization strategy to bypass a text classification model's ability to detect malicious input and flag safety, spam, or content moderation-related issues in the textual input.

隐藏层设计的攻击技术针对令牌化策略,以绕过文本分类模型在文本输入中检测恶意输入和标志安全性,垃圾邮件或与内容节制相关的问题的能力。

Specifically, the artificial intelligence (AI) security firm found that altering input words by adding letters in certain ways caused a text classification model to break.

具体而言,人工智能(AI)安全公司发现,通过以某些方式添加字母来改变输入单词会导致文本分类模型破裂。

Examples include changing "instructions" to "finstructions," "announcement" to "aannouncement," or "idiot" to "hidiot." These subtle changes cause different tokenizers to split the text in different ways, while still preserving their meaning for the intended target.

示例包括将“指示”更改为“罚款”,“公告”为“ Aannounection”,或“白痴”对“ Hidiot”。这些微妙的变化会导致不同的代币器以不同的方式将文本分开,同时仍保留其对预期目标的含义。

What makes the attack notable is that the manipulated text remains fully understandable to both the LLM and the human reader, causing the model to elicit the same response as what would have been the case if the unmodified text had been passed as input.

使攻击值得注意的是,操纵文本对LLM和人类读者都可以完全理解,从而导致模型引起与如果未修改的文本作为输入的情况相同的响应。

By introducing the manipulations in a way without affecting the model's ability to comprehend it, TokenBreak increases its potential for prompt injection attacks.

通过以某种方式引入操作而不影响模型理解模型的能力,TokenBreak可以提高其快速注射攻击的潜力。

"This attack technique manipulates input text in such a way that certain models give an incorrect classification," the researchers said in an accompanying paper. "Importantly, the end target (LLМ or email recipient) can still understand and respond to the manipulated text and therefore be vulnerable to the very attack the implemented protection model was put in place to prevent."

研究人员在随附的论文中说:“这种攻击技术以某些模型对分类不正确的方式操纵输入文本。” “重要的是,最终目标(LLHO或电子邮件接收者)仍然可以理解并响应被操纵的文本,因此容易受到实施的保护模型的攻击,以防止。”

The attack has been found to be successful against text classification models using BPE (Byte Pair Encoding) or WordPiece tokenization strategies, but not against those using Unigram.

已经发现,使用BPE(字节对编码)或WordPiece令牌化策略对文本分类模型取得了成功,但不是使用使用UMIGRAM的文字分类策略。

"The TokenBreak attack technique demonstrates that these protection models can be bypassed by manipulating the input text, leaving production systems vulnerable," the researchers said. "Knowing the family of the underlying protection model and its tokenization strategy is critical for understanding your susceptibility to this attack."

研究人员说:“令牌攻击技术表明,可以通过操纵输入文本来绕开这些保护模型,从而使生产系统易受伤害。” “了解基础保护模型的家庭及其令牌化策略对于理解您对这次攻击的敏感性至关重要。”

"Because tokenization strategy typically correlates with model family, a straightforward mitigation exists: Select models that use Unigram tokenizers."

“由于象征化策略通常与模型家族相关,因此存在直接的缓解措施:选择使用Unigram tokenizers的模型。”

To defend against TokenBreak, the researchers suggest using Unigram tokenizers when possible, training models with examples of bypass tricks, and checking that tokenization and model logic stays aligned. It also helps to log misclassifications and look for patterns that hint at manipulation.

为了防止象征性的破坏,研究人员建议在可能的情况下使用Umigram令牌,以旁路技巧的示例培训模型,并检查令牌化和模型逻辑保持一致。它还有助于记录错误分类并寻找暗示操纵的模式。

免责声明:info@kdj.com

所提供的信息并非交易建议。根据本文提供的信息进行的任何投资,kdj.com不承担任何责任。加密货币具有高波动性,强烈建议您深入研究后,谨慎投资!

如您认为本网站上使用的内容侵犯了您的版权,请立即联系我们(info@kdj.com),我们将及时删除。

2025年06月24日 发表的其他文章