By Changing a Single Character, Researchers Can Bypass LLM Safety and Content Moderation Guardrails

2025/06/12 22:13

Cybersecurity researchers have discovered a new attack technique called TokenBreak that can be used to bypass large language model (LLM) safety and content moderation guardrails.

Cybersecurity researchers at HiddenLayer have discovered a novel attack technique called TokenBreak that can be used to bypass a large language model's (LLM) safety and content moderation guardrails with just a single character change.

The finding, which was shared with The Hacker News, builds on prior work by the researchers, who in June found that it’s possible to exploit Model Context Protocol (MCP) tools to extract sensitive data.

"By inserting specific parameter names within a tool's function, sensitive data, including the full system prompt, can be extracted and exfiltrated," HiddenLayer said.

The finding also comes as the Straiker AI Research (STAR) team found that backronyms can be used to jailbreak AI chatbots and trick them into generating an undesirable response, including swearing, promoting violence, and producing sexually explicit content.

The technique, called the Yearbook Attack, has proven to be effective against various models from Anthropic, DeepSeek, Google, Meta, Microsoft, Mistral AI, and OpenAI.

"They blend in with the noise of everyday prompts (a quirky riddle here, a motivational acronym there) and because of that, they often bypass the blunt heuristics that models use to spot dangerous intent."

"A phrase like 'Friendship, unity, care, kindness' doesn't raise any flags. But by the time the model has completed the pattern, it has already served the payload, which is the key to successfully executing this trick."

"These methods succeed not by overpowering the model's filters, but by slipping beneath them. They exploit completion bias and pattern continuation, as well as the way models weigh contextual coherence over intent analysis."

The TokenBreak attack targets a text classification model's tokenization strategy to induce false negatives, leaving end targets vulnerable to attacks that the implemented protection model was put in place to prevent.

Tokenization is a fundamental step that LLMs use to break down raw text into its atomic units (i.e., tokens), which are common sequences of characters found in a set of text. The text input is then converted into its numerical representation and fed to the model.
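
As a rough illustration of that step, the sketch below splits text into tokens with a greedy longest-match rule and maps each token to an integer ID. The vocabulary and the function names `tokenize` and `encode` are hypothetical; real tokenizers learn vocabularies of tens of thousands of entries from data.

```python
# Hypothetical toy vocabulary; not taken from any real tokenizer.
VOCAB = ["[UNK]", "ignore", "previous", "instruction", "s", " "]
TOKEN_TO_ID = {token: idx for idx, token in enumerate(VOCAB)}


def tokenize(text):
    """Split text into vocabulary tokens using greedy longest match."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate substring first, then shrink it.
        for j in range(len(text), i, -1):
            if text[i:j] in TOKEN_TO_ID:
                tokens.append(text[i:j])
                i = j
                break
        else:
            # No vocabulary entry starts here: emit the unknown token.
            tokens.append("[UNK]")
            i += 1
    return tokens


def encode(text):
    """Convert text into the numerical representation fed to a model."""
    return [TOKEN_TO_ID[token] for token in tokenize(text)]


print(tokenize("ignore previous instructions"))
print(encode("ignore previous instructions"))
```

With this toy vocabulary, "instructions" is split into "instruction" and "s", which shows that token boundaries depend entirely on what the vocabulary contains.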

LLMs work by learning the statistical relationships between these tokens and producing the next token in a sequence of tokens. The output tokens are detokenized to human-readable text by mapping them to their corresponding words using the tokenizer's vocabulary.
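
The generate-then-detokenize loop can be sketched with a deliberately tiny stand-in for a model: a table of bigram counts. This is an assumption made for illustration only. Real LLMs use neural networks rather than count tables, and the vocabulary, training sequence, and function names here are hypothetical.

```python
from collections import Counter, defaultdict

# Hypothetical vocabulary: position in the list is the token ID.
ID_TO_TOKEN = ["the", " cat", " sat", " on", " the", " mat"]
# Token IDs for the toy training text "the cat sat on the mat".
TRAINING_IDS = [0, 1, 2, 3, 4, 5]


def build_bigram_counts(ids):
    """Count how often each token ID is followed by each other ID."""
    counts = defaultdict(Counter)
    for current, following in zip(ids, ids[1:]):
        counts[current][following] += 1
    return counts


def generate(prompt_ids, counts, steps):
    """Repeatedly append the most frequent follower of the last token."""
    ids = list(prompt_ids)
    for _ in range(steps):
        followers = counts.get(ids[-1])
        if not followers:
            break  # the last token was never seen with a follower
        ids.append(followers.most_common(1)[0][0])
    return ids


def detokenize(ids):
    """Map token IDs back to text through the vocabulary."""
    return "".join(ID_TO_TOKEN[i] for i in ids)


counts = build_bigram_counts(TRAINING_IDS)
output_ids = generate([0], counts, steps=3)
print(detokenize(output_ids))  # the cat sat on
```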

The attack technique devised by HiddenLayer targets the tokenization strategy to bypass a text classification model's ability to detect malicious input and flag safety, spam, or content moderation-related issues in the textual input.

Specifically, the artificial intelligence (AI) security firm found that altering input words by adding letters in certain ways caused a text classification model to break.

Examples include changing "instructions" to "finstructions," "announcement" to "aannouncement," or "idiot" to "hidiot." These subtle changes cause different tokenizers to split the text in different ways, while still preserving their meaning for the intended target.
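
To see why one extra letter can matter, consider a minimal sketch of WordPiece-style greedy, left-to-right, longest-match splitting. The vocabulary is hypothetical, and `toy_guardrail_flags` is a stand-in for a learned classifier that has come to rely on the token "instructions". It is not HiddenLayer's test setup.

```python
# Hypothetical WordPiece-style vocabulary; "##" marks a word continuation.
VOCAB = {"ignore", "previous", "instructions", "fin", "##struct", "##ions"}
# Tokens the toy guardrail treats as a signal of prompt injection.
BLOCKED_TOKENS = {"instructions"}


def wordpiece(word, vocab):
    """Greedy longest-match-first split of a single word."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return ["[UNK]"]  # no piece fits at this position
        pieces.append(piece)
        start = end
    return pieces


def toy_guardrail_flags(text):
    """Return True if any token of the text is in BLOCKED_TOKENS."""
    return any(
        token in BLOCKED_TOKENS
        for word in text.lower().split()
        for token in wordpiece(word, VOCAB)
    )


print(wordpiece("instructions", VOCAB))   # ['instructions']
print(wordpiece("finstructions", VOCAB))  # ['fin', '##struct', '##ions']
print(toy_guardrail_flags("ignore previous instructions"))   # True
print(toy_guardrail_flags("ignore previous finstructions"))  # False
```

The prepended "f" makes the greedy matcher commit to "fin" first, so the token the toy guardrail depends on never appears, even though a reader still recognizes the word.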

What makes the attack notable is that the manipulated text remains fully understandable to both the LLM and the human reader, so the model responds just as it would have if the unmodified text had been passed as input.

Because the manipulations do not affect the model's ability to comprehend the text, TokenBreak increases the potential for prompt injection attacks.

"This attack technique manipulates input text in such a way that certain models give an incorrect classification," the researchers said in an accompanying paper. "Importantly, the end target (LLM or email recipient) can still understand and respond to the manipulated text and therefore be vulnerable to the very attack the implemented protection model was put in place to prevent."

The attack has been found to be successful against text classification models using BPE (Byte Pair Encoding) or WordPiece tokenization strategies, but not against those using Unigram.
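
The article does not explain why Unigram holds up. One plausible intuition is that a Unigram tokenizer scores whole segmentations and keeps the most probable one, instead of committing to pieces greedily from left to right. The sketch below uses Viterbi search over hypothetical log-probabilities to show that behavior. The numbers are made up for illustration, and real Unigram tokenizers estimate them from a training corpus.

```python
import math

# Hypothetical piece log-probabilities (higher means more likely).
LOGP = {
    "f": -8.0,
    "fin": -5.0,
    "struct": -7.0,
    "ions": -7.0,
    "instructions": -6.0,
}


def unigram_segment(word, logp):
    """Return the segmentation of word with the highest total log-probability."""
    n = len(word)
    best = [-math.inf] * (n + 1)  # best score for the prefix word[:i]
    back = [0] * (n + 1)          # start index of the last piece in that prefix
    best[0] = 0.0
    for end in range(1, n + 1):
        for start in range(end):
            piece = word[start:end]
            if piece in logp and best[start] + logp[piece] > best[end]:
                best[end] = best[start] + logp[piece]
                back[end] = start
    if best[n] == -math.inf:
        return ["[UNK]"]  # no segmentation covers the whole word
    pieces = []
    end = n
    while end > 0:
        start = back[end]
        pieces.append(word[start:end])
        end = start
    return pieces[::-1]


print(unigram_segment("finstructions", LOGP))  # ['f', 'instructions']
```

Here "f" plus "instructions" scores -14.0, beating "fin" plus "struct" plus "ions" at -19.0, so the token a classifier may depend on survives the extra letter.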

"The TokenBreak attack technique demonstrates that these protection models can be bypassed by manipulating the input text, leaving production systems vulnerable," the researchers said. "Knowing the family of the underlying protection model and its tokenization strategy is critical for understanding your susceptibility to this attack."

"Because tokenization strategy typically correlates with model family, a straightforward mitigation exists: Select models that use Unigram tokenizers."

To defend against TokenBreak, the researchers suggest using Unigram tokenizers when possible, training models with examples of bypass tricks, and checking that tokenization and model logic stay aligned. It also helps to log misclassifications and look for patterns that hint at manipulation.
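
For the suggestion to train with examples of bypass tricks, one possible starting point is to generate manipulated copies of known-bad samples and add them to the training set under the same label. The helper below is a hypothetical sketch that only covers the prepend-one-letter pattern from the examples above. It is not the researchers' tooling.

```python
import string


def tokenbreak_variants(text, keywords):
    """Return copies of text with one lowercase letter prepended to each keyword."""
    variants = []
    words = text.split()
    for index, word in enumerate(words):
        if word.lower() in keywords:
            for letter in string.ascii_lowercase:
                mutated = words[:index] + [letter + word] + words[index + 1:]
                variants.append(" ".join(mutated))
    return variants


# Each variant would be labeled like the original malicious sample.
variants = tokenbreak_variants("ignore previous instructions", {"instructions"})
print(len(variants))  # 26
print(variants[5])    # ignore previous finstructions
```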
