Introducing the Chain-of-Experts (CoE) Approach: A New Paradigm for Sparse Neural Networks

2025/03/04 13:57

Large language models have greatly advanced our understanding of artificial intelligence, but scaling these models efficiently remains difficult.

Large language models (LLMs) have revolutionized our understanding of artificial intelligence (AI), yet scaling these models efficiently remains a critical challenge. Traditional Mixture-of-Experts (MoE) architectures activate only a subset of experts per token in order to save computation. This design has two main drawbacks. First, experts process tokens in complete isolation: each expert performs its task without any communication with the others, which may limit the model's ability to integrate diverse perspectives during processing. Second, although MoE models use a sparse activation pattern, they still require considerable memory, because every expert's parameters must be stored even if only a few experts are active for any given token. These observations suggest that while MoE models are a step forward in scalability, their design may limit both performance and resource efficiency.
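The gap between stored and active parameters can be made concrete with a minimal sketch of top-k routing. The sizes below (64 experts, 8 active, one million parameters per expert) and the helper names are hypothetical values chosen for illustration, and the softmax router is a common choice rather than a detail confirmed by the article.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_route(router_scores, k):
    """Return (expert_index, weight) pairs for the k highest-scoring experts."""
    probs = softmax(router_scores)
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in chosen)
    # Renormalize so the weights of the selected experts sum to 1.
    return [(i, probs[i] / norm) for i in chosen]


# Hypothetical sizes, chosen only for illustration.
NUM_EXPERTS = 64
TOP_K = 8
PARAMS_PER_EXPERT = 1_000_000

random.seed(0)
scores = [random.gauss(0.0, 1.0) for _ in range(NUM_EXPERTS)]
routes = top_k_route(scores, TOP_K)

stored_params = NUM_EXPERTS * PARAMS_PER_EXPERT  # all experts stay in memory
active_params = TOP_K * PARAMS_PER_EXPERT        # experts used for this token

print(f"experts chosen: {[i for i, _ in routes]}")
print(f"active fraction of expert parameters: {active_params / stored_params:.3f}")
```

With these assumed sizes, only one eighth of the expert parameters take part in processing a token, yet all of them occupy memory.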

Chain-of-Experts (CoE)

Chain-of-Experts (CoE) offers a fresh perspective on MoE architectures by introducing sequential communication among experts. Unlike the independent processing in traditional MoE models, CoE processes tokens through a series of iterations within each layer. In this arrangement, the output of one expert serves as the input for the next, creating a chain in which experts build on one another's work. This sequential interaction does not simply stack layers; it gives a more integrated approach to token processing, in which each expert refines the token's representation based on previous outputs. The stated goal is to use memory more efficiently.

Technical Details and Benefits

At the heart of the CoE method is an iterative process that redefines how experts interact. For instance, consider a configuration described as CoE-2(4/64): the model runs two iterations per token, selecting four experts from a pool of 64 in each iteration. This contrasts with traditional MoE, which makes a single pass through one selected group of experts.
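The notation can be unpacked with a short sketch. It assumes the label reads as CoE-<iterations>(<experts per iteration>/<pool size>), as the description above suggests. The `SparseConfig` and `parse_coe` names are hypothetical helpers, and the MoE baseline of eight experts drawn from the same pool of 64 is an assumption made here for comparison.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class SparseConfig:
    iterations: int             # sequential passes inside one layer
    experts_per_iteration: int  # experts selected in each pass
    pool_size: int              # experts available to choose from

    @property
    def expert_calls_per_token(self):
        """Expert evaluations one token triggers in one layer."""
        return self.iterations * self.experts_per_iteration


def parse_coe(label):
    """Parse a label such as 'CoE-2(4/64)' (assumed notation)."""
    match = re.fullmatch(r"CoE-(\d+)\((\d+)/(\d+)\)", label)
    if match is None:
        raise ValueError(f"unrecognized configuration label: {label!r}")
    iterations, selected, pool = (int(g) for g in match.groups())
    return SparseConfig(iterations, selected, pool)


coe = parse_coe("CoE-2(4/64)")
# Assumed baseline: a single pass that selects eight experts from 64.
moe = SparseConfig(iterations=1, experts_per_iteration=8, pool_size=64)

print(coe.expert_calls_per_token, moe.expert_calls_per_token)
```

Under this reading, both configurations evaluate the same number of experts per token per layer, which is what makes a like-for-like compute comparison possible.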

Another key technical element in CoE is the independent gating mechanism. In conventional MoE models, the gating function decides which experts should process a token, and this decision is made once per token per layer. CoE takes this a step further by making the gating decision independently in each iteration. This flexibility encourages a form of specialization, because the experts chosen in a later iteration can depend on the information produced by earlier iterations.
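A minimal sketch of per-iteration gating follows, assuming each iteration has its own linear router that scores experts against the current hidden state. The dimensions, the random weights, and the toy experts (each contributes a fixed vector) are illustrative assumptions, not details taken from the CoE work.

```python
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def top_k_indices(scores, k):
    """Indices of the k largest scores."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]


def random_matrix(rows, cols, rng):
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


# Hypothetical sizes mirroring the CoE-2(4/64) example.
DIM, NUM_EXPERTS, TOP_K, ITERATIONS = 8, 64, 4, 2
rng = random.Random(0)

# Assumption: one independent linear router per iteration.
routers = [random_matrix(NUM_EXPERTS, DIM, rng) for _ in range(ITERATIONS)]
# Toy experts: each expert simply contributes a fixed vector.
expert_vectors = random_matrix(NUM_EXPERTS, DIM, rng)

hidden = [rng.gauss(0.0, 1.0) for _ in range(DIM)]
selections = []
for router in routers:
    # Gating sees the hidden state as updated by earlier iterations.
    scores = [dot(row, hidden) for row in router]
    chosen = top_k_indices(scores, TOP_K)
    selections.append(chosen)
    # Average the chosen experts' vectors and add them to the hidden state.
    update = [sum(expert_vectors[e][d] for e in chosen) / TOP_K for d in range(DIM)]
    hidden = [h + u for h, u in zip(hidden, update)]

for step, chosen in enumerate(selections, start=1):
    print(f"iteration {step}: experts {chosen}")
```

Because the second router scores an updated hidden state with its own weights, its choice of experts is free to differ from the first.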

Furthermore, CoE uses inner residual connections. Instead of adding the original token representation back only once, after the entire sequence of processing (an outer residual connection), CoE applies a residual connection within each iteration. This design helps to preserve the token's information while allowing incremental refinement at every step.
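The difference between the two residual placements can be sketched with toy functions standing in for the expert computation of each iteration. The functions `halve` and `quarter_negate` are arbitrary placeholders chosen so the arithmetic is easy to follow; they are not real experts.

```python
def add(a, b):
    return [x + y for x, y in zip(a, b)]


def outer_residual(x, iteration_fns):
    """Run all iterations back to back, then add the original input once."""
    h = x
    for fn in iteration_fns:
        h = fn(h)
    return add(x, h)


def inner_residual(x, iteration_fns):
    """Add each iteration's output to the state that iteration received."""
    h = x
    for fn in iteration_fns:
        h = add(h, fn(h))
    return h


# Placeholder "expert" computations for two iterations (illustrative only).
def halve(v):
    return [0.5 * e for e in v]


def quarter_negate(v):
    return [-0.25 * e for e in v]


x = [1.0, 2.0]
steps = [halve, quarter_negate]

print(outer_residual(x, steps))  # [0.875, 1.75]
print(inner_residual(x, steps))  # [1.125, 2.25]
```

In the inner variant, the second iteration receives the input plus the first iteration's refinement, so each step adjusts a state that still carries the original token's information.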

These technical innovations combine to create a model that aims to retain performance with fewer resources and provides a more nuanced processing pathway, which could be valuable for tasks requiring layered reasoning.

Experimental Results and Insights

Preliminary experiments, such as pretraining on math-related tasks, show promise for the Chain-of-Experts method. In a configuration denoted as CoE-2(4/64), two iterations of four experts from a pool of 64 were used in each layer. Compared with traditional MoE operating under the same computational constraints, CoE-2(4/64) achieved a lower validation loss (1.12 vs. 1.20) without any increase in memory or computational cost.

The researchers also varied the configurations of Chain-of-Experts and compared them with traditional Mixture-of-Experts (MoE) models. For example, they tested CoE-2(4/64), CoE-1(8/64), and MoE(8) models, all operating within similar computational and memory footprints. Their findings showed that increasing the iteration count in Chain-of-Experts yielded benefits comparable to or even better than increasing the number of experts selected in a single pass. Even when the models were deployed on the same hardware and subjected to the same computational constraints, Chain-of-Experts demonstrated an advantage in terms of both performance and resource utilization.

In one experiment, a single MoE layer with eight experts was compared with two Chain-of-Experts layers, each selecting four experts. Despite selecting fewer experts per layer, Chain-of-Experts achieved better performance. Moreover, when the experts' capacity (output dimension) was varied while the total parameter count was held constant, Chain-of-Experts configurations showed up to an 18% reduction in memory usage while achieving similar or slightly better performance.

Another key finding was the dramatic increase in the number of possible expert combinations. With two iterations of four experts from a pool of 64, there were reportedly 3.8 × 10¹⁰⁴ different expert combinations in a single layer of Chain-of-Experts. In contrast, a single layer of MoE with eight experts had only 2.2 × 10⁴² combinations.
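One simple way to see why iterating enlarges the selection space is to count expert sets with binomial coefficients. The sketch below assumes unordered selection within an iteration and independent selection across iterations. Under this convention the per-layer counts come out far smaller than the figures quoted above, which may follow a different counting convention (for example, counting over multiple layers) that the article does not spell out.

```python
from math import comb

POOL = 64

# Single pass choosing 8 of 64 experts (unordered).
moe_sets = comb(POOL, 8)

# Two independent passes, each choosing 4 of 64 experts (unordered).
coe_sets = comb(POOL, 4) ** 2

ratio = coe_sets / moe_sets

print(f"MoE, 8 of 64 in one pass:   {moe_sets:,}")
print(f"CoE, 4 of 64 in two passes: {coe_sets:,}")
print(f"ratio: {ratio:.1f}")
```

Even under this conservative count, two iterations of four experts admit roughly 91 times as many selections as one pass of eight, for the same number of expert evaluations.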
