Large language models (LLMs) have revolutionized our understanding of artificial intelligence (AI), yet scaling these models efficiently remains a critical challenge. Traditional Mixture-of-Experts (MoE) architectures are designed to activate only a subset of experts per token in order to economize on computation. However, this design leads to two main issues. Firstly, experts process tokens in complete isolation—each expert performs its task without any cross-communication with others, which may limit the model’s ability to integrate diverse perspectives during processing. Secondly, although MoE models employ a sparse activation pattern, they still require considerable memory. This is because the overall parameter count is high, even if only a few experts are actively used at any given time. These observations suggest that while MoE models are a step forward in scalability, their inherent design may limit both performance and resource efficiency.
Chain-of-Experts (CoE)
Chain-of-Experts (CoE) offers a fresh perspective on MoE architectures by introducing a mechanism for sequential communication among experts. Unlike the independent processing seen in traditional MoE models, CoE allows tokens to be processed in a series of iterations within each layer. In this arrangement, the output of one expert serves as the input for the next, creating a communicative chain that enables experts to build upon one another’s work. This sequential interaction does not simply stack layers; it facilitates a more integrated approach to token processing, where each expert refines the token’s meaning based on previous outputs. The goal is to use memory more efficiently.
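The contrast between independent and chained expert processing can be sketched in a few lines. This is an illustrative toy, not the paper's implementation: each expert is a plain callable, and outputs are combined with a simple average as a stand-in for weighted gating.

```python
def moe_layer(token, experts):
    # Traditional MoE: the selected experts process the token
    # independently, and their outputs are combined (here: averaged).
    outputs = [e(token) for e in experts]
    return sum(outputs) / len(outputs)

def coe_layer(token, experts_per_iteration):
    # Chain-of-Experts: the combined output of one iteration's experts
    # becomes the input to the next iteration, forming the chain.
    x = token
    for experts in experts_per_iteration:
        outputs = [e(x) for e in experts]
        x = sum(outputs) / len(outputs)
    return x
```

With two toy experts, `moe_layer` touches the original token once, while `coe_layer` lets the second iteration refine the first iteration's result.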
Technical Details and Benefits
At the heart of the CoE method is an iterative process that redefines how experts interact. For instance, consider a configuration described as CoE-2(4/64): the model performs two iterations per token, selecting four experts from a pool of 64 in each cycle. This contrasts with traditional MoE, which uses a single pass through a pre-selected group of experts.
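A quick back-of-the-envelope check shows why CoE-2(4/64) is compute-matched against a single-pass MoE with eight experts, assuming cost is proportional to the number of expert passes per token:

```python
def expert_passes(iterations, experts_per_iteration):
    # Total expert invocations per token per layer.
    return iterations * experts_per_iteration

moe_cost = expert_passes(1, 8)   # traditional MoE(8): one pass, 8 experts
coe_cost = expert_passes(2, 4)   # CoE-2(4/64): two passes, 4 experts each
assert moe_cost == coe_cost == 8  # same computational budget
```

This matched budget is what makes the later loss comparisons (1.12 vs. 1.20) an apples-to-apples result.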
Another key technical element in CoE is the independent gating mechanism. In conventional MoE models, the gating function decides which experts should process a token, and these decisions are made once per token per layer. However, CoE takes this a step further by allowing each expert’s gating decision to be made independently during each iteration. This flexibility encourages a form of specialization, as an expert can adjust its processing based on the information received from earlier iterations.
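The per-iteration gating idea can be sketched as follows. The key assumption, made explicit here, is that the gate scores depend on the current token representation, so re-running the gate after each iteration can route to a different expert subset. The scoring function below is a deterministic toy stand-in, not the actual learned gate.

```python
def gate(x, num_experts=64, k=4):
    # Toy gate: scores depend on the current value of x, so different
    # iterations can select different top-k expert subsets.
    scores = [(x * (i + 1)) % 7 for i in range(num_experts)]
    return sorted(range(num_experts), key=lambda i: scores[i], reverse=True)[:k]

def coe_routing(x, iterations=2):
    # Re-evaluate the gate before every iteration (CoE), rather than
    # once per token per layer (conventional MoE).
    chosen = []
    for _ in range(iterations):
        idx = gate(x)
        chosen.append(idx)
        x = x + sum(idx)  # stand-in for expert processing updating x
    return chosen
```

Because the token representation changes between iterations, the second gating decision can differ from the first, which is what enables the specialization described above.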
Furthermore, the use of inner residual connections in CoE enhances the model. Instead of simply adding the original token back after the entire sequence of processing (an outer residual connection), CoE integrates residual connections within each iteration. This design helps to maintain the integrity of the token’s information while allowing for incremental improvements at every step.
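The difference between the two residual placements can be made concrete. In this sketch, `f` stands in for one iteration of expert processing; this is an illustration of the placement, not the paper's exact formulation.

```python
def outer_residual(x, f, iterations):
    # Outer residual: the original token is added back once,
    # after the entire sequence of processing.
    y = x
    for _ in range(iterations):
        y = f(y)
    return x + y

def inner_residual(x, f, iterations):
    # Inner residual (CoE): the residual is applied within each
    # iteration, preserving the token while refining it step by step.
    for _ in range(iterations):
        x = x + f(x)
    return x
```

With `f = lambda v: 0.5 * v` and two iterations, the outer variant yields `x + 0.25x` while the inner variant yields `2.25x`, showing how inner residuals compound incremental refinements.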
These technical innovations combine to create a model that aims to retain performance with fewer resources and provides a more nuanced processing pathway, which could be valuable for tasks requiring layered reasoning.
Experimental Results and Insights
Preliminary experiments, such as pretraining on math-related tasks, show promise for the Chain-of-Experts method. In a configuration denoted as CoE-2(4/64), two iterations of four experts from a pool of 64 were used in each layer. Compared with traditional MoE operating under the same computational constraints, CoE-2(4/64) achieved a lower validation loss (1.12 vs. 1.20) without any increase in memory or computational cost.
The researchers also varied the configurations of Chain-of-Experts and compared them with traditional Mixture-of-Experts (MoE) models. For example, they tested CoE-2(4/64), CoE-1(8/64), and MoE(8) models, all operating within similar computational and memory footprints. Their findings showed that increasing the iteration count in Chain-of-Experts yielded benefits comparable to or even better than increasing the number of experts selected in a single pass. Even when the models were deployed on the same hardware and subjected to the same computational constraints, Chain-of-Experts demonstrated an advantage in terms of both performance and resource utilization.
In one experiment, a single layer of MoE with eight experts was compared with two layers of Chain-of-Experts, each selecting four experts. Despite having fewer experts in each layer, Chain-of-Experts achieved better performance. Moreover, when varying the experts' capacity (output dimension) while keeping the total parameters constant, Chain-of-Experts configurations showed up to an 18% reduction in memory usage while realizing similar or slightly better performance.
Another key finding was the dramatic increase in the number of possible expert combinations. With two iterations of four experts drawn from a pool of 64, there were 3.8 x 10¹⁰⁴ different expert combinations in a single layer of Chain-of-Experts. In contrast, a single layer of MoE with eight experts had only 2.2 x 10⁴² combinations.