
Paper report

# MoE (Mixture-of-Experts)

## What is a Mixture-of-Experts model?

A Mixture-of-Experts model is built on the Transformer architecture, so it helps to understand the Transformer architecture first. The figure below illustrates the Transformer architecture in detail.

A Mixture-of-Experts model consists of two key components:
* Sparse MoE layers: these replace the feed-forward network (FFN) layers of a standard Transformer. An MoE layer contains a number of "experts" (e.g., 8), and each expert is itself an independent neural network. In practice the experts are usually FFNs, but they can also be more complex networks, or even MoE layers themselves, which yields a hierarchical MoE.
* Gating network (router): this component decides which tokens are sent to which expert. For example, in the figure the token "More" might be sent to the second expert while the token "Parameters" is sent to the first; sometimes a token is even sent to several experts. How tokens are routed is a key design point of MoE, because the router consists of learned parameters and is pre-trained together with the rest of the network. A minimal sketch of such a layer follows this list.
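
A minimal PyTorch sketch of such a top-2 MoE layer (a linear router plus FFN experts); the class name, sizes, and the dense dispatch loop are illustrative, not taken from any of the papers below:

```python
import torch
import torch.nn as nn

class SimpleMoELayer(nn.Module):
    """Drop-in replacement for a Transformer FFN block: each token is processed
    by its top-2 experts and the results are combined with the gate weights."""
    def __init__(self, d_model=512, d_ff=2048, num_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, num_experts)        # gating network
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        )

    def forward(self, x):                                     # x: [tokens, d_model]
        gates, experts = self.router(x).softmax(dim=-1).topk(self.top_k, dim=-1)
        gates = gates / gates.sum(dim=-1, keepdim=True)       # renormalize the kept gates
        out = torch.zeros_like(x)
        for k in range(self.top_k):                           # dense loop; real systems dispatch sparsely
            for e, expert in enumerate(self.experts):
                mask = experts[:, k] == e
                if mask.any():
                    out[mask] += gates[mask, k, None] * expert(x[mask])
        return out

print(SimpleMoELayer()(torch.randn(16, 512)).shape)           # torch.Size([16, 512])
```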

# TUTEL: ADAPTIVE MIXTURE-OF-EXPERTS AT SCALE

## Abstract

Based on adaptive parallelism/pipelining optimization -> Flexible All-to-All, two-dimensional hierarchical (2DH) All-to-All, and fast encode/decode (the contributions), for both training and inference.

## Introduction

The dynamic nature of MoE: this implies that the workload of the experts is fundamentally uncertain.

### Three approaches

* Adjust parallelism at runtime: large redistribution overhead and GPU memory cost.
* Load-balancing loss: can harm model accuracy.
* Tutel: dynamically switches the parallelism strategy at every iteration without any extra switching overhead.

## BACKGROUND & MOTIVATION

### Sparsely-gated Mixture-of-Experts (MoE)

### Dynamic Workload of MoE

What about f? Set f to a static upper bound of the capacity factor?

### Static Parallelism

The best parallelism method depends on the workload, and switching between different parallelism methods at runtime would incur a substantial overhead.

### Static Pipelining

Depending on the MoE settings and scales, the optimal pipelining strategy consists of various All-to-All algorithms (Linear or 2DH) and pipelining degrees.

## ADAPTIVE MOE WITH TUTEL

### Adaptive Parallelism Switching

The candidate strategies:

* DP
* EP+DP+MP

### DP

A single all-reduce naturally consists of a reduce-scatter followed by an all-gather.

### DP+EP+MP

## IMPLEMENTATION

### Features

* Dynamic Top-ANY MoE Gating
* Dynamic Capacity Factor (see the sketch below)
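
A small sketch of the Dynamic Capacity Factor idea: instead of padding every expert to a static upper bound f, set the per-iteration capacity to the actual worst-case expert load. The static formula shown is the commonly used one and is an assumption here, not a quote from the paper:

```python
import math
import torch

def expert_capacity(top1_expert, num_experts, top_k=1, f=None):
    """top1_expert: [tokens] index of the expert chosen for each token."""
    tokens = top1_expert.numel()
    if f is not None:
        # Static capacity: a fixed multiple f of the average per-expert load.
        return math.ceil(f * top_k * tokens / num_experts)
    # Dynamic capacity: follow this iteration's actual maximum load, so nothing
    # is dropped and no memory is wasted on unused padding.
    return int(torch.bincount(top1_expert, minlength=num_experts).max())

routing = torch.randint(0, 8, (4096,))            # toy routing decisions for 8 experts
print(expert_capacity(routing, 8, f=1.25))        # static upper bound
print(expert_capacity(routing, 8))                # dynamic, recomputed every iteration
```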

My understanding and summary: Tutel uses a single data layout that is compatible with every candidate optimal strategy, so switching the parallelism strategy requires no data migration. By partitioning the data into chunks, All-to-All communication is overlapped with expert computation, which forms the pipeline (a sketch of this overlap follows). The performance of the various parameter settings is stored in a dictionary, and looking it up during training provides the adaptivity.
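
A sketch of that overlap: split the dispatched tokens into chunks, launch all the All-to-All exchanges asynchronously, and run the expert FFN on each chunk as soon as it arrives, so the communication of later chunks hides behind computation. It assumes a torchrun launch with one GPU per rank; the layout and the expert_ffn placeholder are illustrative, not Tutel's actual kernels:

```python
import torch
import torch.distributed as dist

def pipelined_moe_dispatch(tokens, expert_ffn, num_chunks=2):
    """tokens: [world_size, capacity, d_model]; row block w is destined for rank w."""
    recvs, works = [], []
    for chunk in tokens.chunk(num_chunks, dim=1):        # split along the capacity dim
        chunk = chunk.contiguous()
        recv = torch.empty_like(chunk)
        works.append(dist.all_to_all_single(recv, chunk, async_op=True))
        recvs.append(recv)
    outputs = []
    for work, recv in zip(works, recvs):
        work.wait()                                      # chunk i has arrived
        outputs.append(expert_ffn(recv))                 # overlaps with chunks still in flight
    return torch.cat(outputs, dim=1)

if __name__ == "__main__":
    dist.init_process_group("nccl")
    torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())
    x = torch.randn(dist.get_world_size(), 8, 512, device="cuda")
    y = pipelined_moe_dispatch(x, expert_ffn=torch.nn.Identity().cuda())
    print(dist.get_rank(), y.shape)
    dist.destroy_process_group()
```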

# DeepSpeed-MoE: Advancing Mixture-of-Experts Inference and Training to Power Next-Generation AI Scale

## Abstract

DeepSpeed-MoE: novel MoE architecture designs and model compression techniques, plus a highly optimized inference system.

## Introduction

A set of challenges:

* Limited scope: MoE-based models in NLP are primarily limited to encoder-decoder models and sequence-to-sequence tasks.
* Massive memory requirements: a significantly larger number of parameters is needed.
* Limited inference performance: on one hand, the larger parameter size requires more GPUs to fit the model; on the other hand, inference is often memory-bandwidth bound.

Three corresponding solutions:

* Expand the scope of MoE-based models.
* Improve the parameter efficiency of MoE-based models: PR-MoE and MoS.
* Develop the DeepSpeed-MoE inference system.

## PR-MoE and MoS: Reducing the Model Size and Improving Parameter Efficiency

  • Observation 1: deeper layers benefit more from a large number of experts.
  • Observation 2: the generalization performance of Top2-MoE and Residual-MoE is on par with each other (a sketch of the residual combination follows this list).
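
A minimal sketch of the Residual-MoE idea behind Observation 2: a fixed dense MLP handles every token and a single routed expert acts as a correction on top of it, so only top-1 routing is needed. Names and sizes are illustrative:

```python
import torch
import torch.nn as nn

class ResidualMoE(nn.Module):
    """Dense MLP + one routed expert, as an alternative to top-2 routing."""
    def __init__(self, d_model=512, d_ff=2048, num_experts=8):
        super().__init__()
        self.dense = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.router = nn.Linear(d_model, num_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        )

    def forward(self, x):                               # x: [tokens, d_model]
        gates = self.router(x).softmax(dim=-1)
        top1 = gates.argmax(dim=-1)                     # one expert per token
        routed = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            mask = top1 == e
            if mask.any():
                routed[mask] = gates[mask, e, None] * expert(x[mask])
        return self.dense(x) + routed                   # dense output + routed correction

print(ResidualMoE()(torch.randn(4, 512)).shape)         # torch.Size([4, 512])
```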

-> PR-MoE (Pyramid-Residual MoE): more experts in the deeper layers, combined with the residual structure above.

### MoS

What is knowledge distillation? Training a smaller student model to match the outputs of a larger teacher model; MoS (Mixture-of-Students) applies staged distillation to shrink the experts.

## DeepSpeed-MoE Inference

MoE inference performance depends on two main factors: the overall model size and the overall achievable memory bandwidth.

### Design of DeepSpeed-MoE Inference System

#### Expert, Tensor and Data Parallelism

#### Hierarchical All-to-All

(a sketch of this idea follows after these subsections)

#### Parallelism Coordinated Communication

#### Kernel Optimizations (?)
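
A sketch of the hierarchical All-to-All idea (the same principle as Tutel's 2DH All-to-All): first an intra-node exchange that regroups chunks by destination local rank, then a single inter-node exchange between ranks that share a local id, so each cross-node message aggregates a whole node's worth of chunks instead of many small ones. This is a reconstruction of the idea under a torchrun launch, not the code of either system:

```python
import os
import torch
import torch.distributed as dist

def make_groups(num_nodes, local_size):
    """new_group is collective: every rank must create every group, keeping only its own."""
    rank, intra_node, cross_node = dist.get_rank(), None, None
    for n in range(num_nodes):
        g = dist.new_group(list(range(n * local_size, (n + 1) * local_size)))
        if rank // local_size == n:
            intra_node = g
    for l in range(local_size):
        g = dist.new_group(list(range(l, num_nodes * local_size, local_size)))
        if rank % local_size == l:
            cross_node = g
    return intra_node, cross_node

def hierarchical_all_to_all(x, num_nodes, local_size, intra_node, cross_node):
    """x: [world, m, d]; x[g] is the chunk destined for global rank g = node * local_size + local."""
    m, d = x.shape[1], x.shape[2]
    # Phase 1 (intra-node): regroup by destination local id and exchange inside the node.
    send1 = x.view(num_nodes, local_size, m, d).transpose(0, 1).contiguous().view(-1, m, d)
    recv1 = torch.empty_like(send1)
    dist.all_to_all_single(recv1, send1, group=intra_node)
    # recv1[i * num_nodes + k] = chunk from local peer i destined for (node k, my local id).
    # Phase 2 (inter-node): regroup by destination node; each message now carries local_size chunks.
    send2 = recv1.view(local_size, num_nodes, m, d).transpose(0, 1).contiguous().view(-1, m, d)
    recv2 = torch.empty_like(send2)
    dist.all_to_all_single(recv2, send2, group=cross_node)
    return recv2                                        # recv2[g] = chunk sent to me by global rank g

if __name__ == "__main__":
    dist.init_process_group("nccl")
    torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))
    local_size = int(os.environ["LOCAL_WORLD_SIZE"])
    num_nodes = dist.get_world_size() // local_size
    intra_node, cross_node = make_groups(num_nodes, local_size)
    x = torch.randn(dist.get_world_size(), 4, 512, device="cuda")
    print(dist.get_rank(), hierarchical_all_to_all(x, num_nodes, local_size, intra_node, cross_node).shape)
    dist.destroy_process_group()
```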

My understanding and summary: the model-side optimizations are PR-MoE and MoS; the inference-side optimizations are distributed partitioning, communication optimization, and kernel optimization.

# GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding

## Abstract

GShard enabled scaling a multilingual neural machine translation Transformer model with Sparsely-Gated Mixture-of-Experts beyond 600 billion parameters using automatic sharding.

## Introduction

While the final model quality was found to have a power-law relationship with the amount of data, compute, and model size [18, 3], the significant quality gains brought by larger models also come with various practical challenges.

### Practical Challenges for Scaling

* Architecture-specific model parallelism support: users typically need to invest a lot of engineering work, for example migrating the model code to special frameworks.
* Super-linear scaling of computation cost vs. model size.
* Infrastructure scalability for giant model representation: such an increase in graph size would result in an infeasible amount of graph building and compilation time for massive-scale models.

## Design Principles for Efficient Training at Scale

  • Sub-linear scaling: scaling the capacity of RNN-based machine translation and language models by adding Position-wise Sparsely-Gated Mixture-of-Experts (MoE) layers [16] made it possible to achieve state-of-the-art results with sub-linear computation cost.
  • Second, the model description should be separated from the partitioning implementation and optimization.
  • Third, the system infrastructure, including the computation representation and compilation, must scale with thousands of devices for parallel execution.

## Model

### Sparse scaling of the Transformer architecture

We sparsely scale the Transformer with conditional computation by replacing every other feed-forward layer with a Position-wise Mixture-of-Experts (MoE) layer [16], using a variant of top-2 gating in both the encoder and the decoder.

### Position-wise Mixture-of-Experts Layer

x_s is the input token to the MoE layer; w_i and w_o are the input and output projection matrices of the feed-forward layer (an expert).
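
Reconstructed from these definitions (the notation may differ slightly from the paper), the per-token computation of the MoE layer with E experts is:

$$
\mathcal{G}_{s,E} = \mathrm{GATE}(x_s), \qquad
\mathrm{FFN}_e(x_s) = wo_e \cdot \mathrm{ReLU}(wi_e \cdot x_s), \qquad
y_s = \sum_{e=1}^{E} \mathcal{G}_{s,e} \cdot \mathrm{FFN}_e(x_s)
$$

so each token's output is a weighted combination of the experts it is dispatched to, with weights given by the gating function.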

The gating function must satisfy two goals:

  • Balanced load: a better design of the gating function distributes the processing burden more evenly across all experts.
  • Efficiency at scale: we need an efficient parallel implementation of the gating function to leverage many devices.

Mechanisms in the gating function GATE(·) that meet the above requirements (details illustrated in Algorithm 1):

* Expert capacity: to ensure the load is balanced, we enforce that the number of tokens processed by one expert stays below a uniform threshold.
* Local group dispatching: GATE(·) partitions the tokens of a batch into local groups, each with a fraction of the expert capacity; in this way, expert capacity is still enforced and the overall load is balanced.
* Auxiliary loss: if all tokens are assigned to a single expert the loss is high; the more evenly the tokens are distributed (ideally a perfectly even split), the smaller the loss.
* Random routing: if the weight for the 2nd expert is very small, we can simply ignore the 2nd expert to conserve the overall expert capacity.

The detailed procedure:
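
The procedure itself is not written out in these notes; below is a simplified sketch of how the four mechanisms fit together for one local group (top-2 gating with capacity enforcement, an auxiliary balancing loss, and random routing of the 2nd expert). It is a reconstruction under simplifying assumptions, not the paper's Algorithm 1; in particular the auxiliary loss is written in the common "token fraction × mean gate" form:

```python
import torch

def top2_gate(logits, capacity_factor=1.25):
    """logits: [S, E] router outputs for one local group of S tokens and E experts."""
    S, E = logits.shape
    capacity = int(capacity_factor * 2 * S / E)          # uniform per-expert threshold
    gates = logits.softmax(dim=-1)

    # Auxiliary loss: (fraction of tokens whose top-1 is e) x (mean gate of e), summed;
    # it is smallest when tokens are spread perfectly evenly over the experts.
    top1 = gates.argmax(dim=-1)
    frac_tokens = torch.zeros(E).scatter_add_(0, top1, torch.ones(S)) / S
    aux_loss = (frac_tokens * gates.mean(dim=0)).sum() * E

    # Random routing: keep the 2nd expert only with probability proportional to its gate.
    g2, e2 = gates.scatter(1, top1[:, None], 0.0).max(dim=-1)
    keep2 = torch.rand(S) < 2 * g2

    # Expert capacity: tokens beyond an expert's threshold overflow and are dropped.
    dispatch, counts = [], torch.zeros(E, dtype=torch.long)
    for s in range(S):                                   # sequential for clarity, not speed
        for expert, gate, keep in ((top1[s], gates[s, top1[s]], True),
                                   (e2[s], g2[s], bool(keep2[s]))):
            if keep and counts[expert] < capacity:
                dispatch.append((s, int(expert), float(gate)))
                counts[expert] += 1
    return dispatch, aux_loss

dispatch, aux = top2_gate(torch.randn(16, 4))
print(len(dispatch), float(aux))
```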

Summary and understanding: this paper puts the MoE structure into the Transformer model, and achieves load balancing of the gating function by limiting the expert capacity and the number of experts assigned to each token, plus an auxiliary loss.