MoE混合专家模型
什么是混合专家模型?
混合专家模型是一种基于Transformer架构的模型,所以需要先了解Transformer架构
下图详细的展示了transformer架构
混合专家模型主要由两个关键部分组成:
* 稀疏 MoE 层: 这些层代替了传统 Transformer 模型中的前馈网络 (FFN)
层。MoE 层包含若干“专家”(例如 8
个),每个专家本身是一个独立的神经网络。在实际应用中,这些专家通常是前馈网络
(FFN),但它们也可以是更复杂的网络结构,甚至可以是 MoE
层本身,从而形成层级式的 MoE 结构。 * 门控网络或路由:
这个部分用于决定哪些令牌 (token)
被发送到哪个专家。例如,在下图中,“More”这个令牌可能被发送到第二个专家,而“Parameters”这个令牌被发送到第一个专家。有时,一个令牌甚至可以被发送到多个专家。令牌的路由方式是
MoE
使用中的一个关键点,因为路由器由学习的参数组成,并且与网络的其他部分一同进行预训练。
# TUTEL: ADAPTIVE
MIXTURE-OF-EXPERTS AT SCALE ## abstract Based on adaptive
parallelism/pipelining optimization-> Flexible All-to-All,
two-dimensional hierarchical (2DH) All-to-All, fast
encode/decode(contuibution)
training and inference ## introduction dynamic nature of MoE:This
implies that the workload of experts is fundamentally uncertain ###
three methods * adjust paralleism at runtime:large redistribution
overhead and GPU memeory
* load balance loss:harm model arrcuacy * Tutel: dynamically switches
the parallelism strategy at every iteration without any extra overhead
of switching. ## BACKGROUND & MOTIVATION ### Sparsely-gated
Mixture-of-Experts (MoE) ### Dynamic Workload of
MoE f? set f to a static upper bound of capacity factor? ### Static
Parallelism the best parallelism method depends on the workload and
switching between different parallelism methods during runtime would
incur a substantial overhead.
### Static Pipelining
depending on different MoE settings and scales,the corresponding optimal
pipelining strategy consists of various All-to-All algorithms (Linear or
2DH3) and pipelining degrees. ## ADAPTIVE MOE WITH TUTEL ### Adaptive
Parallelism Switching DP
EP+DP+MP ### DP a single all-reduce naturally consists of a
reduce-scatter and an all-gather. ### DP+EP+MP ## IMPLEMENTATION ###
Features * Dynamic Top-ANY MoE Gating * Dynamic Capacity Factor
自我理解和总结:通过单一的适用于所有可能的最优策略数据布局,使得切换并行策略时不需要对数据进行迁移。通过数据划分,实现all2all通信和专家计算的重叠以此实现流水线。将各种参数的性能存在字典里,训练时查字典实现自适应性。
DeepSpeed-MoE: Advancing Mixture-of-Experts Inference and Training to Power Next-Generation AI Scale
Abstract
DeepSpeed-MoE:novel MoE architecture designs and model compression
techniques,a highly optimized inference system.
## Introduction set of challenges: * Limited Scope The scope of MoE
based models in the NLP area is primarily limited to encoder-decoder
models and sequence-to-sequence tasks * Massive Memory Requirements:
need significantly more number of parameters * Limited Inference
Performance:On one hand, the larger parameter size requires more GPUs to
fit. On the other hand, as inference is often memory bandwidth bound
three corresponding solutions: * We expand the scope of MoE based
models * We improve parameter efficiency of MoE based models:PR-MoE
MoS
* We develop DeepSpeed-MoE inference system
PR-MoE and MoS: Reducing the Model Size and Improving Parameter Efficiency
- 现象1:Deeper layers benefit more from large number of experts.
- 现象2:We find out that the generalization performance of these two
(aka Top2-MoE and Residual-MoE) is on-par with each other.
->PR-MoE ### MoS 什么是知识蒸馏? ## DeepSpeed-MoE Inference the MoE inference performance depends on two main factors: the overall model size and the overall achievable memory bandwidth. ### Design of DeepSpeed-MoE Inference System #### Expert, Tensor and Data parallelism #### Hierarchical All-to-all #### Parallelism Coordinated Communication #### Kernel Optimizations ??
理解和总结:模型的优化有PR-MoE和MoS,推理的优化:分布式切分,通信优化,kernel优化。 # GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding ## Abstract GShard enabled us to scale up multilingual neural machine translation Transformer model with Sparsely-Gated Mixture-of-Experts beyond 600 billion parameters using automatic sharding ## Introduction While the final model quality was found to have a power-law relationship with the amount of data, compute and model size [18, 3], the significant quality gains brought by larger models also come with various practical challenges. ### Practical Challenges for Scaling * Architecture-specific model parallelism support: users typically need to invest a lot of engineering work, for example, migrating the model code to special frameworks。 * Super-linear scaling of computation cost vs model size * Infrastructure scalability for giant model representation:Such increase in the graph size would result in an infeasible amount of graph building and compilation time for massive-scale models.
Design Principles for Efficient Training at Scale
- Sub-linear Scaling:Scaling capacity of RNN-based machine translation and language models by adding Position-wise Sparsely Gated Mixture-of-Experts (MoE) layers [ 16] allowed to achieve state-of-the-art results with sublinear computation cost.
- Second, the model description should be separated from the partitioning implementation and optimization.
- he system infrastructure, including the computation representation
and compilation, must scale with thousands of devices for parallel
execution. ## Model
### Sparse scaling of the Transformer architecture we sparsely scale Transformer withconditional computation by replacing every other feed-forward layer with a Position-wise Mixture of Experts (MoE) layer [ 16] with a variant of top-2 gating in both the encoder and the decoder ### Position-wise Mixture-of-Experts Layer xs is the input token to the MoE layer, wi and wo being the input and output projection matrices for the feed-forward layer (an expert).
gating function must satisfy two goals:
- Balanced load: better design of the gating function would distribute processing burden more evenly across all experts.
- Efficiency at scale:we need an efficient parallel implementation of the gating function to leverage many devices.
echanisms in the gating function GATE(·) to meet the above
requirements (details illustrated in Algorithm 1): * Expert capacity: To
ensure the load is balanced, we enforce that the number of tokens
processed by one expert is below some uniform threshold * Local group
dispatching GATE(·):in this way, we can ensure that expert capacity is
still enforced and the overall load is balanced. * Auxiliary
loss:如果把token都分给一个人,loss就很高,分的越均匀(最好是彻底均分),loss越小
* Random routing: if the weight for the 2nd expert is very small, we can
simply ignore the 2nd expert to conserve the overall expert
capacity.
具体过程:
总结和理解:这篇论文把MoE结构放入Transformer模型中,通过限制专家容量,每个token分配的专家数量和一个额外的loss实现门控函数的负载均衡。