[In distributed training, which communication primitives does each parallelism scheme use, and why? For example, DP uses AllReduce] - 拓冰建站

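Parallelism	What is split	Main communication	Why it is needed
DP	Batch	AllReduce	Each GPU processes a different slice of the batch and so computes different gradients; these must be synchronized so every replica applies the same update
TP	Weight	AllReduce / AllGather / ReduceScatter	Each GPU holds only part of the weights or part of the output, so the full result has to be reassembled
PP	Layer	Send / Recv (P2P)	The next layer lives on another GPU, so activations (forward) and gradients (backward) must be passed along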
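ZeRO-1	Optimizer state	AllReduce (or ReduceScatter) on gradients + AllGather on updated parameters	Parameters and gradients are kept in full, so gradients are synchronized as in DP; because each GPU holds only its shard of the optimizer state, it updates only its own parameter partition, which then has to be gathered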
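ZeRO-2	Optimizer state + gradient	ReduceScatter + AllGather (together equivalent to an AllReduce)	Gradients are stored in shards: ReduceScatter leaves each GPU with its reduced gradient shard, and AllGather collects the results afterward
ZeRO-3 / FSDP	Parameter + gradient + optimizer state	AllGather + ReduceScatter	Parameters are sharded too, so the full parameters must be restored before each layer is computed, and gradients are reduced back into shards
Sequence Parallel	Sequence	AllGather + ReduceScatter	Operators such as attention need the full sequence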
Context Parallel	Context (sequence length)	AllGather / AllToAll / ring Send/Recv (implementation-dependent)	Long-context attention needs KV that resides on other GPUs
Expert Parallel (MoE)	Expert	AllToAll	Each token must be sent to the GPU that hosts its assigned expert
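
The relationship between the collectives in the DP and ZeRO rows can be checked with a small single-process simulation. This is only a sketch under stated assumptions: each "rank" is a plain Python list rather than a GPU, the function names are illustrative and not any library's API, and the vector length is assumed to be divisible by the number of ranks. It shows that a ReduceScatter followed by an AllGather leaves every rank with the same data as a single AllReduce.

```python
from typing import List

Vector = List[float]


def all_reduce(per_rank: List[Vector]) -> List[Vector]:
    """Every rank ends up with the element-wise sum over all ranks."""
    total = [sum(vals) for vals in zip(*per_rank)]
    return [list(total) for _ in per_rank]


def reduce_scatter(per_rank: List[Vector]) -> List[Vector]:
    """Rank r ends up with only the r-th chunk of the element-wise sum."""
    world = len(per_rank)
    total = [sum(vals) for vals in zip(*per_rank)]
    chunk = len(total) // world  # assumption: length divisible by world size
    return [total[r * chunk:(r + 1) * chunk] for r in range(world)]


def all_gather(per_rank_chunks: List[Vector]) -> List[Vector]:
    """Every rank ends up with the concatenation of all ranks' chunks."""
    full = [x for chunk in per_rank_chunks for x in chunk]
    return [list(full) for _ in per_rank_chunks]


# Hypothetical gradients computed by two data-parallel ranks.
grads = [
    [1.0, 2.0, 3.0, 4.0],      # rank 0
    [10.0, 20.0, 30.0, 40.0],  # rank 1
]

dp_result = all_reduce(grads)        # DP: every rank holds the full summed gradient
shards = reduce_scatter(grads)       # ZeRO-2 style: each rank holds one reduced shard
zero_result = all_gather(shards)     # gathering the shards restores the full vector
```

In a real sharded setup the optimizer step would run on each shard between the two calls, which is why the split form saves memory while moving roughly the same amount of data.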

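The Expert Parallel row can be illustrated the same way. The sketch below assumes one expert per rank (expert `e` lives on rank `e`) and uses a made-up routing table in place of a real router; `all_to_all` is an illustrative helper, not a library call. Each rank sorts its tokens into per-destination buffers, and AllToAll delivers buffer `dst` of every sender to rank `dst`.

```python
from typing import List


def all_to_all(send: List[List[list]]) -> List[List[list]]:
    """send[src][dst] is what rank src sends to rank dst.

    Returns recv, where recv[dst][src] is what rank dst received from rank src.
    """
    world = len(send)
    return [[send[src][dst] for src in range(world)] for dst in range(world)]


world = 2
# Tokens held by each rank before dispatch (hypothetical data).
tokens = [["a0", "a1", "a2"], ["b0", "b1"]]
# Hypothetical router output: the expert id chosen for each token.
routing = [[1, 0, 1], [0, 0]]

# Each rank groups its tokens by destination rank (= expert id here).
send = [[[] for _ in range(world)] for _ in range(world)]
for src in range(world):
    for tok, expert in zip(tokens[src], routing[src]):
        send[src][expert].append(tok)

# Dispatch: after AllToAll, rank e holds every token routed to expert e.
recv = all_to_all(send)

# Combine: a second AllToAll returns the (processed) tokens to their owners.
returned = all_to_all(recv)
```

Because the amount sent to each destination depends on the router's choices, the per-pair buffer sizes differ, which is why this pattern maps to AllToAll rather than to a reduction.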