Parallelism
After repeatedly hearing terms like FSDP, Tensor Parallelism, Model
Parallelism, and Pipeline Parallelism, I wanted to write them out in my
own words.
Looking around, there's also work trying to use more heterogeneous
systems.
Collectives
- All Reduce: reduce data across all the ranks and write the result to every rank
- Reduce: reduce data across all ranks but write the result to a single rank
- Broadcast: copy data from one rank to all ranks
- All Gather: gather data from all ranks into a single array available
  on all ranks
- Reduce Scatter: reduce across all ranks and write one chunk of the result to each rank
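In torch.distributed these map directly onto functions; a minimal sketch, assuming a process group has already been initialized (e.g. by launching with torchrun on an nccl or gloo backend; reduce_scatter needs nccl):

```python
import torch
import torch.distributed as dist

def demo_collectives():
    # Assumes dist.init_process_group(...) has already run, e.g. via torchrun.
    rank = dist.get_rank()
    world_size = dist.get_world_size()

    # All Reduce: every rank ends up with the sum across ranks.
    t = torch.ones(4) * rank
    dist.all_reduce(t, op=dist.ReduceOp.SUM)

    # Reduce: only the destination rank (0 here) holds the reduced result.
    t = torch.ones(4) * rank
    dist.reduce(t, dst=0, op=dist.ReduceOp.SUM)

    # Broadcast: rank 0's tensor is copied to every other rank.
    t = torch.arange(4.0) if rank == 0 else torch.zeros(4)
    dist.broadcast(t, src=0)

    # All Gather: every rank receives the full list of per-rank tensors.
    gathered = [torch.zeros(4) for _ in range(world_size)]
    dist.all_gather(gathered, torch.ones(4) * rank)

    # Reduce Scatter: reduce across ranks, then each rank keeps one chunk.
    inputs = [torch.ones(4) * rank for _ in range(world_size)]
    out = torch.zeros(4)
    dist.reduce_scatter(out, inputs, op=dist.ReduceOp.SUM)
```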
Types of Parallelism
Data Parallel
- Split the data across all the trainers, while each trainer keeps a full copy of the model
- After backpropagation, all reduce the gradients so every trainer applies the same update
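A sketch of that gradient all-reduce written out by hand; the model, loss, and optimizer are placeholders, and in practice torch.nn.parallel.DistributedDataParallel does this for you and overlaps the communication with the backward pass:

```python
import torch.distributed as dist
import torch.nn as nn

def data_parallel_step(model: nn.Module, batch, loss_fn, optimizer):
    # Every rank holds an identical copy of the model and sees a
    # different shard of the data (`batch` comes from a per-rank sampler).
    optimizer.zero_grad()
    inputs, targets = batch
    loss = loss_fn(model(inputs), targets)
    loss.backward()

    # Average gradients across ranks so every replica applies the same update.
    world_size = dist.get_world_size()
    for param in model.parameters():
        if param.grad is not None:
            dist.all_reduce(param.grad, op=dist.ReduceOp.SUM)
            param.grad /= world_size

    optimizer.step()
```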
Model Parallel
- Split the model across trainers
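In its simplest form this is just placing different pieces of the model on different devices and moving activations between them; a toy sketch with two GPUs (the layer sizes are made up):

```python
import torch.nn as nn

class TwoDeviceModel(nn.Module):
    # First half of the model lives on cuda:0, second half on cuda:1.
    def __init__(self):
        super().__init__()
        self.part1 = nn.Linear(1024, 4096).to("cuda:0")
        self.part2 = nn.Linear(4096, 1024).to("cuda:1")

    def forward(self, x):
        x = self.part1(x.to("cuda:0"))
        # Move the activations to the device holding the next part.
        return self.part2(x.to("cuda:1"))
```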
Tensor Parallel
- Tensors are split across multiple GPUs, and results of calculations
must constantly be combined
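A hand-rolled sketch of one flavor, column parallelism for a single linear layer: each rank stores a slice of the weight matrix, computes its slice of the output, and the slices are combined with an all-gather. The class name and sizes are illustrative; libraries like Megatron-LM or torch.distributed.tensor.parallel do this properly for whole models.

```python
import torch
import torch.distributed as dist
import torch.nn as nn

class ColumnParallelLinear(nn.Module):
    # Splits the output dimension of a linear layer across ranks.
    def __init__(self, in_features: int, out_features: int):
        super().__init__()
        self.world_size = dist.get_world_size()
        assert out_features % self.world_size == 0
        self.local = nn.Linear(in_features, out_features // self.world_size)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        local_out = self.local(x)
        # Combine the per-rank slices into the full output.
        # NB: plain dist.all_gather is forward-only; training code
        # needs autograd-aware collectives.
        gathered = [torch.empty_like(local_out) for _ in range(self.world_size)]
        dist.all_gather(gathered, local_out)
        return torch.cat(gathered, dim=-1)
```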
Pipeline Parallel
- Split the model into layers, and propagate values across the layers
  (like instructions flowing through a CPU pipeline)
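A rough sketch of the micro-batching that makes this useful: the batch is split into micro-batches that flow through the stages in order. This loop runs sequentially; real implementations (e.g. GPipe-style schedules or PyTorch's pipelining APIs) put each stage on its own device and schedule micro-batches so different stages work concurrently. The stage list and sizes here are placeholders.

```python
import torch
import torch.nn as nn

def pipeline_forward(stages, batch: torch.Tensor, n_microbatches: int = 4):
    # Push each micro-batch through the stages in order; in a real
    # pipeline, stage i would already start on the next micro-batch
    # while stage i+1 handles the current one.
    outputs = []
    for microbatch in batch.chunk(n_microbatches):
        x = microbatch
        for stage in stages:
            x = stage(x)
        outputs.append(x)
    return torch.cat(outputs)
```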
FSDP (Fully Sharded Data Parallel)
- Shard model parameters, gradients, and optimizer states across GPUs; saves
  a lot of memory and can be very convenient.
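A minimal sketch of wrapping a model with PyTorch's FullyShardedDataParallel; the model itself and the assumption of one GPU per rank are placeholders:

```python
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

def build_fsdp_model():
    # Assumes the process group is already initialized (e.g. via torchrun)
    # and each rank owns one GPU.
    torch.cuda.set_device(dist.get_rank())
    model = nn.Sequential(
        nn.Linear(1024, 4096),
        nn.ReLU(),
        nn.Linear(4096, 1024),
    ).cuda()
    # FSDP shards parameters, gradients, and optimizer state across ranks,
    # gathering full parameters only while a wrapped module runs.
    return FSDP(model)
```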
— Kunal