Parallelism
After repeatedly hearing terms like FSDP, Tensor Parallelism, Model
Parallelism, and Pipeline Parallelism, I wanted to write them out in my
own words.
Looking around, there's also work trying to use more heterogeneous
systems.
Collectives
- All Reduce: reduce data across all the ranks and write the result to every rank
- Reduce: reduce data across all ranks but write the result to a single rank
- Broadcast: copy data from one rank to all ranks
- All Gather: gather data from all ranks into a single array available
  on all ranks
- Reduce Scatter: reduce across all ranks and write one chunk of the result to each rank
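In torch.distributed these map directly onto functions; a minimal sketch, assuming a process group has already been initialized (e.g. by launching with torchrun on an nccl or gloo backend; reduce_scatter needs nccl):

```python
import torch
import torch.distributed as dist

def demo_collectives():
    # Assumes dist.init_process_group(...) has already run, e.g. via torchrun.
    rank = dist.get_rank()
    world_size = dist.get_world_size()

    # All Reduce: every rank ends up with the sum across ranks.
    t = torch.ones(4) * rank
    dist.all_reduce(t, op=dist.ReduceOp.SUM)

    # Reduce: only the destination rank (0 here) holds the reduced result.
    t = torch.ones(4) * rank
    dist.reduce(t, dst=0, op=dist.ReduceOp.SUM)

    # Broadcast: rank 0's tensor is copied to every other rank.
    t = torch.arange(4.0) if rank == 0 else torch.zeros(4)
    dist.broadcast(t, src=0)

    # All Gather: every rank receives the full list of per-rank tensors.
    gathered = [torch.zeros(4) for _ in range(world_size)]
    dist.all_gather(gathered, torch.ones(4) * rank)

    # Reduce Scatter: reduce across ranks, then each rank keeps one chunk.
    inputs = [torch.ones(4) * rank for _ in range(world_size)]
    out = torch.zeros(4)
    dist.reduce_scatter(out, inputs, op=dist.ReduceOp.SUM)
```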
Types of Parallelism
Data Parallel
- Split the data across all the trainers, while each trainer keeps a full copy of the model
- After backpropagation, all reduce the gradients so every trainer applies the same update
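A sketch of that gradient all-reduce written out by hand; the model, loss, and optimizer are placeholders, and in practice torch.nn.parallel.DistributedDataParallel does this for you and overlaps the communication with the backward pass:

```python
import torch.distributed as dist
import torch.nn as nn

def data_parallel_step(model: nn.Module, batch, loss_fn, optimizer):
    # Every rank holds an identical copy of the model and sees a
    # different shard of the data (`batch` comes from a per-rank sampler).
    optimizer.zero_grad()
    inputs, targets = batch
    loss = loss_fn(model(inputs), targets)
    loss.backward()

    # Average gradients across ranks so every replica applies the same update.
    world_size = dist.get_world_size()
    for param in model.parameters():
        if param.grad is not None:
            dist.all_reduce(param.grad, op=dist.ReduceOp.SUM)
            param.grad /= world_size

    optimizer.step()
```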
Model Parallel
- Split the model across trainers
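In its simplest form this is just placing different pieces of the model on different devices and moving activations between them; a toy sketch with two GPUs (the layer sizes are made up):

```python
import torch.nn as nn

class TwoDeviceModel(nn.Module):
    # First half of the model lives on cuda:0, second half on cuda:1.
    def __init__(self):
        super().__init__()
        self.part1 = nn.Linear(1024, 4096).to("cuda:0")
        self.part2 = nn.Linear(4096, 1024).to("cuda:1")

    def forward(self, x):
        x = self.part1(x.to("cuda:0"))
        # Move the activations to the device holding the next part.
        return self.part2(x.to("cuda:1"))
```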
Tensor Parallel
- Tensors are split across multiple GPUs, and results of calculations
must constantly be combined
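A hand-rolled sketch of one flavor, column parallelism for a single linear layer: each rank stores a slice of the weight matrix, computes its slice of the output, and the slices are combined with an all-gather. The class name and sizes are illustrative; libraries like Megatron-LM or torch.distributed.tensor.parallel do this properly for whole models.

```python
import torch
import torch.distributed as dist
import torch.nn as nn

class ColumnParallelLinear(nn.Module):
    # Splits the output dimension of a linear layer across ranks.
    def __init__(self, in_features: int, out_features: int):
        super().__init__()
        self.world_size = dist.get_world_size()
        assert out_features % self.world_size == 0
        self.local = nn.Linear(in_features, out_features // self.world_size)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        local_out = self.local(x)
        # Combine the per-rank slices into the full output.
        # NB: plain dist.all_gather is forward-only; training code
        # needs autograd-aware collectives.
        gathered = [torch.empty_like(local_out) for _ in range(self.world_size)]
        dist.all_gather(gathered, local_out)
        return torch.cat(gathered, dim=-1)
```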
Pipeline Parallel
- Split the model into layers, and propagate values across the layers
  (like instructions flowing through a CPU pipeline)
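A rough sketch of the micro-batching that makes this useful: the batch is split into micro-batches that flow through the stages in order. This loop runs sequentially; real implementations (e.g. GPipe-style schedules or PyTorch's pipelining APIs) put each stage on its own device and schedule micro-batches so different stages work concurrently. The stage list and sizes here are placeholders.

```python
import torch
import torch.nn as nn

def pipeline_forward(stages, batch: torch.Tensor, n_microbatches: int = 4):
    # Push each micro-batch through the stages in order; in a real
    # pipeline, stage i would already start on the next micro-batch
    # while stage i+1 handles the current one.
    outputs = []
    for microbatch in batch.chunk(n_microbatches):
        x = microbatch
        for stage in stages:
            x = stage(x)
        outputs.append(x)
    return torch.cat(outputs)
```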
FSDP (Fully Sharded Data Parallel)
- Shard model parameters, gradients, and optimizer states across GPUs; saves
  a lot of memory and can be very convenient.
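A minimal sketch of wrapping a model with PyTorch's FullyShardedDataParallel; the model itself and the assumption of one GPU per rank are placeholders:

```python
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

def build_fsdp_model():
    # Assumes the process group is already initialized (e.g. via torchrun)
    # and each rank owns one GPU.
    torch.cuda.set_device(dist.get_rank())
    model = nn.Sequential(
        nn.Linear(1024, 4096),
        nn.ReLU(),
        nn.Linear(4096, 1024),
    ).cuda()
    # FSDP shards parameters, gradients, and optimizer state across ranks,
    # gathering full parameters only while a wrapped module runs.
    return FSDP(model)
```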
— Kunal