- Video
- Really vivid metaphors that stuck with me
    - Broadcast -- like a radio: the same data is sent to everyone
    - Scatter -- like a mailman: a different piece of the data is delivered to each node
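
A tiny host-side sketch of the difference the two metaphors point at (plain C++, no NCCL; ranks are just rows of a vector and the names are mine):

```cpp
#include <cstdio>
#include <vector>

int main() {
    const int num_ranks = 4;
    std::vector<int> root_data = {10, 20, 30, 40};   // data held by the root rank

    // Broadcast: every rank ends up with the full, identical buffer.
    std::vector<std::vector<int>> bcast(num_ranks, root_data);

    // Scatter: rank i receives only piece i of the root buffer.
    std::vector<std::vector<int>> scat(num_ranks);
    for (int r = 0; r < num_ranks; ++r)
        scat[r] = {root_data[r]};

    for (int r = 0; r < num_ranks; ++r)
        std::printf("rank %d: broadcast got %zu ints, scatter got %zu int(s)\n",
                    r, bcast[r].size(), scat[r].size());
}
```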
 
- Actual code for using NCCL natively; follow up -- write and try this out by hand
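
A minimal sketch of driving NCCL natively from a single process that owns all visible GPUs, using ncclCommInitAll and ncclAllReduce; sizes are arbitrary and error checking is omitted to keep it short:

```cpp
#include <cuda_runtime.h>
#include <nccl.h>
#include <cstdio>
#include <vector>

int main() {
    int ndev = 0;
    cudaGetDeviceCount(&ndev);

    const size_t count = 1 << 20;                 // elements per GPU (arbitrary)
    std::vector<ncclComm_t> comms(ndev);
    std::vector<cudaStream_t> streams(ndev);
    std::vector<float*> sendbuf(ndev), recvbuf(ndev);

    // One communicator per visible device, all owned by this one process.
    ncclCommInitAll(comms.data(), ndev, nullptr);

    for (int i = 0; i < ndev; ++i) {
        cudaSetDevice(i);
        cudaMalloc(&sendbuf[i], count * sizeof(float));
        cudaMalloc(&recvbuf[i], count * sizeof(float));
        cudaMemset(sendbuf[i], 1, count * sizeof(float));  // any byte pattern will do here
        cudaStreamCreate(&streams[i]);
    }

    // Group the per-device calls so NCCL launches them as one collective.
    ncclGroupStart();
    for (int i = 0; i < ndev; ++i)
        ncclAllReduce(sendbuf[i], recvbuf[i], count, ncclFloat, ncclSum,
                      comms[i], streams[i]);
    ncclGroupEnd();

    for (int i = 0; i < ndev; ++i) {
        cudaSetDevice(i);
        cudaStreamSynchronize(streams[i]);
        ncclCommDestroy(comms[i]);
    }
    std::printf("all-reduce finished on %d GPUs\n", ndev);
}
```

Builds with something like `nvcc allreduce_demo.cu -lnccl` (file name is made up).
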
- NCCL all reduce
    - can run the collective with different methods (e.g. ring or tree)
    - estimates how long each of these methods would take based on the topology
    - then picks one and actually performs the collective
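
NCCL makes that choice internally, but the documented NCCL_ALGO and NCCL_DEBUG environment variables let you force or log it for experiments; they are read at communicator-init time, so set them beforehand (POSIX setenv here, shell `export` works too):

```cpp
#include <cstdlib>

int main() {
    setenv("NCCL_ALGO", "Ring", 1);    // force the ring algorithm ("Tree" is another option)
    setenv("NCCL_DEBUG", "INFO", 1);   // log what NCCL decides during initialization
    // ... create communicators and run collectives as usual ...
    return 0;
}
```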
 
- all reduce from reduce-scatter + all-gather [diagram in video is much better]
    - a reduce-scatter does the individual reductions, leaving each rank holding one fully reduced chunk
    - an all-gather then takes those reduced chunks and puts them on all ranks
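
The same decomposition written with NCCL's own collectives, from one rank's point of view; this sketch assumes the element count divides evenly across ranks and that the communicator, stream, and device buffers are already set up:

```cpp
#include <cstddef>
#include <cuda_runtime.h>
#include <nccl.h>

// All-reduce built from reduce-scatter + all-gather, for one rank.
// sendbuf/recvbuf are device pointers holding `count` floats each.
void allreduce_via_rs_ag(const float* sendbuf, float* recvbuf, size_t count,
                         int rank, int nranks, ncclComm_t comm,
                         cudaStream_t stream) {
    size_t chunk = count / nranks;   // assumes count % nranks == 0

    // Phase 1: each rank ends up with the fully reduced chunk it "owns".
    ncclReduceScatter(sendbuf, recvbuf + rank * chunk, chunk,
                      ncclFloat, ncclSum, comm, stream);

    // Phase 2: every rank's reduced chunk is copied to all ranks,
    // reassembling the complete reduced buffer everywhere.
    ncclAllGather(recvbuf + rank * chunk, recvbuf, chunk,
                  ncclFloat, comm, stream);
}
```

Passing `recvbuf + rank * chunk` as the all-gather send buffer matches NCCL's documented in-place pattern, so no extra staging buffer is needed.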
 
- ring algorithm running on all GPUs in parallel
    - creates an ncclRing data structure for the collective
    - primitives for send and recv
    - works on chunks of the send buffer
    - grabs a chunk of the buffer on each GPU
    - then pushes that data to the next GPU in the ring
 
    - prims.send actually sends data to other GPUs
    - can rely on direct connections between GPUs (NVLink)
    - can also write data into another GPU's memory directly across the network (InfiniBand, RoCE) -- RDMA
    - NCCL determines which connections to use
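
Not NCCL's own code, but the peer-to-peer capability it probes when picking a transport can be checked directly with the CUDA runtime:

```cpp
#include <cuda_runtime.h>
#include <cstdio>

int main() {
    int ndev = 0;
    cudaGetDeviceCount(&ndev);
    for (int dev = 0; dev < ndev; ++dev) {
        for (int peer = 0; peer < ndev; ++peer) {
            if (dev == peer) continue;
            int can_access = 0;
            // True when `dev` can read/write `peer`'s memory directly,
            // e.g. over NVLink or P2P-capable PCIe.
            cudaDeviceCanAccessPeer(&can_access, dev, peer);
            std::printf("GPU %d -> GPU %d: peer access %s\n",
                        dev, peer, can_access ? "yes" : "no");
        }
    }
}
```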
 
    - each GPU sends a chunk of its send buffer to the next GPU
    - the chunks are offset per rank so that every link is busy at every step
    - keep going around the ring, accumulating the reduction into the appropriate chunk location
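
A host-side simulation of that schedule: plain arrays stand in for GPU buffers and copies stand in for sends. The chunk indexing is the standard ring all-reduce schedule, not NCCL's actual implementation:

```cpp
#include <cstdio>
#include <vector>

int main() {
    const int n = 4;                       // number of "GPUs" in the ring
    // buf[r][c] = rank r's copy of chunk c (one value per chunk for brevity).
    std::vector<std::vector<double>> buf(n, std::vector<double>(n));
    for (int r = 0; r < n; ++r)
        for (int c = 0; c < n; ++c)
            buf[r][c] = r + 1;             // rank r contributes (r + 1) to every chunk

    auto mod = [&](int x) { return ((x % n) + n) % n; };

    // Reduce-scatter phase: after n-1 steps, rank r holds the full sum of chunk (r+1) mod n.
    for (int s = 0; s < n - 1; ++s) {
        std::vector<double> sent(n);
        for (int r = 0; r < n; ++r)
            sent[r] = buf[r][mod(r - s)];               // chunk each rank pushes this step
        for (int r = 0; r < n; ++r)
            buf[r][mod(r - 1 - s)] += sent[mod(r - 1)]; // receive from previous rank, accumulate
    }

    // All-gather phase: circulate the reduced chunks so every rank ends up with all of them.
    for (int s = 0; s < n - 1; ++s) {
        std::vector<double> sent(n);
        for (int r = 0; r < n; ++r)
            sent[r] = buf[r][mod(r + 1 - s)];           // reduced chunk each rank forwards
        for (int r = 0; r < n; ++r)
            buf[r][mod(r - s)] = sent[mod(r - 1)];      // overwrite with the fully reduced value
    }

    // Every entry should now be 1 + 2 + ... + n = n*(n+1)/2.
    for (int r = 0; r < n; ++r)
        std::printf("rank %d: %g %g %g %g\n", r, buf[r][0], buf[r][1], buf[r][2], buf[r][3]);
}
```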
 
- also can use a tree algorithm
- different collective operation primitives to maximize bandwidth
- follow up
    - check an introduction to InfiniBand
 
- can build nccl-tests against NCCL and run the operations across as many GPUs as are available
    - reports the bus bandwidth of each operation
    - goes through the entire NCCL initialization
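
nccl-tests (e.g. its all_reduce_perf binary) reports an algorithm bandwidth, bytes moved divided by time, and a bus bandwidth; for all-reduce its performance notes give the conversion busBw = algBw * 2*(n-1)/n. Quick arithmetic with made-up numbers:

```cpp
#include <cstdio>

int main() {
    const double bytes   = 256.0 * 1024 * 1024;   // hypothetical message size: 256 MiB
    const double seconds = 0.004;                 // hypothetical measured time
    const int    nranks  = 8;

    double alg_bw = bytes / seconds / 1e9;                    // GB/s the caller actually sees
    double bus_bw = alg_bw * 2.0 * (nranks - 1) / nranks;     // nccl-tests' normalized figure

    std::printf("algBw = %.1f GB/s, busBw = %.1f GB/s\n", alg_bw, bus_bw);
}
```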