-
Video
-
Really vivid metaphors that stuck with me
- Broadcast -- like a radio, same data sent everywhere
- Scatter -- like a mailman, delivering a different piece of data to each node
-
Actual code for using NCCL natively; follow up -- write and try this out by hand
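As a starting point for that follow-up, a minimal sketch of the native single-process NCCL pattern (device count, buffer size, and stream handling are my assumptions; error checking omitted; needs CUDA + NCCL and a link against -lnccl):

```c
/* Sketch: single-process all-reduce across 4 GPUs with the native NCCL API.
   Assumes 4 local GPUs; error checking omitted for brevity. */
#include <cuda_runtime.h>
#include <nccl.h>

int main(void) {
    const int nDev = 4;               /* assumption: 4 local GPUs */
    const size_t count = 1 << 20;     /* elements per rank */
    int devs[4] = {0, 1, 2, 3};

    ncclComm_t comms[4];
    float *sendbuff[4], *recvbuff[4];
    cudaStream_t streams[4];

    /* one communicator per device, all inside this process */
    ncclCommInitAll(comms, nDev, devs);

    for (int i = 0; i < nDev; i++) {
        cudaSetDevice(i);
        cudaMalloc((void**)&sendbuff[i], count * sizeof(float));
        cudaMalloc((void**)&recvbuff[i], count * sizeof(float));
        cudaStreamCreate(&streams[i]);
    }

    /* group the per-device calls so NCCL can launch them together */
    ncclGroupStart();
    for (int i = 0; i < nDev; i++)
        ncclAllReduce(sendbuff[i], recvbuff[i], count, ncclFloat, ncclSum,
                      comms[i], streams[i]);
    ncclGroupEnd();

    for (int i = 0; i < nDev; i++) {
        cudaSetDevice(i);
        cudaStreamSynchronize(streams[i]);
    }
    for (int i = 0; i < nDev; i++) {
        cudaFree(sendbuff[i]);
        cudaFree(recvbuff[i]);
        ncclCommDestroy(comms[i]);
    }
    return 0;
}
```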
-
NCCL all reduce
- can use several different algorithms
- estimates how long each algorithm would take based on the detected topology
- then actually performs the fastest one
-
all reduce from reduce-scatter + all-gather [diagram in video is much better]
- reduce-scatter does partial reductions so each rank ends up owning one fully reduced chunk
- all-gather then copies those reduced chunks onto all ranks
-
ring algorithm running on all GPUs in parallel
- creates a ncclRing data structure for collectives
- primitives for send and recv
- works on chunks of the send buffer
- grabs a chunk of the buffer on each GPU
- then pushes data to the next GPU
- prims.send actually sends the data to other GPUs
- can rely on direct connections between GPUs (NVLink)
- can write data into remote GPU memory directly over the network (InfiniBand, RoCE) -- RDMA
- determines which connections to use
- each GPU sends a chunk of data from its send buffer to the next GPU
- chunk offsets are staggered per rank so every link stays busy at every step
- keep going around the ring, accumulating the reduction into the appropriate location
-
can also use a tree algorithm (fewer hops, lower latency at scale)
-
different collective operation primitives to maximize bandwidth
-
follow up
- check an introduction to InfiniBand
-
can compile nccl-tests against NCCL and run operations across as many GPUs as available
- reports bus bandwidth for all the operations
- exercises the entire NCCL initialization path
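For reference, the usual nccl-tests workflow looks roughly like this (NCCL install path and GPU count are assumptions; needs CUDA, NCCL, and GPUs to actually run):

```shell
# build nccl-tests against an installed NCCL (NCCL_HOME is an assumed path)
git clone https://github.com/NVIDIA/nccl-tests.git
cd nccl-tests
make NCCL_HOME=/usr/local/nccl

# all-reduce from 8 bytes to 128 MB, doubling each step, on 4 GPUs;
# prints per-size latency, algorithm bandwidth, and bus bandwidth
./build/all_reduce_perf -b 8 -e 128M -f 2 -g 4
```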