- Video
- Really vivid metaphors that stuck with me
    - Broadcast -- like a radio: the same data is sent to everyone
    - Scatter -- like a mailman: a different piece of the data is delivered to each node
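
A tiny host-side sketch of the difference the two metaphors point at (plain C++, no NCCL; ranks are just rows of a vector and the names are mine):

```cpp
#include <cstdio>
#include <vector>

int main() {
    const int num_ranks = 4;
    std::vector<int> root_data = {10, 20, 30, 40};   // data held by the root rank

    // Broadcast: every rank ends up with the full, identical buffer.
    std::vector<std::vector<int>> bcast(num_ranks, root_data);

    // Scatter: rank i receives only piece i of the root buffer.
    std::vector<std::vector<int>> scat(num_ranks);
    for (int r = 0; r < num_ranks; ++r)
        scat[r] = {root_data[r]};

    for (int r = 0; r < num_ranks; ++r)
        std::printf("rank %d: broadcast got %zu ints, scatter got %zu int(s)\n",
                    r, bcast[r].size(), scat[r].size());
}
```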
 
- Actual code for using NCCL natively; follow up -- write and try this out by hand
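
A minimal sketch of driving NCCL natively from a single process that owns all visible GPUs, using ncclCommInitAll and ncclAllReduce; sizes are arbitrary and error checking is omitted to keep it short:

```cpp
#include <cuda_runtime.h>
#include <nccl.h>
#include <cstdio>
#include <vector>

int main() {
    int ndev = 0;
    cudaGetDeviceCount(&ndev);

    const size_t count = 1 << 20;                 // elements per GPU (arbitrary)
    std::vector<ncclComm_t> comms(ndev);
    std::vector<cudaStream_t> streams(ndev);
    std::vector<float*> sendbuf(ndev), recvbuf(ndev);

    // One communicator per visible device, all owned by this one process.
    ncclCommInitAll(comms.data(), ndev, nullptr);

    for (int i = 0; i < ndev; ++i) {
        cudaSetDevice(i);
        cudaMalloc(&sendbuf[i], count * sizeof(float));
        cudaMalloc(&recvbuf[i], count * sizeof(float));
        cudaMemset(sendbuf[i], 1, count * sizeof(float));  // any byte pattern will do here
        cudaStreamCreate(&streams[i]);
    }

    // Group the per-device calls so NCCL launches them as one collective.
    ncclGroupStart();
    for (int i = 0; i < ndev; ++i)
        ncclAllReduce(sendbuf[i], recvbuf[i], count, ncclFloat, ncclSum,
                      comms[i], streams[i]);
    ncclGroupEnd();

    for (int i = 0; i < ndev; ++i) {
        cudaSetDevice(i);
        cudaStreamSynchronize(streams[i]);
        ncclCommDestroy(comms[i]);
    }
    std::printf("all-reduce finished on %d GPUs\n", ndev);
}
```

Builds with something like `nvcc allreduce_demo.cu -lnccl` (file name is made up).
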
- NCCL all reduce
    - can run the collective with different methods (e.g. ring or tree)
    - estimates how long each of these methods would take based on the topology
    - then picks one and actually performs the collective
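
NCCL makes that choice internally, but the documented NCCL_ALGO and NCCL_DEBUG environment variables let you force or log it for experiments; they are read at communicator-init time, so set them beforehand (POSIX setenv here, shell `export` works too):

```cpp
#include <cstdlib>

int main() {
    setenv("NCCL_ALGO", "Ring", 1);    // force the ring algorithm ("Tree" is another option)
    setenv("NCCL_DEBUG", "INFO", 1);   // log what NCCL decides during initialization
    // ... create communicators and run collectives as usual ...
    return 0;
}
```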
 
- all reduce from reduce-scatter + all-gather [diagram in video is much better]
    - a reduce-scatter does the individual reductions, leaving each rank holding one fully reduced chunk
    - an all-gather then takes those reduced chunks and puts them on all ranks
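
The same decomposition written with NCCL's own collectives, from one rank's point of view; this sketch assumes the element count divides evenly across ranks and that the communicator, stream, and device buffers are already set up:

```cpp
#include <cstddef>
#include <cuda_runtime.h>
#include <nccl.h>

// All-reduce built from reduce-scatter + all-gather, for one rank.
// sendbuf/recvbuf are device pointers holding `count` floats each.
void allreduce_via_rs_ag(const float* sendbuf, float* recvbuf, size_t count,
                         int rank, int nranks, ncclComm_t comm,
                         cudaStream_t stream) {
    size_t chunk = count / nranks;   // assumes count % nranks == 0

    // Phase 1: each rank ends up with the fully reduced chunk it "owns".
    ncclReduceScatter(sendbuf, recvbuf + rank * chunk, chunk,
                      ncclFloat, ncclSum, comm, stream);

    // Phase 2: every rank's reduced chunk is copied to all ranks,
    // reassembling the complete reduced buffer everywhere.
    ncclAllGather(recvbuf + rank * chunk, recvbuf, chunk,
                  ncclFloat, comm, stream);
}
```

Passing `recvbuf + rank * chunk` as the all-gather send buffer matches NCCL's documented in-place pattern, so no extra staging buffer is needed.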
 
- ring algorithm running on all GPUs in parallel
    - creates an ncclRing data structure for the collective
    - primitives for send and recv
    - works on chunks of the send buffer
    - grabs a chunk of the buffer on each GPU
    - then pushes that data to the next GPU in the ring
 
    - prims.send actually sends data to other GPUs
    - can rely on direct connections between GPUs (NVLink)
    - can also write data into another GPU's memory directly across the network (InfiniBand, RoCE) -- RDMA
    - NCCL determines which connections to use
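
Not NCCL's own code, but the peer-to-peer capability it probes when picking a transport can be checked directly with the CUDA runtime:

```cpp
#include <cuda_runtime.h>
#include <cstdio>

int main() {
    int ndev = 0;
    cudaGetDeviceCount(&ndev);
    for (int dev = 0; dev < ndev; ++dev) {
        for (int peer = 0; peer < ndev; ++peer) {
            if (dev == peer) continue;
            int can_access = 0;
            // True when `dev` can read/write `peer`'s memory directly,
            // e.g. over NVLink or P2P-capable PCIe.
            cudaDeviceCanAccessPeer(&can_access, dev, peer);
            std::printf("GPU %d -> GPU %d: peer access %s\n",
                        dev, peer, can_access ? "yes" : "no");
        }
    }
}
```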
 
    - each GPU sends a chunk of its send buffer to the next GPU
    - the chunks are offset per rank so that every link is busy at every step
    - keep going around the ring, accumulating the reduction into the appropriate chunk location
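
A host-side simulation of that schedule: plain arrays stand in for GPU buffers and copies stand in for sends. The chunk indexing is the standard ring all-reduce schedule, not NCCL's actual implementation:

```cpp
#include <cstdio>
#include <vector>

int main() {
    const int n = 4;                       // number of "GPUs" in the ring
    // buf[r][c] = rank r's copy of chunk c (one value per chunk for brevity).
    std::vector<std::vector<double>> buf(n, std::vector<double>(n));
    for (int r = 0; r < n; ++r)
        for (int c = 0; c < n; ++c)
            buf[r][c] = r + 1;             // rank r contributes (r + 1) to every chunk

    auto mod = [&](int x) { return ((x % n) + n) % n; };

    // Reduce-scatter phase: after n-1 steps, rank r holds the full sum of chunk (r+1) mod n.
    for (int s = 0; s < n - 1; ++s) {
        std::vector<double> sent(n);
        for (int r = 0; r < n; ++r)
            sent[r] = buf[r][mod(r - s)];               // chunk each rank pushes this step
        for (int r = 0; r < n; ++r)
            buf[r][mod(r - 1 - s)] += sent[mod(r - 1)]; // receive from previous rank, accumulate
    }

    // All-gather phase: circulate the reduced chunks so every rank ends up with all of them.
    for (int s = 0; s < n - 1; ++s) {
        std::vector<double> sent(n);
        for (int r = 0; r < n; ++r)
            sent[r] = buf[r][mod(r + 1 - s)];           // reduced chunk each rank forwards
        for (int r = 0; r < n; ++r)
            buf[r][mod(r - s)] = sent[mod(r - 1)];      // overwrite with the fully reduced value
    }

    // Every entry should now be 1 + 2 + ... + n = n*(n+1)/2.
    for (int r = 0; r < n; ++r)
        std::printf("rank %d: %g %g %g %g\n", r, buf[r][0], buf[r][1], buf[r][2], buf[r][3]);
}
```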
 
- also can use a tree algorithm
- different collective operation primitives to maximize bandwidth
- follow up
    - check an introduction to InfiniBand
 
- can build nccl-tests against NCCL and run the operations across as many GPUs as are available
    - reports the bus bandwidth of each operation
    - goes through the entire NCCL initialization
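
nccl-tests (e.g. its all_reduce_perf binary) reports an algorithm bandwidth, bytes moved divided by time, and a bus bandwidth; for all-reduce its performance notes give the conversion busBw = algBw * 2*(n-1)/n. Quick arithmetic with made-up numbers:

```cpp
#include <cstdio>

int main() {
    const double bytes   = 256.0 * 1024 * 1024;   // hypothetical message size: 256 MiB
    const double seconds = 0.004;                 // hypothetical measured time
    const int    nranks  = 8;

    double alg_bw = bytes / seconds / 1e9;                    // GB/s the caller actually sees
    double bus_bw = alg_bw * 2.0 * (nranks - 1) / nranks;     // nccl-tests' normalized figure

    std::printf("algBw = %.1f GB/s, busBw = %.1f GB/s\n", alg_bw, bus_bw);
}
```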