- 2 GPU comms libraries
    - NCCL
        - addresses the simple communication patterns of AI training (and now inference)
        - supports the different parallelism patterns (see the all-reduce sketch after this list):
            - data parallelism: all-reduce, all-gather, reduce-scatter
            - tensor parallelism: all-reduce, all-gather, reduce-scatter
            - pipeline parallelism: send/recv
            - expert parallelism: all-to-all
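
A minimal sketch of what host-initiated NCCL usage looks like, assuming a single process driving every visible GPU; the buffer size, float type, and sum reduction are illustrative choices, not anything from the talk:

```c
// Single-process NCCL all-reduce across all visible GPUs (sketch: no error handling).
#include <nccl.h>
#include <cuda_runtime.h>
#include <stdlib.h>

int main(void) {
    int ndev = 0;
    cudaGetDeviceCount(&ndev);

    ncclComm_t   *comms   = (ncclComm_t *)malloc(ndev * sizeof(ncclComm_t));
    cudaStream_t *streams = (cudaStream_t *)malloc(ndev * sizeof(cudaStream_t));
    float       **buf     = (float **)malloc(ndev * sizeof(float *));
    const size_t count = 1 << 20;              // 1M floats per GPU (arbitrary)

    // One communicator per GPU, all inside this one process.
    ncclCommInitAll(comms, ndev, NULL);

    for (int i = 0; i < ndev; i++) {
        cudaSetDevice(i);
        cudaStreamCreate(&streams[i]);
        cudaMalloc((void **)&buf[i], count * sizeof(float));
        cudaMemset(buf[i], 0, count * sizeof(float));   // placeholder data
    }

    // The same call covers an 8-byte or a 1 GB reduction; NCCL picks the algorithm.
    ncclGroupStart();
    for (int i = 0; i < ndev; i++) {
        cudaSetDevice(i);
        ncclAllReduce(buf[i], buf[i], count, ncclFloat, ncclSum, comms[i], streams[i]);
    }
    ncclGroupEnd();

    for (int i = 0; i < ndev; i++) {
        cudaSetDevice(i);
        cudaStreamSynchronize(streams[i]);
        cudaFree(buf[i]);
        ncclCommDestroy(comms[i]);
    }
    return 0;
}
```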
 
 
    - NVSHMEM
        - a different model
        - partitioned global address space (PGAS)
        - indexing uses pointers into all of the GPUs' memory
        - "partitioned" implies it's not a single shared-memory domain -- separate address spaces combined together
 
        - historical API: memcpy with a GPU index
        - stream APIs
        - the only model that supports device-initiated comms (see the sketch after this list)
        - can also do device-initiated collectives -- those require a cooperative group launch
        - everything can stay inside the CUDA kernel without returning to the host
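
A minimal sketch of device-initiated, one-sided communication with NVSHMEM, assuming one PE per GPU and a standard NVSHMEM launch (e.g. nvshmrun or an MPI bootstrap); the ring-neighbor pattern is just for illustration:

```c
// Each PE writes its ID into its right neighbor's symmetric buffer from inside a CUDA kernel.
#include <nvshmem.h>
#include <nvshmemx.h>
#include <cuda_runtime.h>
#include <stdio.h>

__global__ void ring_put(int *dst) {
    int mype = nvshmem_my_pe();
    int npes = nvshmem_n_pes();
    int peer = (mype + 1) % npes;
    nvshmem_int_p(dst, mype, peer);   // device-initiated put: no matching recv anywhere
}

int main(void) {
    nvshmem_init();
    int mype = nvshmem_my_pe();
    cudaSetDevice(nvshmem_team_my_pe(NVSHMEMX_TEAM_NODE));   // map node-local PEs to GPUs

    int *dst = (int *)nvshmem_malloc(sizeof(int));            // symmetric allocation on every PE

    ring_put<<<1, 1>>>(dst);
    cudaDeviceSynchronize();
    nvshmem_barrier_all();                                    // all puts have landed

    int got;
    cudaMemcpy(&got, dst, sizeof(int), cudaMemcpyDeviceToHost);
    printf("PE %d received %d\n", mype, got);

    // Device-initiated *collectives* would additionally need a cooperative launch
    // (e.g. nvshmemx_collective_launch) so blocks on all PEs can synchronize.

    nvshmem_free(dst);
    nvshmem_finalize();
    return 0;
}
```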
 
 
- NVSHMEM Python-based APIs
    - the APIs were generally getting wrapped in Python anyway
    - custom comm kernels, fused compute-comm kernels, zero-SM and low-latency collectives, one-sided point-to-point comms
 
- history of HPC
    - MPI, 1993
        - portable
        - send/recv (could even be done over POSIX sockets)
        - designed when CPUs were much faster than networks
        - two-sided comms -- combined sync and data movement, with ordering coming from the program
 
- SHMEM was created for the Cray T3D
    - a distributed load/store network
    - one-sided communication
    - standardized as OpenSHMEM
 
- two-sided comms (see the send/recv sketch after this list)
    - the sender knows the input buffer details
    - the receiver knows the output buffer details
    - natural sync flow where the input buffer gets written to the output buffer
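
A minimal two-sided MPI sketch showing how the matched send/recv pair carries both the data and the synchronization; the ring exchange and the single-int payload are arbitrary choices:

```c
// Two-sided ring exchange: rank r sends to r+1 and receives from r-1.
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int right = (rank + 1) % size;
    int left  = (rank - 1 + size) % size;

    int sendval = rank;   // the sender knows the input buffer
    int recvval = -1;     // the receiver knows the output buffer

    // The combined call avoids the deadlock a blocking MPI_Send could hit
    // if the synchronization didn't line up on both sides.
    MPI_Sendrecv(&sendval, 1, MPI_INT, right, 0,
                 &recvval, 1, MPI_INT, left,  0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    printf("rank %d got %d from rank %d\n", rank, recvval, left);

    MPI_Finalize();
    return 0;
}
```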
 
- if the sync doesn't line up, it can cause sends to block
- comm patterns are persistent, so we should amortize away the setup cost
- don't want a GPU idling while the host is waiting for something to happen
- data transmission should go to specialized hardware
- NCCL vs MPI
- NCCL is at 10 years
- forward looking: NVSHMEM
- MPI
    - having collectives as first-class objects is important
        - an 8-byte vs. a 1 GB all-reduce is very different in implementation
        - abstract that away from the user entirely
        - the user shouldn't have to decide between tree, ring, or multi-ring
 
    - send/recv is a good way to write many algorithms
        - the concept, and the host-initiated version of it, is a safe, general-purpose way to write code
        - you can look at the code and say what's going to be fast or slow
        - a single memory space can be a bad idea -- you can accidentally do reads across the network
 
    - design of the collectives
        - consistency
        - has never broken backwards compatibility
        - standard ABI
        - keep usability high
 
    - has too many goodies for GPU execution
    - great for structured programs
 
- MPI vs NCCL
    - MPI allows underflow: the recv can be much larger than the send
    - only the sender can do protocol selection
    - MPI datatypes allow arbitrary layouts
    - supports tags, wildcards, and multiple ranks per GPU (see the sketch after this list)
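
A small sketch of a few of those MPI features together: a derived datatype describing a strided column, a wildcard receive, and an oversized receive buffer (underflow); the 4x4 matrix, ranks, and tag are arbitrary:

```c
// Rank 0 sends one column of a row-major 4x4 matrix; rank 1 receives with wildcards
// into a buffer larger than what is sent, then asks how much actually arrived.
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    if (size < 2) { MPI_Finalize(); return 0; }

    const int N = 4;

    if (rank == 0) {
        double A[16];
        for (int i = 0; i < 16; i++) A[i] = i;

        MPI_Datatype col;                      // N blocks of 1 double, stride N: one column
        MPI_Type_vector(N, 1, N, MPI_DOUBLE, &col);
        MPI_Type_commit(&col);

        MPI_Send(&A[2], 1, col, 1, 42, MPI_COMM_WORLD);   // column 2: A[2], A[6], A[10], A[14]
        MPI_Type_free(&col);
    } else if (rank == 1) {
        double buf[16];                        // room for 16, only 4 will arrive (underflow is fine)
        MPI_Status st;
        MPI_Recv(buf, 16, MPI_DOUBLE, MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD, &st);

        int got;
        MPI_Get_count(&st, MPI_DOUBLE, &got);
        printf("rank 1 got %d doubles from rank %d (tag %d), first = %g\n",
               got, st.MPI_SOURCE, st.MPI_TAG, buf[0]);
    }

    MPI_Finalize();
    return 0;
}
```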
 
- NCCL does allow multiple ranks per process
- matrix transpose
    - do things that are invertible so they're easy to test (see the sketch below)
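
One way to use that invertibility, sketched on a single GPU: transpose twice and check the round trip reproduces the input (matrix and block sizes are arbitrary; a real benchmark would also time the kernel):

```c
// Naive transpose kernel plus a transpose-twice round-trip check (sketch: no tiling, no error handling).
#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>

__global__ void transpose(const float *in, float *out, int rows, int cols) {
    int c = blockIdx.x * blockDim.x + threadIdx.x;
    int r = blockIdx.y * blockDim.y + threadIdx.y;
    if (r < rows && c < cols)
        out[c * rows + r] = in[r * cols + c];   // row-major in, row-major out
}

int main(void) {
    const int rows = 1024, cols = 512;
    size_t bytes = (size_t)rows * cols * sizeof(float);

    float *h = (float *)malloc(bytes), *h2 = (float *)malloc(bytes);
    for (int i = 0; i < rows * cols; i++) h[i] = (float)i;

    float *a, *b, *c;
    cudaMalloc(&a, bytes); cudaMalloc(&b, bytes); cudaMalloc(&c, bytes);
    cudaMemcpy(a, h, bytes, cudaMemcpyHostToDevice);

    dim3 blk(16, 16);
    dim3 grd1((cols + 15) / 16, (rows + 15) / 16);
    dim3 grd2((rows + 15) / 16, (cols + 15) / 16);
    transpose<<<grd1, blk>>>(a, b, rows, cols);   // rows x cols -> cols x rows
    transpose<<<grd2, blk>>>(b, c, cols, rows);   // and back again

    cudaMemcpy(h2, c, bytes, cudaMemcpyDeviceToHost);

    int ok = 1;
    for (int i = 0; i < rows * cols; i++) if (h[i] != h2[i]) { ok = 0; break; }
    printf("round trip %s\n", ok ? "matches" : "MISMATCH");
    return 0;
}
```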
 
- DGX-H100: loose upper bounds (rough arithmetic below)
    - 24 TB/s memory bandwidth
    - 3 TB/s NVLink bandwidth
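
A rough way to use those ceilings (my arithmetic, not from the talk): for a transpose of V bytes spread over p = 8 GPUs, HBM sees roughly one read plus one write of everything, and about (p-1)/p of V has to cross NVLink, so the bandwidth limits give a loose lower bound on time:

$$ T \;\gtrsim\; \max\!\left(\frac{2V}{24\ \mathrm{TB/s}},\ \frac{(7/8)\,V}{3\ \mathrm{TB/s}}\right) $$

For example, V = 64 GB gives roughly max(5.3 ms, 18.7 ms), so the NVLink term dominates.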
 
- benchmarks -- Parallel Research Kernels / high performance computing
- github.com/ParRes/Kernels