-
2 gpu comms libraries
- nccl
- address simple patterns of AI Training & now inference
- different pattern support
- data parallelism: all reduce, gather, reduce-scatter (see the all-reduce sketch just below)
- tensor: all reduce, all gather, reduce-scatter
- pipeline: send/recv
- expert: all to all
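a minimal sketch of the data-parallel case using nccl's all-reduce; the function name, sizes, and the assumption that the communicator was already created (e.g. via ncclCommInitRank) are illustrative:

```cuda
#include <nccl.h>
#include <cuda_runtime.h>

// data-parallel gradient sync: every rank contributes its local gradients
// and gets back the sum; nccl picks the algorithm (ring/tree/...) itself
void allreduce_grads(float* d_grads, size_t count,
                     ncclComm_t comm, cudaStream_t stream) {
    // in-place all-reduce enqueued on the stream: the host returns right away
    // and the reduction is ordered with the rest of the stream's work
    ncclAllReduce(d_grads, d_grads, count, ncclFloat, ncclSum, comm, stream);
}
```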
- nvshmem
- different model
- partitioned global address space
- indexing using pointers into all the gpu memory
- partitions imply that it's not a shared memory domain -- different address spaces combined together
- historical api: memcpy with a gpu index
- stream apis
- only model that supports device initiated comms
- can also do device initiated collectives -- have to do a cooperative group launch
- can have everything inside the cuda kernel without having to return to the host (sketch just below)
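a minimal sketch of the pgas / device-initiated model, assuming the job is launched through an nvshmem bootstrap (nvshmrun or an mpi launcher) so nvshmem_init() works; the ring pattern and names are illustrative:

```cuda
#include <cuda_runtime.h>
#include <nvshmem.h>
#include <nvshmemx.h>

// each PE writes its id into a slot on its right neighbor, entirely from
// inside the kernel -- no return to the host just to communicate
__global__ void ring_put(int* sym_buf, int my_pe, int n_pes) {
    if (blockIdx.x == 0 && threadIdx.x == 0) {
        int peer = (my_pe + 1) % n_pes;
        nvshmem_int_p(&sym_buf[my_pe], my_pe, peer);  // one-sided put to peer
        nvshmem_quiet();                              // wait for the put to complete
    }
}

int main() {
    nvshmem_init();                                   // assumes an nvshmem bootstrap
    int my_pe = nvshmem_my_pe();
    int n_pes = nvshmem_n_pes();
    cudaSetDevice(nvshmem_team_my_pe(NVSHMEMX_TEAM_NODE));  // one gpu per PE on the node

    // symmetric allocation: the same pointer is valid on every PE (the pgas part)
    int* sym_buf = (int*) nvshmem_malloc(n_pes * sizeof(int));

    ring_put<<<1, 1>>>(sym_buf, my_pe, n_pes);
    nvshmemx_barrier_all_on_stream(0);                // order all puts before any reads
    cudaDeviceSynchronize();

    nvshmem_free(sym_buf);
    nvshmem_finalize();
    return 0;
}
```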
-
nvshmem python based apis
- the apis were generally getting wrapped in python anyway
- custom comm kernels, fused compute-comm kernels, zero-SM and low-latency collectives, one-sided pt-to-pt comms (see the fused-kernel sketch below)
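a rough sketch of the fused compute-comm idea in the underlying cuda api (the kind of kernel the python bindings end up wrapping); the scaling op, the ring peer, and the per-element puts are illustrative -- real kernels would batch with nvshmem_float_put or block-scoped puts:

```cuda
#include <nvshmem.h>

// fused compute + comm: scale a chunk locally and put each result straight
// into the next PE's symmetric buffer from the same kernel, instead of
// compute kernel -> back to host -> separate comm call
__global__ void scale_and_push(const float* local, float* sym_out, float alpha,
                               int n, int my_pe, int n_pes) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int peer = (my_pe + 1) % n_pes;
    if (i < n) {
        float v = alpha * local[i];              // compute
        nvshmem_float_p(&sym_out[i], v, peer);   // communicate from the same thread
    }
    // a quiet/barrier is still needed before the peer reads sym_out
}
```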
-
history of hpc
- mpi 1993
- portable
- send / recv (could even be done with posix sockets)
- designed when cpus were much faster than networks
- 2-sided comms -- combines sync and data movement, ordering follows the program (sketch after this list)
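the two-sided shape in its smallest form, a sketch (plain mpi, cpu buffers; sizes and tag are arbitrary):

```cuda
#include <mpi.h>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double buf[4] = {0, 1, 2, 3};
    if (rank == 0) {
        // sender describes the input buffer
        MPI_Send(buf, 4, MPI_DOUBLE, 1, /*tag=*/0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        // receiver describes the output buffer; the matched pair is also the sync
        MPI_Recv(buf, 4, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }

    MPI_Finalize();
    return 0;
}
```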
-
shmem was created for the T3D
- distribute load/store network
- one sided communication
- standardized as openshmem
-
2 sided comms
- sender knows input buffer details
- receiver knows output buffer details
- natural sync flow where the input buffer gets written to the output buffer
-
if sync doesn't line up -- can cause sends to block
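a sketch of that failure mode, assuming a message large enough to hit the rendezvous protocol: both ranks send first, so both sends block waiting for receives that are never posted:

```cuda
#include <mpi.h>
#include <vector>

// both ranks send first, then receive; for a message this large mpi falls
// back to a rendezvous protocol, so both MPI_Send calls block waiting for a
// matching receive that is never reached -- deadlock
void exchange_badly(int peer) {
    std::vector<double> out(1 << 24), in(1 << 24);
    MPI_Send(out.data(), (int)out.size(), MPI_DOUBLE, peer, 0, MPI_COMM_WORLD);
    MPI_Recv(in.data(), (int)in.size(), MPI_DOUBLE, peer, 0, MPI_COMM_WORLD,
             MPI_STATUS_IGNORE);
    // fix: MPI_Sendrecv, or MPI_Isend/MPI_Irecv plus MPI_Waitall
}
```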
-
comm patterns are persistent, so we should amortize away setup cost
-
don't want a gpu idling while the host is waiting for something to happen
-
data transmission should go to specialized hardware
-
nccl vs mpi
-
nccl is about 10 years old
-
forward looking: nvshmem
-
mpi
- having collectives as a first class object is important
- an 8-byte vs a 1 GB all reduce is very different in implementation
- abstract it away from the user entirely
- user shouldn't decide between tree, ring or multi-ring (see the allreduce sketch after this list)
- send/recv is a good way to write many algorithms
- the concept, and its host-initiated version, is a safe/general-purpose way to write code
- can look at the code and say what's going to be fast/slow
- single memory space can be a bad idea -- can accidentally do reads across the network
- design of collectives
- consistency
- have never broken backwards compat
- standard ABI
- make the usability high
- has too many goodies for gpu execution
- great for structured programs
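a sketch of what "collectives as a first-class object" looks like at the call site: the same call covers 8 bytes or 1 gb, and the library, not the user, picks the algorithm underneath (in-place variant shown; the function name is illustrative):

```cuda
#include <mpi.h>

// same call whether count is 1 or 100 million: the library chooses the
// algorithm from message size, topology, etc. -- the user never picks
// ring vs tree vs multi-ring
void sum_everywhere(double* data, int count) {
    MPI_Allreduce(MPI_IN_PLACE, data, count, MPI_DOUBLE, MPI_SUM,
                  MPI_COMM_WORLD);
}
```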
-
mpi vs nccl
- allows underflow: the recv can be much larger than the send (see the status sketch after this list)
- only the sender can do protocol selection
- mpi datatypes allow anything
- supports tags, wildcards, multiple ranks per gpu
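a small sketch of underflow plus wildcards: the receive posts more room than will arrive, MPI_ANY_SOURCE / MPI_ANY_TAG match anything, and MPI_Get_count on the status reports what actually came (buffer size is arbitrary):

```cuda
#include <mpi.h>

// the receive posts room for 100 ints, but the sender may send fewer
// ("underflow" is allowed); wildcards match any source/tag, and the status
// reports what actually arrived
void flexible_recv(int* buf /* room for 100 ints */) {
    MPI_Status status;
    MPI_Recv(buf, 100, MPI_INT, MPI_ANY_SOURCE, MPI_ANY_TAG,
             MPI_COMM_WORLD, &status);

    int received = 0;
    MPI_Get_count(&status, MPI_INT, &received);   // actual element count
    // status.MPI_SOURCE and status.MPI_TAG say who sent it and with which tag
}
```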
-
- nccl does allow multiple ranks per process
-
matrix transpose
- do things that are invertible to be able to easily test
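the invertibility trick in its simplest form, a sketch on a single host buffer (a distributed version would test the same property across ranks):

```cuda
#include <cassert>
#include <vector>

// transpose(transpose(A)) == A, so the test needs no reference output --
// useful when the transpose itself is distributed across gpus/ranks
void transpose(const std::vector<float>& in, std::vector<float>& out, int n) {
    for (int i = 0; i < n; ++i)
        for (int j = 0; j < n; ++j)
            out[j * n + i] = in[i * n + j];
}

void self_test(int n) {
    std::vector<float> a(n * n), t(n * n), back(n * n);
    for (int i = 0; i < n * n; ++i) a[i] = (float)i;
    transpose(a, t, n);
    transpose(t, back, n);
    for (int i = 0; i < n * n; ++i) assert(back[i] == a[i]);
}
```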
-
dgx-h100: loose upper bounds
- 24 TB/s memory bandwidth
- 3 TB/s nvlink bandwidth
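one rough way to use those peaks as loose bounds (the symbols B_hbm and B_nvlink -- bytes moved through local memory and over nvlink -- are assumptions, not from the notes): each resource caps how fast its share of the bytes can move, so

```latex
% loose roofline-style lower bound on runtime from the two peaks above
t \;\gtrsim\; \max\!\left( \frac{B_{\mathrm{hbm}}}{24\ \mathrm{TB/s}},\ \frac{B_{\mathrm{nvlink}}}{3\ \mathrm{TB/s}} \right)
```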
-
benchmarks -- parallel research kernels / high performance computing
-
github.com/ParRes/Kernels