- 2 GPU comms libraries
    - NCCL
        - addresses the simple communication patterns of AI training (and now inference)
        - supports the different parallelism patterns (see the all-reduce sketch after this list):
            - data parallelism: all-reduce, all-gather, reduce-scatter
            - tensor parallelism: all-reduce, all-gather, reduce-scatter
            - pipeline parallelism: send/recv
            - expert parallelism: all-to-all
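
A minimal sketch of what host-initiated NCCL usage looks like, assuming a single process driving every visible GPU; the buffer size, float type, and sum reduction are illustrative choices, not anything from the talk:

```c
// Single-process NCCL all-reduce across all visible GPUs (sketch: no error handling).
#include <nccl.h>
#include <cuda_runtime.h>
#include <stdlib.h>

int main(void) {
    int ndev = 0;
    cudaGetDeviceCount(&ndev);

    ncclComm_t   *comms   = (ncclComm_t *)malloc(ndev * sizeof(ncclComm_t));
    cudaStream_t *streams = (cudaStream_t *)malloc(ndev * sizeof(cudaStream_t));
    float       **buf     = (float **)malloc(ndev * sizeof(float *));
    const size_t count = 1 << 20;              // 1M floats per GPU (arbitrary)

    // One communicator per GPU, all inside this one process.
    ncclCommInitAll(comms, ndev, NULL);

    for (int i = 0; i < ndev; i++) {
        cudaSetDevice(i);
        cudaStreamCreate(&streams[i]);
        cudaMalloc((void **)&buf[i], count * sizeof(float));
        cudaMemset(buf[i], 0, count * sizeof(float));   // placeholder data
    }

    // The same call covers an 8-byte or a 1 GB reduction; NCCL picks the algorithm.
    ncclGroupStart();
    for (int i = 0; i < ndev; i++) {
        cudaSetDevice(i);
        ncclAllReduce(buf[i], buf[i], count, ncclFloat, ncclSum, comms[i], streams[i]);
    }
    ncclGroupEnd();

    for (int i = 0; i < ndev; i++) {
        cudaSetDevice(i);
        cudaStreamSynchronize(streams[i]);
        cudaFree(buf[i]);
        ncclCommDestroy(comms[i]);
    }
    return 0;
}
```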
 
 
    - NVSHMEM
        - a different model
        - partitioned global address space (PGAS)
        - indexing uses pointers into all of the GPUs' memory
        - "partitioned" implies it's not a single shared-memory domain -- separate address spaces combined together
 
        - historical API: memcpy with a GPU index
        - stream APIs
        - the only model that supports device-initiated comms (see the sketch after this list)
        - can also do device-initiated collectives -- those require a cooperative group launch
        - everything can stay inside the CUDA kernel without returning to the host
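
A minimal sketch of device-initiated, one-sided communication with NVSHMEM, assuming one PE per GPU and a standard NVSHMEM launch (e.g. nvshmrun or an MPI bootstrap); the ring-neighbor pattern is just for illustration:

```c
// Each PE writes its ID into its right neighbor's symmetric buffer from inside a CUDA kernel.
#include <nvshmem.h>
#include <nvshmemx.h>
#include <cuda_runtime.h>
#include <stdio.h>

__global__ void ring_put(int *dst) {
    int mype = nvshmem_my_pe();
    int npes = nvshmem_n_pes();
    int peer = (mype + 1) % npes;
    nvshmem_int_p(dst, mype, peer);   // device-initiated put: no matching recv anywhere
}

int main(void) {
    nvshmem_init();
    int mype = nvshmem_my_pe();
    cudaSetDevice(nvshmem_team_my_pe(NVSHMEMX_TEAM_NODE));   // map node-local PEs to GPUs

    int *dst = (int *)nvshmem_malloc(sizeof(int));            // symmetric allocation on every PE

    ring_put<<<1, 1>>>(dst);
    cudaDeviceSynchronize();
    nvshmem_barrier_all();                                    // all puts have landed

    int got;
    cudaMemcpy(&got, dst, sizeof(int), cudaMemcpyDeviceToHost);
    printf("PE %d received %d\n", mype, got);

    // Device-initiated *collectives* would additionally need a cooperative launch
    // (e.g. nvshmemx_collective_launch) so blocks on all PEs can synchronize.

    nvshmem_free(dst);
    nvshmem_finalize();
    return 0;
}
```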
 
 
- NVSHMEM Python-based APIs
    - the APIs were generally getting wrapped in Python anyway
    - custom comm kernels, fused compute-comm kernels, zero-SM and low-latency collectives, one-sided point-to-point comms
 
- history of HPC
    - MPI, 1993
        - portable
        - send/recv (could even be done over POSIX sockets)
        - designed when CPUs were much faster than networks
        - two-sided comms -- combined sync and data movement, with ordering coming from the program
 
- SHMEM was created for the Cray T3D
    - a distributed load/store network
    - one-sided communication
    - standardized as OpenSHMEM
 
- two-sided comms (see the send/recv sketch after this list)
    - the sender knows the input buffer details
    - the receiver knows the output buffer details
    - natural sync flow where the input buffer gets written to the output buffer
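
A minimal two-sided MPI sketch showing how the matched send/recv pair carries both the data and the synchronization; the ring exchange and the single-int payload are arbitrary choices:

```c
// Two-sided ring exchange: rank r sends to r+1 and receives from r-1.
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int right = (rank + 1) % size;
    int left  = (rank - 1 + size) % size;

    int sendval = rank;   // the sender knows the input buffer
    int recvval = -1;     // the receiver knows the output buffer

    // The combined call avoids the deadlock a blocking MPI_Send could hit
    // if the synchronization didn't line up on both sides.
    MPI_Sendrecv(&sendval, 1, MPI_INT, right, 0,
                 &recvval, 1, MPI_INT, left,  0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    printf("rank %d got %d from rank %d\n", rank, recvval, left);

    MPI_Finalize();
    return 0;
}
```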
 
- if the sync doesn't line up, it can cause sends to block
- comm patterns are persistent, so we should amortize away the setup cost
- don't want a GPU idling while the host is waiting for something to happen
- data transmission should go to specialized hardware
- NCCL vs MPI
- NCCL is at 10 years
- forward looking: NVSHMEM
- MPI
    - having collectives as first-class objects is important
        - an 8-byte vs. a 1 GB all-reduce is very different in implementation
        - abstract that away from the user entirely
        - the user shouldn't have to decide between tree, ring, or multi-ring
 
    - send/recv is a good way to write many algorithms
        - the concept, and the host-initiated version of it, is a safe, general-purpose way to write code
        - you can look at the code and say what's going to be fast or slow
        - a single memory space can be a bad idea -- you can accidentally do reads across the network
 
    - design of the collectives
        - consistency
        - has never broken backwards compatibility
        - standard ABI
        - keep usability high
 
    - has too many goodies for GPU execution
    - great for structured programs
 
- MPI vs NCCL
    - MPI allows underflow: the recv can be much larger than the send
    - only the sender can do protocol selection
    - MPI datatypes allow arbitrary layouts
    - supports tags, wildcards, and multiple ranks per GPU (see the sketch after this list)
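
A small sketch of a few of those MPI features together: a derived datatype describing a strided column, a wildcard receive, and an oversized receive buffer (underflow); the 4x4 matrix, ranks, and tag are arbitrary:

```c
// Rank 0 sends one column of a row-major 4x4 matrix; rank 1 receives with wildcards
// into a buffer larger than what is sent, then asks how much actually arrived.
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    if (size < 2) { MPI_Finalize(); return 0; }

    const int N = 4;

    if (rank == 0) {
        double A[16];
        for (int i = 0; i < 16; i++) A[i] = i;

        MPI_Datatype col;                      // N blocks of 1 double, stride N: one column
        MPI_Type_vector(N, 1, N, MPI_DOUBLE, &col);
        MPI_Type_commit(&col);

        MPI_Send(&A[2], 1, col, 1, 42, MPI_COMM_WORLD);   // column 2: A[2], A[6], A[10], A[14]
        MPI_Type_free(&col);
    } else if (rank == 1) {
        double buf[16];                        // room for 16, only 4 will arrive (underflow is fine)
        MPI_Status st;
        MPI_Recv(buf, 16, MPI_DOUBLE, MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD, &st);

        int got;
        MPI_Get_count(&st, MPI_DOUBLE, &got);
        printf("rank 1 got %d doubles from rank %d (tag %d), first = %g\n",
               got, st.MPI_SOURCE, st.MPI_TAG, buf[0]);
    }

    MPI_Finalize();
    return 0;
}
```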
 
- NCCL does allow multiple ranks per process
- matrix transpose
    - do things that are invertible so they're easy to test (see the sketch below)
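
One way to use that invertibility, sketched on a single GPU: transpose twice and check the round trip reproduces the input (matrix and block sizes are arbitrary; a real benchmark would also time the kernel):

```c
// Naive transpose kernel plus a transpose-twice round-trip check (sketch: no tiling, no error handling).
#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>

__global__ void transpose(const float *in, float *out, int rows, int cols) {
    int c = blockIdx.x * blockDim.x + threadIdx.x;
    int r = blockIdx.y * blockDim.y + threadIdx.y;
    if (r < rows && c < cols)
        out[c * rows + r] = in[r * cols + c];   // row-major in, row-major out
}

int main(void) {
    const int rows = 1024, cols = 512;
    size_t bytes = (size_t)rows * cols * sizeof(float);

    float *h = (float *)malloc(bytes), *h2 = (float *)malloc(bytes);
    for (int i = 0; i < rows * cols; i++) h[i] = (float)i;

    float *a, *b, *c;
    cudaMalloc(&a, bytes); cudaMalloc(&b, bytes); cudaMalloc(&c, bytes);
    cudaMemcpy(a, h, bytes, cudaMemcpyHostToDevice);

    dim3 blk(16, 16);
    dim3 grd1((cols + 15) / 16, (rows + 15) / 16);
    dim3 grd2((rows + 15) / 16, (cols + 15) / 16);
    transpose<<<grd1, blk>>>(a, b, rows, cols);   // rows x cols -> cols x rows
    transpose<<<grd2, blk>>>(b, c, cols, rows);   // and back again

    cudaMemcpy(h2, c, bytes, cudaMemcpyDeviceToHost);

    int ok = 1;
    for (int i = 0; i < rows * cols; i++) if (h[i] != h2[i]) { ok = 0; break; }
    printf("round trip %s\n", ok ? "matches" : "MISMATCH");
    return 0;
}
```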
 
- DGX-H100: loose upper bounds (rough arithmetic below)
    - 24 TB/s memory bandwidth
    - 3 TB/s NVLink bandwidth
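
A rough way to use those ceilings (my arithmetic, not from the talk): for a transpose of V bytes spread over p = 8 GPUs, HBM sees roughly one read plus one write of everything, and about (p-1)/p of V has to cross NVLink, so the bandwidth limits give a loose lower bound on time:

$$ T \;\gtrsim\; \max\!\left(\frac{2V}{24\ \mathrm{TB/s}},\ \frac{(7/8)\,V}{3\ \mathrm{TB/s}}\right) $$

For example, V = 64 GB gives roughly max(5.3 ms, 18.7 ms), so the NVLink term dominates.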
 
- benchmarks -- Parallel Research Kernels / high performance computing
- github.com/ParRes/Kernels