Working Notes: a commonplace notebook for recording & exploring ideas.
Programming Massively Parallel Processors
- tags
- ['AI', 'GPU']
- author
- Wen-mei W. Hwu
- The most recommended book on CUDA and parallel programming
Ch1
- Single cores are hitting limits on how fast they can go
- Instead we run in parallel
- Increasing speed of single cores costs much more
Ch2
Because I always get lost about this:
- when host code calls a kernel, the runtime launches a
grid
of threads
- the grid is organized into a 2-level hierarchy:
thread blocks
== equally sized blocks of threads
threads
== the individual execution units within each block
- threads per block is defined by host code
blockDim
== number of threads in a block, up to 3d
- preferably keep this as a multiple of 32
gridDim
== dimensions of grid
threadIdx
== allow thread identification
blockIdx
== allow block identification in grid
__global__
declares a kernel: executed on the device, launched from host code
__host__
is the default, called from host
__device__
executed on the device, callable only from device code
- can use both to generate multiple versions of the code
- cuda c code -> nvcc -> (host code -> host compiler/linker + device code / PTX -> device JIT)
Ch3 Multidimensional grids and data
kernel<<<dim3 gridDim (number of blocks), dim3 blockDim (threads per block)>>>(args)
dim3
is a special type for this, eg dim3 dimGrid(32, 1, 1)
- sugar to pass a plain integer for a single dimension; the remaining dimensions default to 1
- blocks are limited to 1024 threads
row-major
all elements in a row are adjacent, index by row * width + col
col-major
all elements in a column are adjacent, index by column * height + row
blas
== Basic Linear Algebra Subprograms
- level 1 = vector operations like y = a*x + y
- level 2 = matrix-vector operations, y = a*A*x + b*y, where A is a matrix
- level 3 = matrix-matrix operations, C = a*A*B + b*C, where A, B, C are matrices
- Tiling is easier to see as a picture: computing element # of P = M * N consumes the >>> row of M and the V column of N, so threads computing neighbouring elements of P reread the same rows and columns

        N   xxVxxxxxx
            xxVxxxxxx
            xxVxxxxxx
            xxVxxxxxx

    M   yyyyyyy     P   xxVxxxxxx
        yyyyyyy         xxVxxxxxx
        >>>>>>> ....... xx#xxxxxx
        yyyyyyy         xxVxxxxxx
- each output element is independent, so this maps to one CUDA thread per element
Ch4 Compute arch & scheduling
- Streaming multiprocessors
- has processing units called streaming processors == cuda cores
- share control logic & memory
- SMs have on chip memory
- as well as separate global memory / DRAM
- Threads are assigned to SMs on a block-by-block basis / all threads in a block are in the same SM
__syncthreads()
barrier synchronization function -- blocks until all threads in the block reach that point
- must be executed by all threads in a block, otherwise deadlock
- threads across blocks cannot perform barrier synchronization
- warps are 32 thread units for scheduling in SMs
- partitioned based on threadIdx
- single instruction, multiple thread
- For multidimensional, dimensions are projected into a linearized row major layout
- SM executes all threads in a warp following single-instruction, multiple data model
- one instruction is fetched and executed for all threads in the warp
SIMD
== Single-Instruction Multiple-Data
- this shares the cost of the control hardware across many units, maximizing arithmetic throughput
- control divergence will make the hardware take multiple passes
- volta onwards can execute independently, using independent thread scheduling
- as a consequence, all threads can have different timings
__syncwarp()
is another barrier to make sure the full warp completes execution first
- (having access to this feels like a boundary violation)
- oversubscribe SMs with more warps than they can execute at once, so that some warps can run while others wait for memory to load
- known as latency tolerance
- GPUs have zero-overhead scheduling to swap out warps without adding idle cycles by having state for all warps saved in hardware
- occupancy is the ratio of warps assigned to an SM to the maximum number it supports
- resources for an SM are: registers, shared memory, thread block slots, thread slots
- dynamically partitioned across threads for execution
- examples of not hitting max occupancy
- too many blocks (there's a limit on schedulable blocks per SM), even though we aren't using all the thread slots
- if the SM's maximum thread count is not divisible by the threads per block, the remainder goes unused
- at full occupancy the A100 allows only 32 registers per thread (65,536 registers / 2,048 threads per SM)
- exceeding a resource limit shows up as a performance cliff
- CUDA occupancy calculator in Nsight Compute
- accessing available gpu count
int devCount;
cudaGetDeviceCount(&devCount);
- most new pcs have multiple devices
- get details with
cudaGetDeviceProperties
, filling a cudaDeviceProp
- fields
maxThreadsPerBlock
multiProcessorCount
== number of SMs
clockRate
== clock frequency
maxThreadsDim
== max threads per dimension
maxGridSize
== max blocks per grid dimension
regsPerBlock
== 32-bit registers available per block (matches the per-SM count on many devices)
warpSize
- Running on my laptop
count = 1
Device 0:
prop.name = NVIDIA GeForce GTX 1650 Ti
prop.totalGlobalMem = 3899326464
prop.sharedMemPerBlock = 49152
prop.regsPerBlock = 65536
prop.clockRate = 1485000
prop.warpSize = 32
prop.maxThreadsPerBlock = 1024
prop.multiProcessorCount = 16
prop.integrated = 0
prop.major = 7
prop.minor = 5
prop.maxThreadsDim[0] = 1024
prop.maxThreadsDim[1] = 1024
prop.maxThreadsDim[2] = 64
prop.maxGridSize[0] = 2147483647
prop.maxGridSize[1] = 65535
prop.maxGridSize[2] = 65535
Ch5 Memory Architecture & Data Locality
- memory introduces additional constraints: latency from fetching data
- compute to global memory access ratio == FLOPs performed per byte accessed from global memory
- also known as arithmetic intensity, computational intensity
- eg.
- A100 has peak global memory bandwidth of 1555 GB/s and 19,500 GFLOPS single-precision throughput
- with tensor cores, single-precision throughput rises to 156,000 GFLOPS
- memory bound programs limited by memory bandwidth
- Roofline Model
- assess performance achieved by an application relative to limits of hardware
- plot computational throughput in GFLOPS against computational intensity (FLOP/B)
Computational
throughput
(GFLOPS)
^
|            x
|          x   <- slope = peak bandwidth: throughput = bandwidth * FLOP/B
|        x
|      x.................. <- peak throughput (GFLOPS)
|    x ^
|  x   ^ ridge point: memory bound to the left, compute bound to the right
|x     ^
x---------------------------------------------------->
                     Computational intensity (FLOP/B)
- types of memory for device
- per thread registers
- per thread local memory
- per block shared memory
- per grid global memory
- per grid constant memory
- host can transfer data to & from per grid global/constant memory
- fewer instructions to use registers
- energy consumed for reading from registers is an order of magnitude lower than global memory
- shared memory has longer latency and lower bandwidth than registers
- I wish this was much more concrete with absolute time values -- follow up
- qualifiers
| declaration | memory | scope | lifetime |
|---|---|---|---|
| automatic variables except arrays | register | thread | grid |
| automatic arrays | local | thread | grid |
| __device__ __shared__ int | shared | block | grid |
| __device__ int | global | grid | application |
| __device__ __constant__ int | constant | grid | application |
- constants must be declared outside functions
- can be accessed fast and parallel
- total size of constant memory is limited to 65,536 bytes (64 KB)
- tiling to reduce memory traffic
- partition data into tiles: subsets that can be operated on independently
- reduction in memory traffic is proportional to tile size
- (reducing re-reading the same elements)
- not sure about the tiling implementation: I think the results need to be accumulated, (+= in the code, not =)
- if shared memory is exhausted we'll limit occupancy again
- can dynamically allocate shared memory size by passing a third kernel launch parameter
- this allows the kernel to adjust shared memory usage at runtime
- but apparently only one such array per kernel
// call: the third launch parameter is the shared memory size in bytes
kernel<<<dimGrid, dimBlock, sharedMemBytes>>>
// inside kernel: a single unsized extern array maps onto that allocation
extern __shared__ float Mds_Nds[];
- generally need to do bottleneck analysis
- memory coalescing
- used with tiling to make effective use of memory bandwidth
- dram == small capacitors; reads take 10s of ns
- because of small capacitance that needs to be read
- clock cycle time is sub-ns
- so we parallelize dram reads
- sensors in dram detect bits in consecutive locations
- memory is read in bursts
- ideally all threads read consecutive memory
- sometimes the data is not going to be accessed favourably
- can change how threads map to data
- or change how data is laid out
- or transfer data from global to shared memory with coalesced bursts, then do the irregular accesses out of shared memory
- hiding latency
- drams have banks and channels
- each channel is a memory controller with a bus that connects a set of dram banks to the processor
- typically 1-8 channels per processor, with a large number of banks per channel
- data transfer bandwidth of a bus is defined by width and clock frequency
- double data rate ddr buses perform 2 transfers per cycle (rising/falling edge)
- eg. 64bit DDR bus at 1ghz = 8B * 2 * 1GHz = 16GB/s
- modern cpus: at least 32GB/s, gpus: 256GB/s
- each bank has an array of dram cells, sensing amplifiers for access, and interface to deliver bursts to the bus
- data transfer timing
- latency for decoder to enable cells & cells to share stored charge (much longer)
- latency for data transfer through the bus (shorter)
- given this, a single bank significantly underuses the channel; instead overlap requests to multiple banks so data transfers hide the cell-access latency
- for a ratio R of cell array access latency to data transfer time
- need at least R+1 banks to fully utilize the channel bus
- having more banks also reduces the odds of bank conflict
- when multiple accesses go to the same bank
- size of each cell array is chosen to achieve reasonable latency & manufacturability
- maximizing occupancy means having enough threads making requests in parallel
- interleaved data distribution
- spreads elements across banks across channels
- makes small arrays spread out
- thread coarsening
- partitioning work at too fine a granularity pays the overhead of parallelism
- eg redundant data loading
- only makes sense when eliminating redundant work
- avoid accidentally reducing occupancy
- checklist
- maximize occupancy
- hides pipeline latency
- more parallel memory access
- apply tuning
- enable coalesced global memory access
- fewer pipeline stalls
- less global memory traffic, better burst util
- rearrange data, threads to data, or move to shared first
- minimize control divergence
- higher simd efficiency
- rearrange threads / data layout
- tiling reused data
- fewer stalls
- less global traffic
- place data reused within a block in shared memory
- privatization
- fewer stalls for atomic updates
- less contention
- partial updates to a private copy before updating global copy
- thread coarsening
- less redundant work
- less redundant global memory traffic
- reduce parallelism
- identifying bottlenecks: use a profiler per hardware type
— Kunal