Working Notes: a commonplace notebook for recording & exploring ideas.
Programming Massively Parallel Processors
- tags
- ['AI', 'GPU']
- author
- Wen-mei W. Hwu
- The most recommended book on CUDA and parallel programming
Ch1
- Single cores are hitting limits on how fast they can go
- Instead we run in parallel
- Increasing speed of single cores costs much more
Ch2
Because I always get lost about this:
- when host code calls a kernel, the runtime launches a
grid
of threads
- the grid is organized into a 2-level hierarchy:
thread blocks
== equally sized blocks of threads
threads
== the individual execution units within each block
- threads per block is defined by host code
blockDim
== number of threads in a block, up to 3d
- preferably keep this as a multiple of 32
gridDim
== dimensions of grid
threadIdx
== allow thread identification
blockIdx
== allow block identification in grid
__global__
declares a kernel: executed on the device, launched from host code
__host__
is the default, called from host
__device__
executed on the device, callable only from device code
- can use both to generate multiple versions of the code
- cuda c code -> nvcc -> (host code -> host compiler/linker + device code / PTX -> device JIT)
Ch3 Multidimensional grids and data
kernel<<<dim3 gridDim (number of blocks), dim3 blockDim (threads per block)>>>(args)
dim3
is a special type for this, eg dim3 dimGrid(32, 1, 1)
- sugar to pass a plain integer for a single dimension; the remaining dimensions default to 1
- blocks are limited to 1024 threads
row-major
all elements in a row are adjacent, index by row * width + col
col-major
all elements in a column are adjacent, index by column * height + row
blas
== Basic Linear Algebra Subprograms
- level 1 = vector operations like y = a*x + y
- level 2 = matrix-vector operations, y = a*A*x + b*y, where A is a matrix
- level 3 = matrix-matrix operations, C = a*A*B + b*C, where A, B, C are matrices
- Tiling is easier to see as a picture: computing element # of P = M * N consumes the >>> row of M and the V column of N, so threads computing neighbouring elements of P reread the same rows and columns

        N   xxVxxxxxx
            xxVxxxxxx
            xxVxxxxxx
            xxVxxxxxx

    M   yyyyyyy     P   xxVxxxxxx
        yyyyyyy         xxVxxxxxx
        >>>>>>> ....... xx#xxxxxx
        yyyyyyy         xxVxxxxxx
- each output element is independent, so this maps to one CUDA thread per element
Ch4 Compute arch & scheduling
- Streaming multiprocessors
- has processing units called streaming processors == cuda cores
- share control logic & memory
- SMs have on chip memory
- as well as separate global memory / DRAM
- Threads are assigned to SMs on a block-by-block basis / all threads in a block are in the same SM
__syncthreads()
barrier synchronization function -- blocks until all threads in the block reach that point
- must be executed by all threads in a block, otherwise deadlock
- threads across blocks cannot perform barrier synchronization
- warps are 32 thread units for scheduling in SMs
- partitioned based on threadIdx
- single instruction, multiple thread
- For multidimensional, dimensions are projected into a linearized row major layout
- SM executes all threads in a warp following single-instruction, multiple data model
- one instruction is fetched and executed for all threads in the warp
SIMD
== Single-Instruction Multiple-Data
- this shares the cost of the control hardware across many units, maximizing arithmetic throughput
- control divergence will make the hardware take multiple passes
- volta onwards can execute independently, using independent thread scheduling
- as a consequence, all threads can have different timings
__syncwarp()
is another barrier to make sure the full warp completes execution first
- (having access to this feels like a boundary violation)
- oversubscribe SMs with more warps than they can execute at once, so that some warps can run while others wait for memory to load
- known as latency tolerance
- GPUs have zero-overhead scheduling to swap out warps without adding idle cycles by having state for all warps saved in hardware
- occupancy is the ratio of warps assigned to an SM to the maximum number it supports
- resources for an SM are: registers, shared memory, thread block slots, thread slots
- dynamically partitioned across threads for execution
- examples of not hitting max occupancy
- too many blocks (there's a limit on schedulable blocks per SM), even though we aren't using all the thread slots
- if the SM's maximum thread count is not divisible by the threads per block, the remainder goes unused
- at full occupancy the A100 allows only 32 registers per thread (65,536 registers / 2,048 threads per SM)
- exceeding a resource limit shows up as a performance cliff
- CUDA occupancy calculator in Nsight Compute
- accessing available gpu count
int devCount;
cudaGetDeviceCount(&devCount);
- most new pcs have multiple devices
- get details with
cudaGetDeviceProperties
, filling a cudaDeviceProp
- fields
maxThreadsPerBlock
multiProcessorCount
== number of SMs
clockRate
== clock frequency
maxThreadsDim
== max threads per dimension
maxGridSize
== max blocks per grid dimension
regsPerBlock
== 32-bit registers available per block (matches the per-SM count on many devices)
warpSize
- Running on my laptop
count = 1
Device 0:
prop.name = NVIDIA GeForce GTX 1650 Ti
prop.totalGlobalMem = 3899326464
prop.sharedMemPerBlock = 49152
prop.regsPerBlock = 65536
prop.clockRate = 1485000
prop.warpSize = 32
prop.maxThreadsPerBlock = 1024
prop.multiProcessorCount = 16
prop.integrated = 0
prop.major = 7
prop.minor = 5
prop.maxThreadsDim[0] = 1024
prop.maxThreadsDim[1] = 1024
prop.maxThreadsDim[2] = 64
prop.maxGridSize[0] = 2147483647
prop.maxGridSize[1] = 65535
prop.maxGridSize[2] = 65535
Ch5 Memory Architecture & Data Locality
- memory introduces additional constraints: latency from fetching data
- compute to global memory access ratio == FLOPs performed per byte accessed from global memory
- also known as arithmetic intensity, computational intensity
- eg.
- A100 has peak global memory bandwidth of 1555 GB/s and 19,500 GFLOPS single-precision throughput
- with tensor cores, single-precision throughput rises to 156,000 GFLOPS
- memory bound programs limited by memory bandwidth
- Roofline Model
- assess performance achieved by an application relative to limits of hardware
- plot computational throughput in GFLOPS against computational intensity (FLOP/B)
Computational
throughput
(GFLOPS)
^
|            x
|          x   <- slope = peak bandwidth: throughput = bandwidth * FLOP/B
|        x
|      x.................. <- peak throughput (GFLOPS)
|    x ^
|  x   ^ ridge point: memory bound to the left, compute bound to the right
|x     ^
x---------------------------------------------------->
                     Computational intensity (FLOP/B)
- types of memory for device
- per thread registers
- per thread local memory
- per block shared memory
- per grid global memory
- per grid constant memory
- host can transfer data to & from per grid global/constant memory
- fewer instructions to use registers
- energy consumed for reading from registers is an order of magnitude lower than global memory
- shared memory has longer latency and lower bandwidth than registers
- I wish this was much more concrete with absolute time values -- follow up
- qualifiers
| declaration | memory | scope | lifetime |
|---|---|---|---|
| automatic variables except arrays | register | thread | grid |
| automatic arrays | local | thread | grid |
| __device__ __shared__ int | shared | block | grid |
| __device__ int | global | grid | application |
| __device__ __constant__ int | constant | grid | application |
- constants must be declared outside functions
- can be accessed fast and parallel
- total size of constant memory is limited to 65,536 bytes (64 KB)
- tiling to reduce memory traffic
- partition data into tiles: subsets that can be operated on independently
- reduction in memory traffic is proportional to tile size
- (reducing re-reading the same elements)
- not sure about the tiling implementation: I think the results need to be accumulated, (+= in the code, not =)
- if shared memory is exhausted we'll limit occupancy again
- can dynamically allocate shared memory size by passing a third kernel launch parameter
- this allows the kernel to adjust shared memory usage at runtime
- but apparently only one such array per kernel
// call: the third launch parameter is the shared memory size in bytes
kernel<<<dimGrid, dimBlock, sharedMemBytes>>>
// inside kernel: a single unsized extern array maps onto that allocation
extern __shared__ float Mds_Nds[];
- generally need to do bottleneck analysis
- memory coalescing
- used with tiling to make effective use of memory bandwidth
- dram == small capacitors; reads take 10s of ns
- because of small capacitance that needs to be read
- clock cycle time is sub-ns
- so we parallelize dram reads
- sensors in dram detect bits in consecutive locations
- memory is read in bursts
- ideally all threads read consecutive memory
- sometimes the data is not going to be accessed favourably
- can change how threads map to data
- or change how data is laid out
- or transfer data from global to shared memory with coalesced bursts, then do the irregular accesses out of shared memory
- hiding latency
- drams have banks and channels
- each channel is a memory controller with a bus that connects a set of dram banks to the processor
- typically 1-8 channels per processor, with a large number of banks per channel
- data transfer bandwidth of a bus is defined by width and clock frequency
- double data rate ddr buses perform 2 transfers per cycle (rising/falling edge)
- eg. 64bit DDR bus at 1ghz = 8B * 2 * 1GHz = 16GB/s
- modern cpus: at least 32GB/s, gpus: 256GB/s
- each bank has an array of dram cells, sensing amplifiers for access, and interface to deliver bursts to the bus
- data transfer timing
- latency for decoder to enable cells & cells to share stored charge (much longer)
- latency for data transfer through the bus (shorter)
- given this, a single bank significantly underuses the channel; instead overlap requests to multiple banks so data transfers hide the cell-access latency
- for a ratio R of cell array access latency to data transfer time
- need at least R+1 banks to fully utilize the channel bus
- having more banks also reduces the odds of bank conflict
- when multiple accesses go to the same bank
- size of each cell array is chosen to achieve reasonable latency & manufacturability
- maximizing occupancy means having enough threads making requests in parallel
- interleaved data distribution
- spreads elements across banks across channels
- makes small arrays spread out
- thread coarsening
- partitioning work at too fine a granularity pays the overhead of parallelism
- eg redundant data loading
- only makes sense when eliminating redundant work
- avoid accidentally reducing occupancy
- checklist
- maximize occupancy
- hides pipeline latency
- more parallel memory access
- apply tuning
- enable coalesced global memory access
- fewer pipeline stalls
- less global memory traffic, better burst util
- rearrange data, threads to data, or move to shared first
- minimize control divergence
- higher simd efficiency
- rearrange threads / data layout
- tiling reused data
- fewer stalls
- less global traffic
- place data reused within a block in shared memory
- privatization
- fewer stalls for atomic updates
- less contention
- partial updates to a private copy before updating global copy
- thread coarsening
- less redundant work
- less redundant global memory traffic
- reduce parallelism
- identifying bottlenecks: use a profiler per hardware type
— Kunal