Programming Massively Parallel Processors
- tags: ['AI', 'GPU']
- author: Wen-mei W. Hwu
- The most-recommended book on CUDA and parallel programming
Ch1
- Single cores are hitting limits on how fast they can go
- Instead we run in parallel
- Increasing the speed of a single core costs much more (in power) than adding more cores in parallel
Ch2
Because I always get lost about this:
- when host code calls a kernel, the runtime launches a grid of threads
- the grid is organized into a 2-level hierarchy
- thread blocks == blocks of the same size
- threads within each block
 
- threads per block is defined by host code
- blockDim == number of threads in a block, up to 3D
- preferably keep this as a multiple of 32
 
- gridDim == dimensions of the grid
- threadIdx == identifies a thread within its block
- blockIdx == identifies a block within the grid; combined with threadIdx to compute a global index (see the vecAdd sketch at the end of this chapter)
- __global__ == a kernel; can be called from host (or device) code, runs on the device
- __host__ == the default; runs on the host, called from the host
- __device__ == called from device code, runs on the device
- using __host__ and __device__ together generates both host and device versions of the same function
- CUDA C code -> nvcc -> host code (host compiler/linker) + device code / PTX (JIT-compiled on the device at runtime)
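- Putting Ch2 together, a minimal hedged sketch (hypothetical vecAdd kernel and launch; the names are mine, not the book's listing):
__global__ void vecAdd(const float *A, const float *B, float *C, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // global thread index
    if (i < n)                                       // guard: the grid may be larger than n
        C[i] = A[i] + B[i];
}
// host code: 256 threads per block (a multiple of 32), enough blocks to cover n
int threadsPerBlock = 256;
int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;
vecAdd<<<blocks, threadsPerBlock>>>(d_A, d_B, d_C, n);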
Ch3 Multidimensional grids and data
- <<<dim3 grid dimensions (number of blocks == gridDim), dim3 block dimensions (threads per block == blockDim)>>>
- dim3 is a special type for this, e.g. dim3 dimGrid(32, 1, 1)
- sugar: can pass plain integers/arithmetic for a single dimension via the dim3 constructors
- blocks are limited to 1024 threads
- row-major == all elements in a row are adjacent; index by row * width + col (2D indexing sketch below)
- col-major == all elements in a column are adjacent; index by col * height + row
- BLAS == Basic Linear Algebra Subprograms
- level 1 = vector operations like y = ax + y
- level 2 = matrix-vector operations, y = aAx + by, A is a matrix
- level 3 = matrix-matrix operations, C = aAB + bC; A, B, C are matrices
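- A hedged sketch of a 2D launch with row-major indexing (hypothetical matrixAdd kernel; names are mine):
__global__ void matrixAdd(const float *A, const float *B, float *C, int height, int width) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < height && col < width)
        C[row * width + col] = A[row * width + col] + B[row * width + col];  // row-major index
}
// host code: 16x16 = 256 threads per block, enough blocks to cover the matrix
dim3 dimBlock(16, 16, 1);
dim3 dimGrid((width + 15) / 16, (height + 15) / 16, 1);
matrixAdd<<<dimGrid, dimBlock>>>(d_A, d_B, d_C, height, width);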
 
- Tiling is pretty cool to visualize:
			N xxVxxxxxx
			  xxVxxxxxx
			  xxVxxxxxx
			  xxVxxxxxx
			  xxVxxxxxx
			  xxVxxxxxx
			  xxVxxxxxx
				.
M yyyyyyy   P xxxxxxxxx
  yyyyyyy     xxxxxxxxx
  yyyyyyy     xxxxxxxxx
  >>>>>>>.....xx#xxxxxx
  yyyyyyy     xxxxxxxxx
- can be implemented elementwise in CUDA (naive sketch below)
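- A minimal sketch of the naive kernel (assumes square Width x Width matrices, one thread per element of P):
__global__ void matMulNaive(const float *M, const float *N, float *P, int Width) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < Width && col < Width) {
        float sum = 0.0f;
        for (int k = 0; k < Width; ++k)
            sum += M[row * Width + k] * N[k * Width + col];  // row of M dot column of N
        P[row * Width + col] = sum;
    }
}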
Ch4 Compute arch & scheduling
- Streaming multiprocessors (SMs)
- have processing units called streaming processors == CUDA cores
- which share control logic & memory
 
- SMs have on-chip memory
- as well as separate global memory / DRAM
- threads are assigned to SMs on a block-by-block basis / all threads in a block are on the same SM
- __syncthreads() == barrier synchronization function -- waits till all threads in the block reach that point (sketch below)
- must be executed by all threads in a block, otherwise deadlock
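- A small hedged sketch of why the barrier matters (hypothetical kernel): every thread must finish writing before any thread reads another thread's slot:
__global__ void reverseInBlock(int *data) {
    __shared__ int tile[256];                        // assumes blockDim.x == 256
    int t = threadIdx.x;
    int base = blockIdx.x * blockDim.x;
    tile[t] = data[base + t];
    __syncthreads();                                 // wait until every thread has written
    data[base + t] = tile[blockDim.x - 1 - t];       // now safe to read another thread's value
}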
 
- threads across blocks cannot perform barrier synchronization
- warps are 32-thread units used for scheduling inside SMs
- partitioned based on threadIdx
- single instruction, multiple thread (SIMT)
 
- For multidimensional blocks, thread indices are projected into a linearized row-major order before being partitioned into warps
- the SM executes all threads in a warp following the single-instruction, multiple-data model
- one instruction is fetched and executed for all threads in the warp
 
- SIMD == Single-Instruction, Multiple-Data
- this shares the cost of the control hardware across many units, maximizing arithmetic throughput
- control divergence (threads in a warp taking different branches) makes the hardware take multiple passes (sketched below)
- Volta onwards, threads can execute independently, using independent thread scheduling
- as a consequence, threads in a warp can have different timings
- __syncwarp() is another barrier, to make sure the full warp reconverges before continuing
- (having access to this feels like a boundary violation)
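- A toy sketch of divergence (hypothetical kernel; even and odd lanes take different paths, so the warp serializes the two branches):
__global__ void divergent(int *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        if (threadIdx.x % 2 == 0)
            data[i] += 1;          // pass 1: even lanes active, odd lanes idle
        else
            data[i] -= 1;          // pass 2: odd lanes active, even lanes idle
    }
    __syncwarp();                  // make sure the whole warp has reconverged
}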
 
- oversubscribe SMs with more warps than they can execute at once, so some warps can run while others wait for memory to load
- known as latency tolerance
 
- GPUs have zero-overhead scheduling to swap out warps without adding idle cycles by having state for all warps saved in hardware
- occupancy == the ratio of warps assigned to an SM to the maximum number it supports
- resources of an SM: registers, shared memory, thread block slots, thread slots
- dynamically partitioned across threads for execution
 
- examples of not hitting max occupancy
- blocks that are too small: the per-SM block-slot limit is reached before the thread limit, so thread slots go unused
- the SM's max thread count not being divisible by the block size (the leftover thread slots can't fit another whole block)
- register pressure: e.g. an A100 SM has 65,536 registers for up to 2,048 threads, so full occupancy allows at most 32 registers per thread
- shows up as a performance cliff
- CUDA occupancy calculator (now part of Nsight Compute)
 
- accessing available gpu count
int devCount;
cudaGetDeviceCount(&devCount);
- most new PCs have multiple devices
- get details with cudaGetDeviceProperties, filling a cudaDeviceProp
- fields
- maxThreadsPerBlock
- multiProcessorCount == number of SMs
- clockRate == clock frequency
- maxThreadsDim == max threads per block dimension
- maxGridSize == max blocks per grid dimension
- regsPerBlock == registers available per block (equals registers per SM on this device)
- warpSize
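- Roughly the program behind the output below (the cudaDeviceProp fields are real; the exact printing is a reconstruction, so treat it as a sketch):
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int count;
    cudaGetDeviceCount(&count);
    printf("count = %d\n", count);
    for (int i = 0; i < count; ++i) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, i);
        printf("Device %d:\n", i);
        printf("prop.name = %s\n", prop.name);
        printf("prop.totalGlobalMem = %zu\n", prop.totalGlobalMem);
        printf("prop.sharedMemPerBlock = %zu\n", prop.sharedMemPerBlock);
        printf("prop.regsPerBlock = %d\n", prop.regsPerBlock);
        printf("prop.clockRate = %d\n", prop.clockRate);
        printf("prop.warpSize = %d\n", prop.warpSize);
        printf("prop.maxThreadsPerBlock = %d\n", prop.maxThreadsPerBlock);
        printf("prop.multiProcessorCount = %d\n", prop.multiProcessorCount);
        printf("prop.integrated = %d\n", prop.integrated);
        printf("prop.major = %d\n", prop.major);
        printf("prop.minor = %d\n", prop.minor);
        for (int j = 0; j < 3; ++j) printf("prop.maxThreadsDim[j] = %d\n", prop.maxThreadsDim[j]);
        for (int j = 0; j < 3; ++j) printf("prop.maxGridSize[j] = %d\n", prop.maxGridSize[j]);
    }
    return 0;
}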
 
 
- Running on my laptop
count = 1
Device 0:
prop.name = NVIDIA GeForce GTX 1650 Ti
prop.totalGlobalMem = 3899326464
prop.sharedMemPerBlock = 49152
prop.regsPerBlock = 65536
prop.clockRate = 1485000
prop.warpSize = 32
prop.maxThreadsPerBlock = 1024
prop.multiProcessorCount = 16
prop.integrated = 0
prop.major = 7
prop.minor = 5
prop.maxThreadsDim[j] = 1024
prop.maxThreadsDim[j] = 1024
prop.maxThreadsDim[j] = 64
prop.maxGridSize[j] = 2147483647
prop.maxGridSize[j] = 65535
prop.maxGridSize[j] = 65535
Ch5 Memory Architecture & Data Locality
- memory introduces additional constraints: latency from fetching data
- compute to global memory access ratio == FLOPs performed per byte accessed from global memory
- also known as arithmetic intensity or computational intensity
- eg.
- A100 has a peak global memory bandwidth of 1,555 GB/s and 19,500 GFLOPS single-precision peak, so a kernel needs roughly 19,500 / 1,555 ≈ 12.5 FLOP per byte to become compute bound
- Tensor cores push single precision to 156,000 GFLOPS
 
- memory bound programs limited by memory bandwidth
- Roofline Model
- assess performance achieved by an application relative to limits of hardware
- plot computational throughput (GFLOPS) against computational intensity (FLOP/B)
 
Computational
Throughput
 ^
 |
 |   Peak bandwidth * FLOP/B
 |         x
 |        x
 |       x              Peak throughput (GFLOPs)
 |      x........................
 |     x^
 |    x ^
 |   x  ^
 |  x   ^
 | x    ^
 |x     ^
 x---------------------------------------------------->
   Computational intensity
- types of memory for device
- per thread registers
- per thread local memory
- per block shared memory
- per grid global memory
- per grid constant memory
 
- host can transfer data to & from per grid global/constant memory
- fewer instructions are needed when operands are already in registers
- energy consumed for reading from registers is an order of magnitude lower than global memory
- shared memory has longer latency and lower bandwidth than registers (but is far cheaper than global memory)
- I wish this was much more concrete with absolute time values -- follow up
- qualifiers
| declaration                         | memory   | scope  | lifetime    |
|-------------------------------------|----------|--------|-------------|
| automatic variables (except arrays) | register | thread | grid        |
| automatic arrays                    | local    | thread | grid        |
| __device__ __shared__ int           | shared   | block  | grid        |
| __device__ int                      | global   | grid   | application |
| __device__ __constant__ int         | constant | grid   | application |
- constants must be declared outside functions, at file scope (sketch below)
- can be accessed quickly and in parallel
- total size of constant memory is limited to 65,536 bytes (64 KB)
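- A minimal sketch of declaring and filling constant memory (the filter name and size are made up):
#define FILTER_SIZE 5
__constant__ float d_filter[FILTER_SIZE];            // file scope, lives in constant memory

void uploadFilter(const float *h_filter) {
    // copy host data into device constant memory
    cudaMemcpyToSymbol(d_filter, h_filter, FILTER_SIZE * sizeof(float));
}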
 
- tiling to reduce memory traffic
- partition data into tile subsets that can be operated on independently
 
- reduction in memory traffic is proportional to tile size
- (reducing re-reading the same elements)
 
- not sure about the tiling implementation: I think the results need to be accumulated (+= in the code, not =) -- the partial products from each tile phase do need to accumulate; see the sketch after the next snippet
- if shared memory is exhausted we'll limit occupancy again
- can size shared memory dynamically by passing an extra launch parameter
- this allows the kernel to adjust shared memory usage at runtime
- but apparently only one such array
 
// call: third launch parameter == shared memory bytes per block
kernel<<<dimGrid, dimBlock, sharedMemBytes>>>(/* args */);
// inside kernel: a single unsized extern array
extern __shared__ float Mds_Nds[];
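- A sketch of the tiled multiply itself, close to the book's version (assumes square Width x Width matrices with Width a multiple of TILE_WIDTH), showing the += accumulation across phases:
#define TILE_WIDTH 16
__global__ void matMulTiled(const float *M, const float *N, float *P, int Width) {
    __shared__ float Mds[TILE_WIDTH][TILE_WIDTH];
    __shared__ float Nds[TILE_WIDTH][TILE_WIDTH];
    int row = blockIdx.y * TILE_WIDTH + threadIdx.y;
    int col = blockIdx.x * TILE_WIDTH + threadIdx.x;
    float Pvalue = 0.0f;
    for (int ph = 0; ph < Width / TILE_WIDTH; ++ph) {
        // each thread loads one element of the M tile and one of the N tile
        Mds[threadIdx.y][threadIdx.x] = M[row * Width + ph * TILE_WIDTH + threadIdx.x];
        Nds[threadIdx.y][threadIdx.x] = N[(ph * TILE_WIDTH + threadIdx.y) * Width + col];
        __syncthreads();                                      // tile fully loaded
        for (int k = 0; k < TILE_WIDTH; ++k)
            Pvalue += Mds[threadIdx.y][k] * Nds[k][threadIdx.x];  // accumulate (+=) across phases
        __syncthreads();                                      // done with this tile
    }
    P[row * Width + col] = Pvalue;
}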
- generally need to do bottleneck analysis
- memory coalescing
- used together with tiling to make effective use of memory bandwidth
- dram == small capacitors; reads take 10s of ns
- because of small capacitance that needs to be read
 
- clock cycle time is sub-ns
- so we parallelize dram reads
- sensors in dram detect bits in consecutive locations
- memory is read in bursts
- ideally all threads in a warp read consecutive memory locations, so accesses coalesce into bursts
 
 
- sometimes the data is not going to be accessed favourably
- can change how threads map to data
- or change how data is laid out
- or transfer data from global -> shared memory with coalesced bursts, and do the unfavourable access pattern on shared memory instead (sketch below)
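- A toy sketch of coalesced vs. uncoalesced access (hypothetical kernels reading a row-major Width x Width matrix; the launch is assumed to exactly cover it):
// coalesced: consecutive threads in a warp read consecutive addresses (along a row)
__global__ void readRows(const float *A, float *out, int Width) {
    int row = blockIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    out[row * Width + col] = A[row * Width + col];    // one burst serves the whole warp
}
// uncoalesced: consecutive threads stride by Width (walking down a column)
__global__ void readCols(const float *A, float *out, int Width) {
    int col = blockIdx.y;
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    out[col * Width + row] = A[row * Width + col];    // each lane touches a different burst
}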
 
- hiding latency
- drams have banks and channels
- each channel is a memory controller with a bus that connects a set of dram banks to the processor
- typically 1-8 channels per processor, with a large number of banks per channel
- data transfer bandwidth of a bus is defined by width and clock frequency
- double data rate (DDR) buses perform 2 transfers per cycle (rising/falling edge)
- eg. a 64-bit DDR bus at 1 GHz: 8 B * 2 * 1 GHz = 16 GB/s
- modern CPUs: at least 32 GB/s; GPUs: ~256 GB/s
 
- each bank has an array of dram cells, sensing amplifiers for access, and interface to deliver bursts to the bus
- data transfer timing
- latency for the decoder to enable cells & for the cells to share their stored charge (much longer)
- latency for data transfer through the bus (shorter)
- given this, a single bank significantly underuses the channel; instead, overlap requests to multiple banks so the bus stays busy during the long cell-access latency
 
- for a ratio R of cell-array access latency to data transfer time
- need at least R + 1 banks per channel to fully utilize the bus
 
- having more banks also reduces the odds of bank conflict
- when multiple accesses go to the same bank
- size of each cell array is chosen to achieve reasonable latency & manufacturability
 
- maximizing occupancy means having enough threads making requests in parallel
 
- interleaved data distribution
- spreads elements across banks across channels
- makes small arrays spread out
 
- thread coarsening
- parallelizing at too fine a granularity means the price of parallelism gets paid repeatedly
- eg. redundant data loading (see the coarsened sketch after this list)
 
- only makes sense when eliminating redundant work
- avoid accidentally reducing occupancy
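- A minimal sketch of the mechanics via a grid-stride loop (each thread handles several elements and accesses stay coalesced); note that plain vecAdd has no redundant work to eliminate, the book's motivating case is reusing loaded tiles in matmul:
__global__ void vecAddCoarse(const float *A, const float *B, float *C, int n) {
    int stride = gridDim.x * blockDim.x;             // total number of launched threads
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride)
        C[i] = A[i] + B[i];                          // consecutive threads still touch consecutive elements
}
// host: deliberately launch fewer threads than elements
vecAddCoarse<<<64, 256>>>(d_A, d_B, d_C, n);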
 
- checklist
- maximize occupancy
- hides pipeline latency
- more parallel memory access
- tune SM resource usage (threads per block, shared memory per block, registers per thread)
 
- enable coalesced global memory access
- fewer pipeline stalls
- less global memory traffic, better burst util
- rearrange data layout, remap threads to data, or stage through shared memory first
 
- minimize control divergence
- higher simd efficiency
- rearrange threads / data layout
 
- tiling reused data
- fewer stalls
- less global traffic
- place data that is reused within a block in shared memory
 
- privatization
- fewer stalls for atomic updates
- less contention
- apply partial updates to a private copy, then merge into the global copy (sketch below)
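- A hedged sketch of privatization via a per-block histogram in shared memory (256 bins assumed):
__global__ void histPrivatized(const unsigned char *in, int n, unsigned int *hist) {
    __shared__ unsigned int local[256];              // private per-block copy
    for (int b = threadIdx.x; b < 256; b += blockDim.x)
        local[b] = 0;
    __syncthreads();
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += gridDim.x * blockDim.x)
        atomicAdd(&local[in[i]], 1u);                // contention stays within the block
    __syncthreads();
    for (int b = threadIdx.x; b < 256; b += blockDim.x)
        atomicAdd(&hist[b], local[b]);               // one global atomic per bin per block
}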
 
- thread coarsening
- less redundant work
- less redundant global memory traffic
- reduce parallelism
 
 
- identifying bottlenecks: use a profiler per hardware type
— Kunal