Working Notes: a commonplace notebook for recording & exploring ideas.
CUDA
- Useful references
- Hello World
// Wrap all CUDA calls to surface errors
void e(cudaError_t result) {
  if (result != cudaSuccess) {
    printf("%s\n", cudaGetErrorString(result));
    exit(1);
  }
}

// Sanity checks after running a kernel
e(cudaGetLastError());      // catches launch errors
e(cudaDeviceSynchronize()); // catches asynchronous execution errors
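Put together, a minimal self-contained sketch (the kernel body and launch sizes here are arbitrary):

```cuda
#include <stdio.h>
#include <stdlib.h>

void e(cudaError_t result) {
  if (result != cudaSuccess) {
    printf("%s\n", cudaGetErrorString(result));
    exit(1);
  }
}

__global__ void hello() {
  printf("Hello from thread %d\n", threadIdx.x);
}

int main() {
  hello<<<1, 4>>>();          // 1 block, 4 threads
  e(cudaGetLastError());      // launch errors
  e(cudaDeviceSynchronize()); // execution errors; also flushes printf output
  return 0;
}
```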
// Common functions; the direction is ONE of
// cudaMemcpyHostToDevice or cudaMemcpyDeviceToHost
cudaMemcpy(dest, source, size, direction)
cudaMalloc(&<pointer>, size)
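A typical round trip with these two, reusing the `e()` helper above (sizes are arbitrary):

```cuda
// Sketch: allocate on the device, copy up, copy back.
int h[4] = {1, 2, 3, 4};
int *d;
e(cudaMalloc(&d, sizeof(h)));
e(cudaMemcpy(d, h, sizeof(h), cudaMemcpyHostToDevice));
// ... launch a kernel that reads/writes d ...
e(cudaMemcpy(h, d, sizeof(h), cudaMemcpyDeviceToHost));
e(cudaFree(d));
```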
// Calling a kernel
kernel<<<blocks, threadsPerBlock>>>()
threadIdx.x // index of this thread within its block
blockDim.x  // threads in a block in that direction
gridDim.x   // blocks in the grid in that direction
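These built-ins combine into the usual global index; a sketch (kernel name is arbitrary):

```cuda
__global__ void addOne(int *data, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x; // global thread index
  if (i < n) data[i] += 1; // guard: the last block may be partially full
}
```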
// Shared memory
extern __shared__ int s[]; // size set by the third <<<>>> launch parameter
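A sketch of the dynamic-shared-memory pattern, where the byte count is passed as the third launch-configuration argument (kernel name and sizes are arbitrary):

```cuda
__global__ void useShared(int *in) {
  extern __shared__ int s[]; // sized at launch time
  s[threadIdx.x] = in[threadIdx.x];
  __syncthreads(); // make all writes to s visible to the whole block
}

// Host side: 256 threads, 256 * sizeof(int) bytes of shared memory.
// useShared<<<1, 256, 256 * sizeof(int)>>>(d_in);
```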
- References
- [[Thread Block]]
- https://developer.nvidia.com/blog/cuda-refresher-cuda-programming-model/
- group of threads is called a thread block
- CUDA blocks are grouped into a [[grid]]
- Kernel is executed as a grid of blocks of threads
- each block is executed by one [[Streaming Multiprocessor]] (SM)
- One SM can run several CUDA blocks (depending on resources)
- CUDA limits threads per block to 1024
- =kernel_call<<<numBlocks, threadsPerBlock>>>(...)=
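The usual way to pick =numBlocks= for N elements is to round up, so the last (possibly partial) block still covers the tail; a sketch with arbitrary names:

```cuda
int threadsPerBlock = 256; // must be <= 1024
int numBlocks = (N + threadsPerBlock - 1) / threadsPerBlock; // ceil(N / threadsPerBlock)
kernel_call<<<numBlocks, threadsPerBlock>>>(d_data, N);
```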
- [[Warp]]
- Groups of 32 threads with consecutive thread indexes are bundled into warps
- A warp executes in lockstep on an SM, one thread per CUDA core
- blocks are divided into multiple warps for execution on the cores of an SM
- [[SIMT]] -- single instruction, multiple threads
- multiple threads execute a common instruction on arbitrary data
- each thread accesses its own registers, loads/stores from different addresses
- https://developer.nvidia.com/blog/using-cuda-warp-level-primitives/
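A sketch of the warp-level pattern from that post, summing across one warp with =__shfl_down_sync= (kernel name is arbitrary):

```cuda
// Launch with one warp: warpSum<<<1, 32>>>(d_out);
__global__ void warpSum(int *out) {
  int val = threadIdx.x; // each lane contributes its own index
  // Halve the shuffle distance each step: 16, 8, 4, 2, 1.
  for (int offset = 16; offset > 0; offset /= 2)
    val += __shfl_down_sync(0xffffffff, val, offset);
  if (threadIdx.x == 0) *out = val; // lane 0 ends up with the warp's sum
}
```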
- [[Parallel Reduce]]
- Shared memory
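A sketch of the classic shared-memory tree reduction (single block, power-of-two block size assumed; names are arbitrary):

```cuda
__global__ void reduceSum(int *in, int *out) {
  extern __shared__ int s[];
  int tid = threadIdx.x;
  s[tid] = in[tid];
  __syncthreads();
  // Halve the number of active threads each step.
  for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
    if (tid < stride) s[tid] += s[tid + stride];
    __syncthreads();
  }
  if (tid == 0) *out = s[0];
}

// Launch: reduceSum<<<1, 256, 256 * sizeof(int)>>>(d_in, d_out);
```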
- Printf in CUDA
- A fixed-size output buffer is allocated before kernel launch; output that exceeds it is overwritten
- =cudaDeviceGetLimit(size_t *value, cudaLimitPrintfFifoSize)= to query, =cudaDeviceSetLimit(cudaLimitPrintfFifoSize, size_t value)= to override
- (Couldn't actually get this to behave how I'd hoped)
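A sketch of querying and raising the limit before launching a chatty kernel:

```cuda
size_t size;
cudaDeviceGetLimit(&size, cudaLimitPrintfFifoSize); // current FIFO size in bytes
cudaDeviceSetLimit(cudaLimitPrintfFifoSize, size * 2); // must be set before launch
```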
— Kunal