Working Notes: a commonplace notebook for recording & exploring ideas.

2024-03-17

I had a week-long vacation, and spent time exploring. These letters were originally published separately, but I ultimately decided to combine them into a single page just to maintain consistency.

XDG-Desktop-Portal

I spent a large part of today debugging my laptop setup, and learning more about Flatpak and xdg-desktop-portal than I would have liked to. The short of it was that I couldn't get file open dialogs to work in Chrome -- and I couldn't get configuration-based fallbacks to work by updating .config/xdg-desktop-portal/portals.conf. The solution ended up being to directly modify the configuration at /usr/share/xdg-desktop-portal/portals/gtk.portal and add sway to it.
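The change was roughly the following; this is an abridged sketch, and the DBusName and Interfaces lines here are from memory rather than copied from the actual file:

[portal]
DBusName=org.freedesktop.impl.portal.desktop.gtk
Interfaces=org.freedesktop.impl.portal.FileChooser;org.freedesktop.impl.portal.Settings
# Appending sway here is what fixed the dialogs:
UseIn=gnome;sway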

Working through this always makes me wonder if it's worth the time to use a Linux laptop, but in the end I'd rather deepen my knowledge of Linux instead of wrestling with macOS or Windows. I wouldn't mind more powerful ChromeOS laptops, though.

[Edit: 2024-03-24] This has still been plaguing me; there seems to be some bug that doesn't manifest immediately after a restart, but only after putting the laptop to sleep and waking it again.

Revisiting CUDA

Spending some time reading about CUDA and multiprocessing today; I inevitably forget what an SM or a warp is, just because I don't get enough of a chance to use them daily. So, some definitions:

SM (Streaming Multiprocessor): the hardware unit on the GPU that actually executes thread blocks; a GPU consists of many SMs, and each thread block is scheduled onto exactly one of them.

Warp: a group of 32 threads that an SM schedules and executes together in lockstep; divergent branches within a warp are serialized.
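A quick way to see these numbers for whatever GPU is attached, using only the standard cudaGetDeviceProperties call:

#include <cuda_runtime.h>
#include <stdio.h>

int main(void) {
  cudaDeviceProp props;
  // Query device 0; a real program would check the returned cudaError_t.
  cudaGetDeviceProperties(&props, 0);
  printf("SMs: %d, warp size: %d, max threads/block: %d\n",
         props.multiProcessorCount, props.warpSize,
         props.maxThreadsPerBlock);
  return 0;
}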

Compute Optimal Language Models

Notes from DeepMind's paper, Training Compute-Optimal Large Language Models (the Chinchilla paper):
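From memory (worth re-checking against the paper itself), the headline result is that for a fixed compute budget the parameter count and the number of training tokens should be scaled in equal proportion, which works out to roughly 20 tokens per parameter:

% Chinchilla's compute-optimal scaling, from memory -- verify against the paper:
% for a compute budget C, parameters N and tokens D should both grow as sqrt(C),
% which works out to roughly 20 training tokens per parameter.
N_{\mathrm{opt}}(C) \propto C^{0.5}, \qquad
D_{\mathrm{opt}}(C) \propto C^{0.5}, \qquad
D_{\mathrm{opt}} \approx 20\,N_{\mathrm{opt}}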

Visualizing a model

I really want to be able to easily look at a full model's definition without needing to read and hand-annotate code: the modules, the dimensions passed in and handled, etc. Intermediate Logging got close to it with the way it worked, but I still want to play with more sophisticated visualizations.

Linker Visibility

I stumbled across this detailed answer on Stack Overflow.

I'm exploring this because I'm curious whether I can combine native Python extensions that use different versions of the same library in the same process.
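The usual ELF-level tool here is symbol visibility. A minimal sketch (my own example, not taken from the Stack Overflow answer) of a library that only exports one symbol:

/* visibility.c
   Build with: gcc -shared -fPIC -fvisibility=hidden visibility.c -o libvis.so */

/* Explicitly re-exported, so it stays in the dynamic symbol table. */
__attribute__((visibility("default")))
int exported_fn(void) { return 1; }

/* Inherits hidden visibility from -fvisibility=hidden: callable inside
   the library, but invisible to dlsym and to other shared objects. */
int hidden_fn(void) { return 2; }

nm -D libvis.so should then list exported_fn but not hidden_fn.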

Scoping

ChatGPT pointed me to man dlopen to read more about library linking.

namespaces (created with dlmopen): within a namespace, dependent shared objects are implicitly loaded into that same namespace. These allow more isolation than RTLD_LOCAL, with up to 16 namespaces allowed in glibc.
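A minimal sketch of what that could look like for the two-versions problem above; libfoo.so.1, libfoo.so.2, and the lib_version symbol are hypothetical stand-ins:

/* Build with -ldl; requires glibc. */
#define _GNU_SOURCE
#include <dlfcn.h>
#include <stdio.h>
#include <stdlib.h>

int main(void) {
  /* Each dlmopen(LM_ID_NEWLM, ...) call creates a fresh link-map
     namespace, so the two copies never see each other's symbols. */
  void *v1 = dlmopen(LM_ID_NEWLM, "./libfoo.so.1", RTLD_NOW | RTLD_LOCAL);
  void *v2 = dlmopen(LM_ID_NEWLM, "./libfoo.so.2", RTLD_NOW | RTLD_LOCAL);
  if (!v1 || !v2) {
    fprintf(stderr, "dlmopen: %s\n", dlerror());
    return EXIT_FAILURE;
  }

  /* The same symbol name resolves independently in each namespace. */
  const char *(*ver1)(void) = (const char *(*)(void))dlsym(v1, "lib_version");
  const char *(*ver2)(void) = (const char *(*)(void))dlsym(v2, "lib_version");
  if (ver1 && ver2)
    printf("v1: %s, v2: %s\n", ver1(), ver2());

  dlclose(v1);
  dlclose(v2);
  return 0;
}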

OpenAI's Transformer Debugger

First pass at playing with the transformer debugger / reading through the repo

Book: Linkers & Loaders

Reading through a description of file paging and the different formats for shared libraries and binaries, including a.out, ELF, etc. I'm not quite sure how to make the most of this book -- at some point I'll probably want to try to implement my own linker for a single format/architecture as an exercise.
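As a tiny first step in that direction, a sketch of reading the fixed-size ELF header from a binary; this only uses the standard <elf.h> definitions:

#include <elf.h>
#include <stdio.h>
#include <string.h>

int main(int argc, char **argv) {
  if (argc != 2) {
    fprintf(stderr, "usage: %s <binary>\n", argv[0]);
    return 1;
  }
  FILE *f = fopen(argv[1], "rb");
  if (!f) { perror("fopen"); return 1; }

  Elf64_Ehdr ehdr;  // The header sits at offset 0 of every ELF file.
  if (fread(&ehdr, sizeof ehdr, 1, f) != 1) {
    fprintf(stderr, "short read\n");
    fclose(f);
    return 1;
  }
  fclose(f);

  // The first four identification bytes are the magic: 0x7f 'E' 'L' 'F'.
  if (memcmp(ehdr.e_ident, ELFMAG, SELFMAG) != 0) {
    printf("not an ELF file\n");
    return 0;
  }
  printf("class: %s\n",
         ehdr.e_ident[EI_CLASS] == ELFCLASS64 ? "64-bit" : "32-bit");
  printf("type: %d, machine: %d, entry: 0x%lx\n",
         ehdr.e_type, ehdr.e_machine, (unsigned long)ehdr.e_entry);
  return 0;
}

The program and section header tables then follow at the e_phoff and e_shoff offsets recorded in the header, which is where a toy linker would pick up.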

Book: The Coming Wave

Received the book as a gift today, and started skimming through it (I still need to re-read it slowly). The most interesting chapters come towards the end, and recommend a very careful path into the future, balancing several different approaches to containing the effects of AI.

I'm not quite sure where I stand with the book, but I'm looking forward to going through it again to see how AI is expected to affect the future; the changes so far have been good, but not that large.

Book: Stripe's Letter

Stripe has always been an interesting company, and their letter talks a little bit about reliability.

CUDA

Playing with CUDA & NCCL on Pi Day; I'm trying out a programming experiment to estimate π using GPUs. The only way I knew of to estimate π was to sample points in the unit square and count the fraction that fall inside the quarter circle, which approaches π/4 -- and as ChatGPT reminded me, that's extremely parallelizable. As a trivial first attempt (using a regular grid of points rather than random samples):

#include <cuda_runtime.h>

#include <math.h>
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>

#define CHECK(call) do { \
    cudaError_t error = call; \
    if (error != cudaSuccess) { \
        fprintf(stderr, "CUDA Error at %s:%d - %s\n", __FILE__, __LINE__, cudaGetErrorString(error)); \
        exit(EXIT_FAILURE); \
    } \
} while(0)


__global__
void count(int N, bool *out) {
  int i = threadIdx.x + blockDim.x * blockIdx.x;
  int j = threadIdx.y + blockDim.y * blockIdx.y;
  /* Edit: this flattening is only valid because blockDim.x * gridDim.x == N here */
  int p = i * blockDim.x * gridDim.x + j;

  float x = (i + .5) / N;
  float y = (j + .5) / N;
  out[p] = (x * x + y * y) <= 1;
}


int main(void) {

  int devCount;
  CHECK(cudaGetDeviceCount(&devCount));

  cudaDeviceProp props;
  for (int i = 0; i < devCount; i++) {
    CHECK(cudaGetDeviceProperties(&props, i));
    printf("Device %d | Max threads: %d\n", i, props.maxThreadsPerBlock);
  }

  int N = 64; // Size of grid: N*N points in the unit square
  bool *out;
  CHECK(cudaMallocManaged(&out, N * N * sizeof(bool)));
  count<<<dim3(2, 2), dim3(N/2, N/2)>>>(N, out);
  CHECK(cudaGetLastError());       // catch launch-configuration errors
  CHECK(cudaDeviceSynchronize());  // catch errors from the kernel itself

  unsigned int count = 0;
  for (int i = 0; i < N; i++) {
    for (int j = 0; j < N; j++) {
      if (out[i * N + j]) {
        // printf(".");
        count++;
      }
    }
    // printf("\n");
  }

  printf("π = %f\n", 4 * (float)count / (N * N));
}

Since writing the program, I've been experimenting with moving the reduction to another kernel, and benchmarking it aggressively.
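A sketch of what that reduction might look like -- this is the standard shared-memory tree reduction, not necessarily the exact kernel I ended up with:

__global__
void reduce(const bool *in, int n, unsigned int *total) {
  __shared__ unsigned int partial[256];
  int tid = threadIdx.x;
  int idx = blockIdx.x * blockDim.x + tid;

  // Each thread loads one flag (0 when past the end of the array).
  partial[tid] = (idx < n && in[idx]) ? 1u : 0u;
  __syncthreads();

  // Tree reduction within the block: halve the active threads each step.
  for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
    if (tid < stride)
      partial[tid] += partial[tid + stride];
    __syncthreads();
  }

  // One atomic add per block instead of one per thread.
  if (tid == 0)
    atomicAdd(total, partial[0]);
}

Launched as reduce<<<(N*N + 255) / 256, 256>>>(out, N*N, total), with total allocated via cudaMallocManaged and zeroed first; the block size has to match the 256 hard-coded into the shared array.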

ncu

For some reason this is hard to find, but ncu is NVIDIA's Nsight Compute CLI. So far I've used it directly with ncu -o profile <binary>, which writes a profile.ncu-rep report.

Building large models

A little guide to building large language models is a treasure chest of useful links and data.
- Mamba the hard way

Kunal