Working Notes: a commonplace notebook for recording & exploring ideas.
I had a week-long vacation and spent the time exploring. These letters were originally published separately, but I ultimately decided to combine them into a single page just to maintain consistency.
I spent a large part of today debugging my laptop setup, and learned more about Flatpak and xdg-desktop-portal than I would have liked to. The short of it was that I couldn't get file open dialogs to work in Chrome -- and I couldn't get configuration-based fallbacks to work correctly by updating .config/xdg-desktop-portal/portals.conf. The solution ended up being to directly modify the configuration at /usr/share/xdg-desktop-portal/portals/gtk.portal and add sway to it.
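For reference, a rough sketch of what that change looks like -- the exact DBusName and Interfaces list vary by xdg-desktop-portal version, so treat this as illustrative; the addition is sway in the UseIn line:

```ini
# /usr/share/xdg-desktop-portal/portals/gtk.portal (illustrative)
[portal]
DBusName=org.freedesktop.impl.portal.desktop.gtk
Interfaces=org.freedesktop.impl.portal.FileChooser
UseIn=gnome;sway
```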
Working through this always makes me wonder if it's worth the time to use a Linux laptop, but in the end I'd rather deepen my knowledge of Linux instead of wrestling MacOS or Windows. I wouldn't mind more powerful ChromeOS laptops though.
[Edit: 2024-03-24] This has still been plaguing me; there seems to be some bug that doesn't manifest immediately after a restart, but probably after putting the laptop to sleep and restarting.
Spending some time reading about CUDA and multiprocessing today; I inevitably forget what an SM or a warp is, just because I don't get enough of a chance to use them daily. So, some definitions: an SM (Streaming Multiprocessor) is the hardware unit that actually executes thread blocks -- each block is scheduled onto a single SM. A warp is a group of 32 threads within a block that execute the same instruction in lockstep; blocks are split into warps for scheduling.
Notes from DeepMind's Paper:
I really want to be able to easily look at a full model's definition without needing to read and hand-annotate code: the modules, dimensions passed in and handled, etc. Intermediate Logging got close to it with the way it worked, but I still want to play with more sophisticated visualizations.
I stumbled across this detailed answer on Stack Overflow on controlling symbol visibility:
- #pragma GCC visibility push(hidden) / #pragma GCC visibility pop for marking whole sections of code
- -fvisibility=[default|internal|hidden|protected] at compile time
- __attribute__ ((visibility("default"))) per symbol in code
I'm exploring this because I'm curious if I can combine native Python extensions that use different versions of the same library in the same process.
ChatGPT pointed me to man dlopen to read more about library linking.
- RTLD_GLOBAL, RTLD_LOCAL set up how symbols are resolved going forward.
- RTLD_DEEPBIND sets local scope ahead of global scope.
- Namespaces: within a namespace, dependent shared objects are implicitly loaded. These allow more flexibility than RTLD_LOCAL, with up to 16 namespaces allowed.
First pass at playing with the transformer debugger / reading through the repo
Reading through a description of file paging and different formats for shared libraries and binaries, including a.out, ELF, etc. I'm not quite sure how to make the most of this book -- at some point I'll probably want to try to implement my own linker for a single format/architecture as an exercise.
Received the book as a gift today, and I started skimming through it (I still need to re-read it slowly). The most interesting chapters come towards the end and recommend a very careful path into the future, balancing several different approaches to control the effects of AI.
I'm not quite sure where I stand with the book, but I'm looking forward to going through it again to see how AI is expected to affect the future; all the changes so far have been good but not that large.
Stripe has always been an interesting company, and they talked a little bit about reliability.
Playing with CUDA & NCCL on Pi Day; I'm trying out a programming experiment to estimate π using GPUs. The only way I knew of to estimate π was to use random points to estimate the ratio of points that fall outside / inside the circle -- and as ChatGPT reminded me, that's extremely parallelizable. As a trivial first attempt (using a regular grid of points rather than random samples):
#include <cuda_runtime.h>
#include <math.h>
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>

#define CHECK(call) do { \
    cudaError_t error = call; \
    if (error != cudaSuccess) { \
        fprintf(stderr, "CUDA Error at %s:%d - %s\n", __FILE__, __LINE__, cudaGetErrorString(error)); \
        exit(EXIT_FAILURE); \
    } \
} while(0)

__global__
void count(int N, bool *out) {
    int i = threadIdx.x + blockDim.x * blockIdx.x;
    int j = threadIdx.y + blockDim.y * blockIdx.y;
    // Row-major index; blockDim.x * gridDim.x == N for this launch config
    int p = i * blockDim.x * gridDim.x + j;
    float x = (i + .5f) / N;
    float y = (j + .5f) / N;
    out[p] = (x * x + y * y) <= 1;
}

int main(void) {
    int devCount;
    CHECK(cudaGetDeviceCount(&devCount));
    cudaDeviceProp props;
    for (int i = 0; i < devCount; i++) {
        CHECK(cudaGetDeviceProperties(&props, i));
        printf("Device %d | Max threads: %d\n", i, props.maxThreadsPerBlock);
    }

    int N = 64; // Size of grid
    bool *out;
    CHECK(cudaMallocManaged(&out, N * N * sizeof(bool)));

    // 2x2 grid of 32x32 blocks covers all N*N points
    count<<<dim3(2, 2), dim3(N/2, N/2)>>>(N, out);
    CHECK(cudaGetLastError());
    CHECK(cudaDeviceSynchronize());

    unsigned int inside = 0;
    for (int i = 0; i < N; i++) {
        for (int j = 0; j < N; j++) {
            if (out[i * N + j]) {
                inside++;
            }
        }
    }
    printf("π = %f\n", 4 * (float)inside / (N * N));

    CHECK(cudaFree(out));
    return 0;
}
Since writing the program, I've been experimenting with moving the reduction to another kernel, and benchmarking it aggressively.
For some reason this is hard to find, but ncu is NVIDIA's Nsight Compute CLI. So far I've used it directly with ncu <binary> -o profile.
- A little guide to building large language models is a treasure chest of useful links and data.
- Mamba the hard way
— Kunal