Working Notes: a commonplace notebook for recording & exploring ideas.
I had a week-long vacation and spent the time exploring. These letters were originally published separately, but I ultimately decided to combine them into a single page just to maintain consistency.
I spent a large part of today debugging my laptop setup, and learned more about Flatpak and xdg-desktop-portal than I would have liked to. The short of it was that I couldn't get file open dialogs to work in Chrome -- and I couldn't get configuration-based fallbacks to work correctly by updating .config/xdg-desktop-portal/portals.conf. The solution ended up being to directly modify the configuration at /usr/share/xdg-desktop-portal/portals/gtk.portal and add sway to it.
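For reference, a rough sketch of what that change looks like -- the exact DBusName and Interfaces list vary by xdg-desktop-portal version, so treat this as illustrative; the addition is sway in the UseIn line:

```ini
# /usr/share/xdg-desktop-portal/portals/gtk.portal (illustrative)
[portal]
DBusName=org.freedesktop.impl.portal.desktop.gtk
Interfaces=org.freedesktop.impl.portal.FileChooser
UseIn=gnome;sway
```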
Working through this always makes me wonder if it's worth the time to use a Linux laptop, but in the end I'd rather deepen my knowledge of Linux instead of wrestling MacOS or Windows. I wouldn't mind more powerful ChromeOS laptops though.
[Edit: 2024-03-24] This has still been plaguing me; there seems to be some bug that doesn't manifest immediately after a restart, but probably after putting the laptop to sleep and restarting.
Spending some time reading about CUDA and multiprocessing today; I inevitably forget what an SM or a warp is, just because I don't get enough of a chance to use them daily. So, some definitions: an SM (Streaming Multiprocessor) is the hardware unit that actually executes thread blocks -- each block is scheduled onto a single SM. A warp is a group of 32 threads within a block that execute the same instruction in lockstep; blocks are split into warps for scheduling.
Notes from DeepMind's Paper:
I really want to be able to easily look at a full model's definition without needing to read and hand-annotate code: the modules, dimensions passed in and handled, etc. Intermediate Logging got close to it with the way it worked, but I still want to play with more sophisticated visualizations.
I stumbled across this detailed answer on Stack Overflow on controlling symbol visibility:
- #pragma GCC visibility push(hidden) / #pragma GCC visibility pop for marking whole sections of code
- -fvisibility=[default|internal|hidden|protected] at compile time
- __attribute__ ((visibility("default"))) per symbol in code
I'm exploring this because I'm curious if I can combine native Python extensions that use different versions of the same library in the same process.
ChatGPT pointed me to man dlopen to read more about library linking.
- RTLD_GLOBAL, RTLD_LOCAL set up how symbols are resolved going forward.
- RTLD_DEEPBIND sets local scope ahead of global scope.
- Namespaces: within a namespace, dependent shared objects are implicitly loaded. These allow more flexibility than RTLD_LOCAL, with up to 16 namespaces allowed.
First pass at playing with the transformer debugger / reading through the repo
Reading through a description of file paging and different formats for shared libraries and binaries, including a.out, ELF, etc. I'm not quite sure how to make the most of this book -- at some point I'll probably want to try to implement my own linker for a single format/architecture as an exercise.
Received the book as a gift today, and I started skimming through it (I still need to re-read it slowly). The most interesting chapters come towards the end and recommend a very careful path into the future, balancing several different approaches to control the effects of AI.
I'm not quite sure where I stand with the book, but I'm looking forward to going through it again to see how AI is expected to affect the future; all the changes so far have been good but not that large.
Stripe has always been an interesting company, and they talked a little bit about reliability.
Playing with CUDA & NCCL on Pi Day; I'm trying out a programming experiment to estimate π using GPUs. The only way I knew of to estimate π was to use random points to estimate the ratio of points that fall outside / inside the circle -- and as ChatGPT reminded me, that's extremely parallelizable. As a trivial first attempt (using a regular grid of points rather than random samples):
#include <cuda_runtime.h>
#include <math.h>
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>

#define CHECK(call) do { \
    cudaError_t error = call; \
    if (error != cudaSuccess) { \
        fprintf(stderr, "CUDA Error at %s:%d - %s\n", __FILE__, __LINE__, cudaGetErrorString(error)); \
        exit(EXIT_FAILURE); \
    } \
} while(0)

__global__
void count(int N, bool *out) {
    int i = threadIdx.x + blockDim.x * blockIdx.x;
    int j = threadIdx.y + blockDim.y * blockIdx.y;
    // Row-major index; blockDim.x * gridDim.x == N for this launch config
    int p = i * blockDim.x * gridDim.x + j;
    float x = (i + .5f) / N;
    float y = (j + .5f) / N;
    out[p] = (x * x + y * y) <= 1;
}

int main(void) {
    int devCount;
    CHECK(cudaGetDeviceCount(&devCount));
    cudaDeviceProp props;
    for (int i = 0; i < devCount; i++) {
        CHECK(cudaGetDeviceProperties(&props, i));
        printf("Device %d | Max threads: %d\n", i, props.maxThreadsPerBlock);
    }

    int N = 64; // Size of grid
    bool *out;
    CHECK(cudaMallocManaged(&out, N * N * sizeof(bool)));

    // 2x2 grid of 32x32 blocks covers all N*N points
    count<<<dim3(2, 2), dim3(N/2, N/2)>>>(N, out);
    CHECK(cudaGetLastError());
    CHECK(cudaDeviceSynchronize());

    unsigned int inside = 0;
    for (int i = 0; i < N; i++) {
        for (int j = 0; j < N; j++) {
            if (out[i * N + j]) {
                inside++;
            }
        }
    }
    printf("π = %f\n", 4 * (float)inside / (N * N));

    CHECK(cudaFree(out));
    return 0;
}
Since writing the program, I've been experimenting with moving the reduction to another kernel, and benchmarking it aggressively.
For some reason this is hard to find, but ncu is NVIDIA's Nsight Compute CLI. So far I've used it directly with ncu <binary> -o profile.
- A little guide to building large language models is a treasure chest of useful links and data.
- Mamba the hard way
— Kunal