-
Video
-
Attention engines for LLM inference
-
Reason for another attention mechanism
- vLLM tackles memory fragmentation for variable-length sequences
- pre-setting a max sequence length causes a lot of fragmentation
- vLLM uses a data structure, PagedAttention, that organizes the KV cache as pages
- 16-token granularity per block
- same technique as OS page tables
- uses swapping to manage the cache efficiently
- each request directly accesses its own entries inside the KV cache
- the PagedAttention paper introduced this page-table design about two years ago
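The paged design above can be sketched as a tiny block allocator; this is an illustrative toy (class and method names are mine, not vLLM's), assuming a fixed pool of physical blocks and 16-token pages:

```python
# Hypothetical sketch of a paged KV cache: a per-request block table maps
# logical token positions to physical blocks, allocated on demand, so no
# request pre-reserves space for a max sequence length.
BLOCK_SIZE = 16

class PagedKVCache:
    def __init__(self, num_physical_blocks):
        self.free_blocks = list(range(num_physical_blocks))
        self.block_tables = {}  # request id -> list of physical block ids

    def append_token(self, req_id, pos):
        """Return the physical slot for logical position `pos`."""
        table = self.block_tables.setdefault(req_id, [])
        if pos // BLOCK_SIZE >= len(table):      # current page is full
            if not self.free_blocks:
                raise MemoryError("no free blocks: evict or swap to CPU")
            table.append(self.free_blocks.pop())
        block = table[pos // BLOCK_SIZE]
        return block * BLOCK_SIZE + pos % BLOCK_SIZE

    def free(self, req_id):
        """Return a finished request's blocks to the pool."""
        self.free_blocks.extend(self.block_tables.pop(req_id, []))
```

The point of the indirection is the same as an OS page table: logical positions stay contiguous per request while physical blocks come from a shared pool, so fragmentation is bounded by one partially filled block per request.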
-
Then SGLang used RadixAttention
- page size = 1 for less fragmentation and a higher cache hit rate
- duplicated prefixes are organized as a radix tree
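Prefix sharing can be sketched as follows; for simplicity this is a per-token trie rather than a path-compressed radix tree, and the names are illustrative, not SGLang's API:

```python
# Sketch of RadixAttention-style prefix sharing with page size 1: each tree
# node owns one cached KV slot, so requests with a common prompt prefix
# reuse the same KV entries instead of recomputing them.
class RadixNode:
    def __init__(self):
        self.children = {}   # token id -> RadixNode
        self.kv_slot = None  # index of this token's cached KV entry

class RadixCache:
    def __init__(self):
        self.root = RadixNode()
        self.next_slot = 0

    def match_prefix(self, tokens):
        """Return cached KV slots for the longest shared prefix."""
        node, slots = self.root, []
        for t in tokens:
            if t not in node.children:
                break
            node = node.children[t]
            slots.append(node.kv_slot)
        return slots

    def insert(self, tokens):
        """Insert a sequence, allocating KV slots only for new tokens."""
        node = self.root
        for t in tokens:
            if t not in node.children:
                child = RadixNode()
                child.kv_slot = self.next_slot
                self.next_slot += 1
                node.children[t] = child
            node = node.children[t]
```

With one token per page, a new request pays only for the suffix that diverges from previously seen prompts.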
-
Recently: token-tree verification and tree speculative decoding
- Medusa, SpecInfer, Sequoia
- a draft model generates candidates; the target model validates them in parallel
- to increase the efficiency of generating candidates
- generate a tree of candidates
- feed the tree to the large model to see which branches match its outputs
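The verification step can be sketched like this; `target_next` is a stand-in for one batched forward pass of the large model over the whole tree (with a tree attention mask), and greedy acceptance is assumed for simplicity:

```python
# Hypothetical sketch of token-tree verification: walk the draft tree and
# accept the deepest path whose tokens all match the target model's own
# next-token choices.
def verify_tree(tree, target_next, prefix=()):
    """tree: {token: subtree}; returns the longest accepted token path."""
    expected = target_next(prefix)   # target's token after `prefix`
    if expected in tree:             # this branch survives verification
        sub = verify_tree(tree[expected], target_next, prefix + (expected,))
        return [expected] + sub
    return []
```

Because the tree covers several alternative continuations per position, one verification pass can accept multiple tokens, which is where the speedup comes from.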
-
KV cache future?
- active research
- all of the designs so far focus on the GPU only
-
Organize it in a page table
- when computing attention, use reduced keys for each block
- to skip unnecessary computation
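One way to realize "reduced keys per block" (a sketch under my own assumptions, not the talk's code) is to summarize each key block by its elementwise min/max, upper-bound the query-key dot product per block, and skip low-scoring blocks:

```python
# Illustrative per-block key reduction: bound q.k for a whole 16-token
# block from its min/max key summaries, then attend only to the blocks
# with the highest bounds.
import numpy as np

BLOCK = 16

def block_bounds(keys):
    """Elementwise min/max summaries for each key block."""
    blocks = keys.reshape(-1, BLOCK, keys.shape[-1])
    return blocks.min(axis=1), blocks.max(axis=1)

def select_blocks(query, kmin, kmax, top=2):
    """Keep the blocks whose upper-bounded q.k score is largest."""
    # Per dimension, max of q*k over the block is q*kmax if q > 0 else q*kmin.
    ub = np.where(query > 0, query * kmax, query * kmin).sum(axis=-1)
    return np.argsort(ub)[-top:]
```

The summaries are tiny compared to the full cache, so the filtering cost is small relative to the attention FLOPs it avoids.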
-
More friendly for gpu tensor cores -- block compressed format
-
In graphics, the approach is to take sparse structures and combine them
-
in the KV cache, take the page-table data structure but make it look like a block-sparse matrix for tensor cores
-
the tree mask is very sparse; most of the FLOPs are still wasted
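The sparsity claim is easy to see by building the mask: in tree attention each node attends only to itself and its ancestors, and grouping the dense mask into tiles shows which blocks a tensor-core kernel could skip entirely. A minimal sketch (tile size and layout are illustrative):

```python
# Build a tree attention mask from parent pointers, then view it in a
# block-compressed form: only tiles containing a nonzero entry need to be
# computed by a block-sparse tensor-core kernel.
import numpy as np

def tree_mask(parents):
    """parents[i] is node i's parent (-1 for root); mask[i, j] = 1 iff
    j is i or an ancestor of i."""
    n = len(parents)
    mask = np.zeros((n, n), dtype=np.int8)
    for i in range(n):
        j = i
        while j != -1:
            mask[i, j] = 1
            j = parents[j]
    return mask

def nonzero_tiles(mask, tile=4):
    """Coordinates of tiles that contain at least one nonzero entry."""
    n = mask.shape[0]
    tiles = mask.reshape(n // tile, tile, n // tile, tile).any(axis=(1, 3))
    return np.argwhere(tiles)
```

Even on a tiny 8-node tree the mask is mostly zeros, and some tiles are empty; at realistic tree sizes the dense mask wastes most of the FLOPs unless the kernel exploits the block structure.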