-
Video
-
Attention engines for LLM inference
-
Reason for another attention mechanism
- vLLM tackles memory fragmentation for variable-length sequences
- pre-setting a max sequence length causes a lot of fragmentation
- vLLM uses a data structure, PagedAttention, that organizes the KV cache as pages
- 16-token granularity per block
- same technique as OS page tables
- uses swapping to manage the cache efficiently
- each request directly accesses its own entries inside the KV cache
- the PagedAttention paper introduced this page-table design about two years ago
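The paged design above can be sketched as a tiny block allocator; this is an illustrative toy (class and method names are mine, not vLLM's), assuming a fixed pool of physical blocks and 16-token pages:

```python
# Hypothetical sketch of a paged KV cache: a per-request block table maps
# logical token positions to physical blocks, allocated on demand, so no
# request pre-reserves space for a max sequence length.
BLOCK_SIZE = 16

class PagedKVCache:
    def __init__(self, num_physical_blocks):
        self.free_blocks = list(range(num_physical_blocks))
        self.block_tables = {}  # request id -> list of physical block ids

    def append_token(self, req_id, pos):
        """Return the physical slot for logical position `pos`."""
        table = self.block_tables.setdefault(req_id, [])
        if pos // BLOCK_SIZE >= len(table):      # current page is full
            if not self.free_blocks:
                raise MemoryError("no free blocks: evict or swap to CPU")
            table.append(self.free_blocks.pop())
        block = table[pos // BLOCK_SIZE]
        return block * BLOCK_SIZE + pos % BLOCK_SIZE

    def free(self, req_id):
        """Return a finished request's blocks to the pool."""
        self.free_blocks.extend(self.block_tables.pop(req_id, []))
```

The point of the indirection is the same as an OS page table: logical positions stay contiguous per request while physical blocks come from a shared pool, so fragmentation is bounded by one partially filled block per request.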
-
Then SGLang used RadixAttention
- page size = 1 for less fragmentation and a higher cache hit rate
- duplicated prefixes are organized as a radix tree
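Prefix sharing can be sketched as follows; for simplicity this is a per-token trie rather than a path-compressed radix tree, and the names are illustrative, not SGLang's API:

```python
# Sketch of RadixAttention-style prefix sharing with page size 1: each tree
# node owns one cached KV slot, so requests with a common prompt prefix
# reuse the same KV entries instead of recomputing them.
class RadixNode:
    def __init__(self):
        self.children = {}   # token id -> RadixNode
        self.kv_slot = None  # index of this token's cached KV entry

class RadixCache:
    def __init__(self):
        self.root = RadixNode()
        self.next_slot = 0

    def match_prefix(self, tokens):
        """Return cached KV slots for the longest shared prefix."""
        node, slots = self.root, []
        for t in tokens:
            if t not in node.children:
                break
            node = node.children[t]
            slots.append(node.kv_slot)
        return slots

    def insert(self, tokens):
        """Insert a sequence, allocating KV slots only for new tokens."""
        node = self.root
        for t in tokens:
            if t not in node.children:
                child = RadixNode()
                child.kv_slot = self.next_slot
                self.next_slot += 1
                node.children[t] = child
            node = node.children[t]
```

With one token per page, a new request pays only for the suffix that diverges from previously seen prompts.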
-
Recently: token-tree verification and tree speculative decoding
- Medusa, SpecInfer, Sequoia
- a draft model generates candidates; the target model validates them in parallel
- to increase the efficiency of generating candidates
- generate a tree of candidates
- feed the tree to the large model to see which branches match its outputs
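The verification step can be sketched like this; `target_next` is a stand-in for one batched forward pass of the large model over the whole tree (with a tree attention mask), and greedy acceptance is assumed for simplicity:

```python
# Hypothetical sketch of token-tree verification: walk the draft tree and
# accept the deepest path whose tokens all match the target model's own
# next-token choices.
def verify_tree(tree, target_next, prefix=()):
    """tree: {token: subtree}; returns the longest accepted token path."""
    expected = target_next(prefix)   # target's token after `prefix`
    if expected in tree:             # this branch survives verification
        sub = verify_tree(tree[expected], target_next, prefix + (expected,))
        return [expected] + sub
    return []
```

Because the tree covers several alternative continuations per position, one verification pass can accept multiple tokens, which is where the speedup comes from.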
-
KV cache future?
- active research
- all of the designs so far focus on the GPU only
-
Organize it in a page table
- when computing attention, use reduced keys for each block
- to skip unnecessary computation
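One way to realize "reduced keys per block" (a sketch under my own assumptions, not the talk's code) is to summarize each key block by its elementwise min/max, upper-bound the query-key dot product per block, and skip low-scoring blocks:

```python
# Illustrative per-block key reduction: bound q.k for a whole 16-token
# block from its min/max key summaries, then attend only to the blocks
# with the highest bounds.
import numpy as np

BLOCK = 16

def block_bounds(keys):
    """Elementwise min/max summaries for each key block."""
    blocks = keys.reshape(-1, BLOCK, keys.shape[-1])
    return blocks.min(axis=1), blocks.max(axis=1)

def select_blocks(query, kmin, kmax, top=2):
    """Keep the blocks whose upper-bounded q.k score is largest."""
    # Per dimension, max of q*k over the block is q*kmax if q > 0 else q*kmin.
    ub = np.where(query > 0, query * kmax, query * kmin).sum(axis=-1)
    return np.argsort(ub)[-top:]
```

The summaries are tiny compared to the full cache, so the filtering cost is small relative to the attention FLOPs it avoids.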
-
More friendly for gpu tensor cores -- block compressed format
-
In graphics, the approach is to take sparse structures and combine them
-
in the KV cache, take the page-table data structure but make it look like a block-sparse matrix for tensor cores
-
the tree mask is very sparse; most of the FLOPs are still wasted
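The sparsity claim is easy to see by building the mask: in tree attention each node attends only to itself and its ancestors, and grouping the dense mask into tiles shows which blocks a tensor-core kernel could skip entirely. A minimal sketch (tile size and layout are illustrative):

```python
# Build a tree attention mask from parent pointers, then view it in a
# block-compressed form: only tiles containing a nonzero entry need to be
# computed by a block-sparse tensor-core kernel.
import numpy as np

def tree_mask(parents):
    """parents[i] is node i's parent (-1 for root); mask[i, j] = 1 iff
    j is i or an ancestor of i."""
    n = len(parents)
    mask = np.zeros((n, n), dtype=np.int8)
    for i in range(n):
        j = i
        while j != -1:
            mask[i, j] = 1
            j = parents[j]
    return mask

def nonzero_tiles(mask, tile=4):
    """Coordinates of tiles that contain at least one nonzero entry."""
    n = mask.shape[0]
    tiles = mask.reshape(n // tile, tile, n // tile, tile).any(axis=(1, 3))
    return np.argwhere(tiles)
```

Even on a tiny 8-node tree the mask is mostly zeros, and some tiles are empty; at realistic tree sizes the dense mask wastes most of the FLOPs unless the kernel exploits the block structure.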