- Video
- Attention engine for LLM inference
- Reason for another attention mechanism
    - vLLM tackles memory fragmentation for variable-length sequences
    - Pre-allocating for a fixed max sequence length causes a lot of fragmentation
    - PagedAttention is a data structure that organizes the KV cache as pages
    - 16-token granularity per block
    - Same technique as an OS page table
    - Uses swapping to manage the cache efficiently
 
    - For each request, the block table lets the kernel directly access entries inside the KV cache (a sketch follows below)
    - The PagedAttention paper from about two years ago introduced this page-table design
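A minimal sketch of the idea, assuming a single attention head and illustrative sizes; `BLOCK_SIZE = 16` matches the 16-token granularity above, and the class and method names are made up rather than vLLM's actual API. Each request gets a block table mapping logical block indices to physical blocks in a shared pool, so the cache grows block by block instead of reserving a max-length buffer up front.

```python
import numpy as np

BLOCK_SIZE = 16    # tokens per block, matching the 16-token granularity above
NUM_BLOCKS = 1024  # size of the shared physical block pool (illustrative)
HEAD_DIM = 64      # per-head hidden size (illustrative)

class PagedKVCache:
    def __init__(self):
        # Physical pool: keys/values stored block by block, like page frames in an OS.
        self.k_pool = np.zeros((NUM_BLOCKS, BLOCK_SIZE, HEAD_DIM), dtype=np.float16)
        self.v_pool = np.zeros((NUM_BLOCKS, BLOCK_SIZE, HEAD_DIM), dtype=np.float16)
        self.free_blocks = list(range(NUM_BLOCKS))
        # Per-request "page table": logical block index -> physical block index.
        self.block_tables = {}
        self.seq_lens = {}

    def append(self, request_id, k, v):
        """Append one token's key/value, allocating a new block only when needed."""
        table = self.block_tables.setdefault(request_id, [])
        pos = self.seq_lens.get(request_id, 0)
        if pos % BLOCK_SIZE == 0:          # current block is full (or this is the first token)
            table.append(self.free_blocks.pop())
        block, offset = table[pos // BLOCK_SIZE], pos % BLOCK_SIZE
        self.k_pool[block, offset] = k
        self.v_pool[block, offset] = v
        self.seq_lens[request_id] = pos + 1

    def gather_keys(self, request_id):
        """Directly access this request's cache entries through its block table."""
        n = self.seq_lens[request_id]
        table = self.block_tables[request_id]
        return np.concatenate([self.k_pool[b] for b in table], axis=0)[:n]
```

Because blocks are handed out on demand, internal fragmentation is limited to at most one partially filled block per request, and swapping a request out only has to move the blocks listed in its table.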
 
 
- Then SGLang introduced RadixAttention
    - Page size = 1 for less fragmentation and a higher cache hit rate
    - Duplicated prefixes are organized as a radix tree (see the sketch below)
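A minimal sketch of prefix sharing, assuming one token per node to match the page-size-1 granularity above; SGLang's actual radix tree compresses runs of tokens into single edges, and the names here (`RadixCache`, `kv_slot`) are illustrative only.

```python
class RadixNode:
    def __init__(self):
        self.children = {}    # token id -> child node
        self.kv_slot = None   # index of this token's entry in the KV cache

class RadixCache:
    def __init__(self):
        self.root = RadixNode()
        self.next_slot = 0

    def match_prefix(self, tokens):
        """Return how many leading tokens are already cached, plus their KV slots."""
        node, slots = self.root, []
        for tok in tokens:
            if tok not in node.children:
                break
            node = node.children[tok]
            slots.append(node.kv_slot)
        return len(slots), slots

    def insert(self, tokens):
        """Insert a token sequence, reusing cached prefix nodes and allocating the rest."""
        node, slots = self.root, []
        for tok in tokens:
            if tok not in node.children:
                child = RadixNode()
                child.kv_slot = self.next_slot
                self.next_slot += 1
                node.children[tok] = child
            node = node.children[tok]
            slots.append(node.kv_slot)
        return slots

# Two requests that share the prompt prefix [1, 2, 3] hit the same cache slots.
cache = RadixCache()
a = cache.insert([1, 2, 3, 4])
b = cache.insert([1, 2, 3, 5])
assert a[:3] == b[:3]
```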
 
- Recently: token-tree verification and tree-based speculative decoding
    - Medusa, SpecInfer, Sequoia
    - A draft model generates candidate tokens; the target model validates them in parallel
    - To make candidate generation more efficient, the draft generates a tree of candidates
    - The tree is fed to the large model to check which branches match its own outputs (see the sketch below)
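A minimal sketch of the verification step, with the draft tree given as parent pointers and the target model stubbed out as `target_next` (a made-up callable). A real system scores every node of the tree in one batched forward pass using a tree attention mask, rather than calling the model token by token as done here.

```python
from typing import Callable

def verify_tree(prefix: list,
                parents: list,        # parents[i] = index of node i's parent (-1 for roots)
                tokens: list,         # tokens[i] = candidate token at node i
                target_next: Callable) -> list:
    """Return the longest accepted chain of candidate tokens."""
    best = []
    for leaf in range(len(tokens)):
        # Reconstruct the root-to-node path for this candidate branch.
        path, node = [], leaf
        while node != -1:
            path.append(tokens[node])
            node = parents[node]
        path.reverse()
        # Accept tokens along the branch while they match the target model's choice.
        accepted, context = [], list(prefix)
        for tok in path:
            if target_next(context) != tok:
                break
            accepted.append(tok)
            context.append(tok)
        if len(accepted) > len(best):
            best = accepted
    return best

# Toy target model: always predicts previous token + 1.
toy_target = lambda ctx: ctx[-1] + 1
# Draft tree with two branches from the prefix [10]: [11, 12] and [11, 13].
parents = [-1, 0, 0]
tokens = [11, 12, 13]
print(verify_tree([10], parents, tokens, toy_target))   # -> [11, 12]
```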
 
 
- The future of the KV cache?
    - Active research
    - All of the designs so far focus on the GPU only
 
- Keep the KV cache organized in a page table
    - When computing attention, use a reduced key for each block
    - Skip blocks whose reduced keys show they contribute little, removing unnecessary computation (see the sketch after this list)
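A minimal sketch of block-level pruning with reduced keys, using a per-block mean key as the summary and a top-k block selection policy; both choices, along with the shapes and names, are assumptions for illustration (real designs may keep per-block min/max bounds or other reductions).

```python
import numpy as np

BLOCK_SIZE, HEAD_DIM = 16, 64

def attend_with_block_pruning(q, keys, values, keep_blocks=4):
    n_blocks = keys.shape[0] // BLOCK_SIZE
    k_blocks = keys.reshape(n_blocks, BLOCK_SIZE, HEAD_DIM)
    v_blocks = values.reshape(n_blocks, BLOCK_SIZE, HEAD_DIM)

    # Reduced key per block: one vector summarizing the block's 16 keys.
    reduced = k_blocks.mean(axis=1)                       # (n_blocks, HEAD_DIM)
    block_scores = reduced @ q                            # coarse relevance per block
    selected = np.argsort(block_scores)[-keep_blocks:]    # keep the most relevant blocks

    # Full attention only over the selected blocks; the rest are skipped entirely.
    k_sel = k_blocks[selected].reshape(-1, HEAD_DIM)
    v_sel = v_blocks[selected].reshape(-1, HEAD_DIM)
    scores = k_sel @ q / np.sqrt(HEAD_DIM)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ v_sel

q = np.random.randn(HEAD_DIM)
keys = np.random.randn(8 * BLOCK_SIZE, HEAD_DIM)
values = np.random.randn(8 * BLOCK_SIZE, HEAD_DIM)
print(attend_with_block_pruning(q, keys, values).shape)   # (64,)
```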
 
- Make it more friendly for GPU tensor cores: a block-compressed format
- In graphics, sparse structures are combined in a similar block-wise way
- For the KV cache, keep the page-table data structure but expose it as a block-sparse matrix that tensor cores can consume
- The tree mask is very sparse; a dense kernel still wastes most of the FLOPs (see the sketch below)
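A minimal sketch of viewing a sparse tree mask as a block-sparse pattern: the dense mask is tiled into tensor-core-sized tiles, fully masked tiles are dropped, and compute would only be issued for the remaining tiles. The heap-layout candidate tree, tile size, and function names are assumptions for illustration.

```python
import numpy as np

BLOCK = 16   # tile size chosen to suit tensor-core-shaped matmuls (assumed)

def ancestors(i):
    """Indices visible to candidate i in a heap-layout binary candidate tree."""
    path = [i]
    while i > 0:
        i = (i - 1) // 2
        path.append(i)
    return path

# Dense tree mask for 128 draft candidates: each candidate attends only to
# the nodes on its own root-to-node path.
N = 128
mask = np.zeros((N, N), dtype=bool)
for i in range(N):
    mask[i, ancestors(i)] = True

def block_sparse_pattern(mask, block=BLOCK):
    """Boolean map of tiles that contain at least one visible entry."""
    r, c = mask.shape
    tiles = mask.reshape(r // block, block, c // block, block)
    return tiles.any(axis=(1, 3))

pattern = block_sparse_pattern(mask)
print(f"element density: {mask.mean():.1%}")     # most entries are masked out
print(f"tile density:    {pattern.mean():.1%}")  # only these tiles need a matmul
```

The element-level density is far lower than the tile-level density, which is the gap the notes point at: a dense kernel wastes most of its FLOPs on masked entries, while a block-compressed layout only pays for tiles that contain at least one visible token.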