For instance, in 70B models, a single request with a 16K-token prompt generates 640 MB of KV cache, split into 2048 disjoint 4KB blocks for
a single GPU out of eight. (Page 4)
2048 * 4KB = 8MB, not 640MB
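A quick check of the arithmetic. The layer/head/dim counts below are assumed Llama-2-70B-like values (80 layers, 8 KV heads with GQA, head_dim 128, fp16), not taken from the paper; under that assumption the 640 MB figure matches the per-GPU share, and 2048 * 4KB is exactly the per-layer share, so the block count may be per layer:

```python
# Back-of-envelope check of the Page 4 numbers.
# Assumed config (Llama-2-70B-like; a guess, not stated in the note):
layers, kv_heads, head_dim = 80, 8, 128   # GQA
bytes_per_elem = 2                        # fp16
tokens, tp = 16 * 1024, 8                 # 16K prompt, 8-way tensor parallel

per_token = 2 * layers * kv_heads * head_dim * bytes_per_elem  # K and V
total = per_token * tokens                # full-request KV cache
per_gpu = total // tp
print(per_gpu / 2**20)                    # 640.0 -> matches the quoted "640 MB"

# 2048 x 4 KiB = 8 MiB = per_gpu / layers, so the 2048 blocks
# may be counted per layer rather than for the whole cache:
print(per_gpu // layers == 2048 * 4096)   # True
```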
16×128×2B = 8192B. (Page 6)
as written, 16 x 128 x 2B = 4096B. maybe 16 x 256 x 2B, or a missing x2 for K and V?
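Both readings reach 8192 B; which factor the paper intended is a guess:

```python
# The Page 6 arithmetic as printed comes out to 4 KiB, not 8 KiB:
assert 16 * 128 * 2 == 4096       # page_size x head_dim x fp16 bytes (?)

# Two ways to get 8192 B (both speculative):
assert 16 * 256 * 2 == 8192       # head_dim = 256 instead of 128, or
assert 2 * 16 * 128 * 2 == 8192   # an extra factor of 2 for K and V
print("ok")
```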
cache[B][KV][L][H][D] (batch, K/V, layer, head, head_dim?)
How to handle paged attention, where k/v is laid out as [max_num_pages, num_layers, num_kv_heads, page_size, head_dim]? Can block coalescing work with a paged KV cache?
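One possible reading of "block coalescing" under this layout: pages that are logically consecutive for a request *and* happen to be physically consecutive along the page dimension can be moved as one contiguous copy. A minimal numpy sketch of that idea (all shapes and the page table below are made up for illustration):

```python
import numpy as np

# Paged KV layout from the note above; sizes are illustrative guesses.
max_num_pages, num_layers, num_kv_heads, page_size, head_dim = 64, 4, 8, 16, 128
k = np.zeros((max_num_pages, num_layers, num_kv_heads, page_size, head_dim),
             dtype=np.float16)

# A request's page table: logical page i -> physical page indices[i].
indices = np.array([3, 4, 5, 9])  # pages 3..5 happen to be physically adjacent

# Split the page table into runs of consecutive physical pages.
runs = np.split(indices, np.where(np.diff(indices) != 1)[0] + 1)
print([r.tolist() for r in runs])  # [[3, 4, 5], [9]]

# One contiguous slice (one transfer) per run, instead of one gather per page.
for r in runs:
    blob = k[r[0]: r[-1] + 1]          # contiguous along the page dimension
    assert blob.flags['C_CONTIGUOUS']  # safe to coalesce into a single copy
```

So coalescing still seems possible with a paged cache, but only at page granularity, and only for runs the allocator happened to place contiguously; a fragmented page table degrades to per-page transfers.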