For instance, in 70B models, a single request with a 16K-token prompt generates 640 MB of KV cache, split into 2048 disjoint 4KB blocks for
a single GPU out of eight. (Page 4)
2048 * 4KB = 8MB, not 640MB
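A quick check of the arithmetic. The layer/head/dim counts below are assumed Llama-2-70B-like values (80 layers, 8 KV heads with GQA, head_dim 128, fp16), not taken from the paper; under that assumption the 640 MB figure matches the per-GPU share, and 2048 * 4KB is exactly the per-layer share, so the block count may be per layer:

```python
# Back-of-envelope check of the Page 4 numbers.
# Assumed config (Llama-2-70B-like; a guess, not stated in the note):
layers, kv_heads, head_dim = 80, 8, 128   # GQA
bytes_per_elem = 2                        # fp16
tokens, tp = 16 * 1024, 8                 # 16K prompt, 8-way tensor parallel

per_token = 2 * layers * kv_heads * head_dim * bytes_per_elem  # K and V
total = per_token * tokens                # full-request KV cache
per_gpu = total // tp
print(per_gpu / 2**20)                    # 640.0 -> matches the quoted "640 MB"

# 2048 x 4 KiB = 8 MiB = per_gpu / layers, so the 2048 blocks
# may be counted per layer rather than for the whole cache:
print(per_gpu // layers == 2048 * 4096)   # True
```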
16×128×2B = 8192B. (Page 6)
as written, 16 x 128 x 2B = 4096B. maybe 16 x 256 x 2B, or a missing x2 for K and V?
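Both readings reach 8192 B; which factor the paper intended is a guess:

```python
# The Page 6 arithmetic as printed comes out to 4 KiB, not 8 KiB:
assert 16 * 128 * 2 == 4096       # page_size x head_dim x fp16 bytes (?)

# Two ways to get 8192 B (both speculative):
assert 16 * 256 * 2 == 8192       # head_dim = 256 instead of 128, or
assert 2 * 16 * 128 * 2 == 8192   # an extra factor of 2 for K and V
print("ok")
```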
cache[B][KV][L][H][D] (batch, K/V, layer, head, head_dim?)
How to handle paged attention, where k/v is laid out as [max_num_pages, num_layers, num_kv_heads, page_size, head_dim]? Can block coalescing work with a paged KV cache?
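One possible reading of "block coalescing" under this layout: pages that are logically consecutive for a request *and* happen to be physically consecutive along the page dimension can be moved as one contiguous copy. A minimal numpy sketch of that idea (all shapes and the page table below are made up for illustration):

```python
import numpy as np

# Paged KV layout from the note above; sizes are illustrative guesses.
max_num_pages, num_layers, num_kv_heads, page_size, head_dim = 64, 4, 8, 16, 128
k = np.zeros((max_num_pages, num_layers, num_kv_heads, page_size, head_dim),
             dtype=np.float16)

# A request's page table: logical page i -> physical page indices[i].
indices = np.array([3, 4, 5, 9])  # pages 3..5 happen to be physically adjacent

# Split the page table into runs of consecutive physical pages.
runs = np.split(indices, np.where(np.diff(indices) != 1)[0] + 1)
print([r.tolist() for r in runs])  # [[3, 4, 5], [9]]

# One contiguous slice (one transfer) per run, instead of one gather per page.
for r in runs:
    blob = k[r[0]: r[-1] + 1]          # contiguous along the page dimension
    assert blob.flags['C_CONTIGUOUS']  # safe to coalesce into a single copy
```

So coalescing still seems possible with a paged cache, but only at page granularity, and only for runs the allocator happened to place contiguously; a fragmented page table degrades to per-page transfers.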