Issue/253: feat: support offline int8 kv cache quantization by qinyiqun · Pull Request #254 · InfiniTensor/InfiniLM

qinyiqun · 2026-03-04T07:20:45Z

Support offline int8 kv cache quantization for static kv cache

examples/jiuge.py

csrc/cache/kv_cache.hpp

csrc/config/model_config.hpp

csrc/config/quant_config.hpp

csrc/models/llama/llama_attention.cpp

python/infinilm/cache/cache.py

csrc/pybind11/cache/cache.hpp

examples/jiuge.py

…quant.cpp; (2)update kv_cache_dtype handling; (3)Update Python test scripts

PanZezhong1725 · 2026-03-23T01:34:32Z

csrc/cache/kv_cache.hpp

+    std::optional<infinicore::DataType> kv_cache_dtype_ = std::nullopt;
 };

 class PagedKVCache final : public Cache {


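Having actually worked on this code, I have a more fundamental question: why is the KV cache quantization method passed to the KVCache constructor through the KV cache config instead of directly as one of the KVCache constructor's parameters? As I understand it, KV cache quantization is a model runtime algorithm and should be part of the model settings. Instantiating KVCache already requires values such as dim from the model settings, so why is the kvcache quant dtype not part of the model config? Why not obtain the KV cache quantization method from the model config as well, rather than taking it from the front end and placing it in the kvcache config?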

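First, this separates the data type from the quantization algorithm. The model side receives the quantization algorithm, and different algorithms may imply different structures, data types, and so on, whereas the KV cache itself only needs to support the corresponding quantized data type. In other words, the model receives the scheme/algo and the KV cache receives the dtype.
Second, I did not originally intend to add a quantized-KV option to model_config; the plan was to pass the information to the model through kv_config. However, because the KV cache and the model are instantiated independently, the only way to do this was to add an option to model_config.
Third, this leaves room for online quantization. With online quantization, KV quantization should not be affected by model_config and can be managed by kv_config alone.
So the underlying reason is that I see KV quantization as a method that is largely independent of the model, and it should not be controlled entirely by model_config.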
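
To illustrate the first point, here is a minimal sketch of the scheme-versus-dtype split. The names `KVQuantScheme`, `SCHEME_TO_CACHE_DTYPE`, and `resolve_kv_cache_dtype` are hypothetical and do not come from this PR. The sketch only assumes that the model side selects a quantization scheme while the cache side needs nothing beyond a storage dtype.

```python
from enum import Enum


class KVQuantScheme(Enum):
    """Hypothetical quantization schemes selected on the model side."""
    NONE = "none"
    OFFLINE_INT8 = "offline_int8"  # scales are calibrated ahead of time


# Hypothetical mapping from scheme to the dtype the cache must store.
# None means "no override": the cache keeps the model's own dtype.
SCHEME_TO_CACHE_DTYPE = {
    KVQuantScheme.NONE: None,
    KVQuantScheme.OFFLINE_INT8: "int8",
}


def resolve_kv_cache_dtype(scheme: KVQuantScheme, model_dtype: str) -> str:
    """Return the storage dtype the KV cache should allocate."""
    override = SCHEME_TO_CACHE_DTYPE[scheme]
    return model_dtype if override is None else override
```

The optional override plays the same role as a `std::optional` dtype that defaults to `std::nullopt`: the cache never needs to know which algorithm produced the values it stores.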

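KVCache construction is currently determined by two sets of configuration. One set is determined by the model and includes nhead, headdim, and so on. The other is determined by the inference scheduling system, for example how many blocks to allocate and how many batches to support. This second part is independent of the model: it affects neither the model's computation nor its parameters. It is called KVCacheConfig. Because our inference scheduling is done in Python, this class is exposed to Python, while the rest of kvcache has no Python interface. The KVCache quant dtype is a model-related parameter, because it changes the tensors the model stores and how the model computes. Under the current design, it therefore clearly does not belong in KVCache Config. The current change breaks this design, which is why the inference script, reset cache, and many other interfaces had to be modified.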
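
The two-part split described in this comment can be sketched as follows. This is an illustration only: `ModelKVParams`, `SchedulerKVConfig`, `static_kv_cache_layout`, their fields, and the dimension order of the returned shape are all assumptions made for the example, not the classes or layout used in InfiniLM.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class ModelKVParams:
    """Hypothetical model-derived parameters (not exposed to the scheduler)."""
    num_layers: int
    num_kv_heads: int
    head_dim: int
    dtype: str
    kv_cache_dtype: Optional[str] = None  # None means: store in `dtype`


@dataclass(frozen=True)
class SchedulerKVConfig:
    """Hypothetical scheduler-derived settings, independent of the model."""
    max_batch_size: int
    max_cache_len: int


def static_kv_cache_layout(
    model: ModelKVParams, sched: SchedulerKVConfig
) -> Tuple[Tuple[int, ...], str]:
    """Combine both sources into a cache shape and a storage dtype."""
    # The storage dtype comes from the model side only.
    storage_dtype = model.kv_cache_dtype or model.dtype
    # Assumed layout: (layers, batch, kv_heads, seq_len, head_dim).
    shape = (
        model.num_layers,
        sched.max_batch_size,
        model.num_kv_heads,
        sched.max_cache_len,
        model.head_dim,
    )
    return shape, storage_dtype
```

In this arrangement the scheduler-facing config carries no quantization field, so switching the cache to int8 would not require touching the inference script.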

qinyiqun requested review from a team and wooway777 March 4, 2026 07:20

qinyiqun force-pushed the Issue/253 branch from 211913f to d3be4cc Compare March 4, 2026 07:21

Issue/253: feat: support custom KV cache dtype for quantization

1fc301f

qinyiqun force-pushed the Issue/253 branch from d3be4cc to 240464b Compare March 18, 2026 09:11

Issue/253: Support offline int8 inference with calibrated models

a2a2dac

qinyiqun force-pushed the Issue/253 branch from 240464b to a2a2dac Compare March 18, 2026 09:28

qinyiqun changed the title ~~Issue/253: feat: support custom KV cache dtype for quantization~~ Issue/253: feat: support offline int8 kv cache quantization Mar 18, 2026

wooway777 reviewed Mar 18, 2026

examples/jiuge.py (resolved)

PanZezhong1725 requested changes Mar 19, 2026

csrc/pybind11/cache/cache.hpp (outdated, resolved)

qinyiqun requested review from PanZezhong1725 and wooway777 March 19, 2026 08:32

InfiniTensor deleted a comment from qinyiqun Mar 20, 2026

qinyiqun force-pushed the Issue/253 branch from 4aa8c3e to e9e5b64 Compare March 20, 2026 02:32

pengcheng888 reviewed Mar 20, 2026

examples/jiuge.py (resolved)

Issue/253: (1) Refactor attention KV cache quantization to layers/kv_…

7796a76

…quant.cpp; (2)update kv_cache_dtype handling; (3)Update Python test scripts

qinyiqun force-pushed the Issue/253 branch from e9e5b64 to 7796a76 Compare March 20, 2026 03:10

wooway777 approved these changes Mar 20, 2026

PanZezhong1725 force-pushed the Issue/253 branch 3 times, most recently from ae78252 to 788532e Compare March 20, 2026 09:10

issue/253 refine static kv cache init

51d0a81

PanZezhong1725 force-pushed the Issue/253 branch from 788532e to 51d0a81 Compare March 23, 2026 01:09

PanZezhong1725 requested changes Mar 23, 2026

qinyiqun wants to merge 4 commits into main from Issue/253

PanZezhong1725 Mar 23, 2026

qinyiqun Mar 23, 2026

PanZezhong1725 Mar 23, 2026

4 participants
