arXiv Open Access 2026

Don't Waste Bits! Adaptive KV-Cache Quantization for Lightweight On-Device LLMs

Sayed Pedram Haeri Boroujeni Niloufar Mehrabi Patrick Woods Gabriel Hillesheim Abolfazl Razi

Abstract

Large Language Models (LLMs) have achieved remarkable progress across reasoning, generation, and decision-making tasks, yet deploying them on mobile, embedded, and edge devices remains challenging. On-device LLM inference is heavily constrained by the memory and bandwidth overhead of the key-value (KV) cache, which grows linearly with context length and often dominates decoding cost. Existing KV-cache quantization schemes typically rely on fixed precision or hand-crafted heuristics, wasting bits on low-impact tokens while over-compressing informative ones and causing avoidable accuracy degradation. Inspired by Huffman coding's principle of variable-length allocation, we propose adaptive KV-cache quantization, a learned policy that assigns bit-width in proportion to token importance, minimizing expected memory and latency without sacrificing accuracy. Our framework extracts lightweight token-level features, including token frequency, quality score, attention variance, and entropy-based uncertainty, and feeds them into a compact data-driven controller that dynamically selects KV precision from {2-bit, 4-bit, 8-bit, FP16} during decoding. This adaptive precision policy reduces the KV memory footprint and latency, improves accuracy over static KV quantization and rule-based baselines, and remains close to FP16 inference across standard LLM benchmarks. Extensive experiments on multiple commonsense reasoning benchmarks using SmolLM-135M, SmolLM-360M, and SmolLM-1.7B demonstrate that our controller consistently improves the accuracy-latency trade-off. For instance, with SmolLM-360M on HellaSwag, our method reduces decoding latency (ms/token) by 17.75% relative to static KV quantization, improves accuracy by 7.60 points, and remains within only 0.30 points of FP16 inference.
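The core mechanism the abstract describes, mapping lightweight token-level features to a per-token KV bit-width, can be sketched as follows. This is a minimal illustration, not the authors' trained policy: the feature scaling, the linear scoring weights, and the decision thresholds are all assumptions chosen for the example.

```python
# Hypothetical sketch of an adaptive KV-cache precision controller:
# token-level features -> bit-width in {2, 4, 8, 16}, where 16 stands
# in for unquantized FP16 storage. Weights/thresholds are illustrative.
from dataclasses import dataclass

@dataclass
class TokenFeatures:
    frequency: float           # normalized token frequency in [0, 1]
    quality_score: float       # model-internal quality estimate in [0, 1]
    attention_variance: float  # variance of attention received, in [0, 1]
    entropy: float             # entropy-based uncertainty in [0, 1]

def importance(f: TokenFeatures) -> float:
    """Combine features into a scalar importance score in [0, 1].

    The paper learns this mapping from data; the weights here are
    made up for illustration only.
    """
    s = (0.2 * f.frequency
         + 0.3 * f.quality_score
         + 0.3 * f.attention_variance
         + 0.2 * f.entropy)
    return min(1.0, max(0.0, s))

def select_precision(f: TokenFeatures) -> int:
    """Huffman-like allocation: spend more bits on more important tokens."""
    s = importance(f)
    if s < 0.25:
        return 2
    elif s < 0.5:
        return 4
    elif s < 0.75:
        return 8
    return 16

# A low-importance filler token is compressed hard; a high-impact,
# high-uncertainty token keeps full FP16 precision.
filler = TokenFeatures(frequency=0.9, quality_score=0.1,
                       attention_variance=0.05, entropy=0.1)
salient = TokenFeatures(frequency=0.5, quality_score=0.95,
                        attention_variance=0.95, entropy=0.9)
print(select_precision(filler), select_precision(salient))  # -> 2 16
```

The threshold ladder mirrors the paper's four-level precision set; in the actual system the controller runs during decoding and its decisions directly determine how each token's K/V vectors are stored.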


Citation Format

Boroujeni, S.P.H., Mehrabi, N., Woods, P., Hillesheim, G., Razi, A. (2026). Don't Waste Bits! Adaptive KV-Cache Quantization for Lightweight On-Device LLMs. https://arxiv.org/abs/2604.04722

Journal Information
Publication Year
2026
Language
en
Source Database
arXiv
Access
Open Access ✓