arXiv Open Access 2026

Don't Waste Bits! Adaptive KV-Cache Quantization for Lightweight On-Device LLMs

Sayed Pedram Haeri Boroujeni Niloufar Mehrabi Patrick Woods Gabriel Hillesheim Abolfazl Razi

Abstract

Large Language Models (LLMs) have achieved remarkable progress across reasoning, generation, and decision-making tasks, yet deploying them on mobile, embedded, and edge devices remains challenging. On-device LLM inference is heavily constrained by the memory and bandwidth overhead of the key-value (KV) cache, which grows linearly with context length and often dominates decoding cost. Existing KV-cache quantization schemes typically rely on fixed precision or hand-crafted heuristics, wasting bits on low-impact tokens while over-compressing informative ones and causing avoidable accuracy degradation. Inspired by Huffman coding's principle of variable-length allocation, we propose adaptive KV-cache quantization, a learned policy that assigns bit-width in proportion to token importance, minimizing expected memory and latency without sacrificing accuracy. Our framework extracts lightweight token-level features, including token frequency, quality score, attention variance, and entropy-based uncertainty, and feeds them into a compact data-driven controller that dynamically selects KV precision from {2-bit, 4-bit, 8-bit, FP16} during decoding. This adaptive precision policy reduces the KV memory footprint and latency, improves accuracy over static KV quantization and rule-based baselines, and remains close to FP16 inference across standard LLM benchmarks. Extensive experiments on multiple commonsense reasoning benchmarks using SmolLM-135M, SmolLM-360M, and SmolLM-1.7B demonstrate that our controller consistently improves the accuracy-latency trade-off. For instance, with SmolLM-360M on HellaSwag, our method reduces decoding latency (ms/token) by 17.75% relative to static KV quantization, improves accuracy by 7.60 points, and remains within only 0.30 points of FP16 inference.
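The core mechanism the abstract describes, mapping lightweight token-level features to a per-token KV bit-width, can be sketched as follows. This is a minimal illustration, not the authors' trained policy: the feature scaling, the linear scoring weights, and the decision thresholds are all assumptions chosen for the example.

```python
# Hypothetical sketch of an adaptive KV-cache precision controller:
# token-level features -> bit-width in {2, 4, 8, 16}, where 16 stands
# in for unquantized FP16 storage. Weights/thresholds are illustrative.
from dataclasses import dataclass

@dataclass
class TokenFeatures:
    frequency: float           # normalized token frequency in [0, 1]
    quality_score: float       # model-internal quality estimate in [0, 1]
    attention_variance: float  # variance of attention received, in [0, 1]
    entropy: float             # entropy-based uncertainty in [0, 1]

def importance(f: TokenFeatures) -> float:
    """Combine features into a scalar importance score in [0, 1].

    The paper learns this mapping from data; the weights here are
    made up for illustration only.
    """
    s = (0.2 * f.frequency
         + 0.3 * f.quality_score
         + 0.3 * f.attention_variance
         + 0.2 * f.entropy)
    return min(1.0, max(0.0, s))

def select_precision(f: TokenFeatures) -> int:
    """Huffman-like allocation: spend more bits on more important tokens."""
    s = importance(f)
    if s < 0.25:
        return 2
    elif s < 0.5:
        return 4
    elif s < 0.75:
        return 8
    return 16

# A low-importance filler token is compressed hard; a high-impact,
# high-uncertainty token keeps full FP16 precision.
filler = TokenFeatures(frequency=0.9, quality_score=0.1,
                       attention_variance=0.05, entropy=0.1)
salient = TokenFeatures(frequency=0.5, quality_score=0.95,
                        attention_variance=0.95, entropy=0.9)
print(select_precision(filler), select_precision(salient))  # -> 2 16
```

The threshold ladder mirrors the paper's four-level precision set; in the actual system the controller runs during decoding and its decisions directly determine how each token's K/V vectors are stored.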


Citation Format

Boroujeni, S.P.H., Mehrabi, N., Woods, P., Hillesheim, G., Razi, A. (2026). Don't Waste Bits! Adaptive KV-Cache Quantization for Lightweight On-Device LLMs. https://arxiv.org/abs/2604.04722

Journal Information
Publication Year
2026
Language
en
Source Database
arXiv
Access
Open Access ✓