arXiv Open Access 2024

WKVQuant: Quantizing Weight and Key/Value Cache for Large Language Models Gains More

Yuxuan Yue, Zhihang Yuan, Haojie Duanmu, Sifan Zhou, Jianlong Wu, Liqiang Nie

Abstract

Large Language Models (LLMs) face significant deployment challenges due to their substantial memory requirements and the computational demands of the auto-regressive text generation process. This paper addresses these challenges by focusing on the quantization of LLMs, a technique that reduces memory consumption by converting model parameters and activations into low-bit integers. We critically analyze the existing quantization approaches, identifying their limitations in balancing the accuracy and efficiency of the quantized LLMs. To advance beyond these limitations, we propose WKVQuant, a post-training quantization (PTQ) framework especially designed for quantizing weights and the key/value (KV) cache of LLMs. Specifically, we incorporate past-only quantization to improve the computation of attention. Additionally, we introduce a two-dimensional quantization strategy to handle the distribution of the KV cache, along with a cross-block reconstruction regularization for parameter optimization. Experiments show that WKVQuant achieves almost comparable memory savings to weight-activation quantization, while also approaching the performance of weight-only quantization.
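The abstract mentions past-only quantization: during an attention step, only the cached (past) keys/values are kept in low-bit form, while the current token's key/value stay in full precision. The sketch below illustrates that idea only; the function names, bit-width, and uniform asymmetric fake-quantization are illustrative assumptions and do not reproduce the paper's two-dimensional quantization or cross-block reconstruction regularization.

```python
import torch

def fake_quantize(x, n_bits=4, dim=-1):
    # Uniform asymmetric quantize-then-dequantize along `dim`
    # (generic sketch; not the paper's exact scheme).
    qmax = 2 ** n_bits - 1
    x_min = x.amin(dim=dim, keepdim=True)
    x_max = x.amax(dim=dim, keepdim=True)
    scale = (x_max - x_min).clamp(min=1e-8) / qmax
    zero_point = (-x_min / scale).round()
    q = (x / scale + zero_point).round().clamp(0, qmax)
    return (q - zero_point) * scale

def attention_step_past_only_quant(q, k_new, v_new, k_cache, v_cache, n_bits=4):
    # Past-only quantization: only the cached keys/values are low-bit;
    # the current token's key/value enter this step in full precision.
    k_past = fake_quantize(k_cache, n_bits)
    v_past = fake_quantize(v_cache, n_bits)
    k = torch.cat([k_past, k_new], dim=1)
    v = torch.cat([v_past, v_new], dim=1)
    attn = torch.softmax(q @ k.transpose(-1, -2) / q.shape[-1] ** 0.5, dim=-1)
    return attn @ v  # [batch, 1, head_dim]

# Example: one decoding step with a 16-token cached context.
batch, head_dim, past_len = 1, 64, 16
q = torch.randn(batch, 1, head_dim)
k_new = torch.randn(batch, 1, head_dim)
v_new = torch.randn(batch, 1, head_dim)
k_cache = torch.randn(batch, past_len, head_dim)
v_cache = torch.randn(batch, past_len, head_dim)
out = attention_step_past_only_quant(q, k_new, v_new, k_cache, v_cache)
print(out.shape)  # torch.Size([1, 1, 64])
```

Keeping the current step's key/value exact while compressing only the stored cache is one way to reduce memory without injecting quantization error into the newest token's attention inputs.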


Authors (6)

Yuxuan Yue
Zhihang Yuan
Haojie Duanmu
Sifan Zhou
Jianlong Wu
Liqiang Nie

Citation Format

Yue, Y., Yuan, Z., Duanmu, H., Zhou, S., Wu, J., & Nie, L. (2024). WKVQuant: Quantizing Weight and Key/Value Cache for Large Language Models Gains More. arXiv. https://arxiv.org/abs/2402.12065

Journal Information
Publication Year: 2024
Language: English (en)
Source Database: arXiv
Access: Open Access ✓