arXiv Open Access 2025

Speculative Decoding Meets Quantization: Compatibility Evaluation and Hierarchical Framework Design

Yudi Zhang, Weilin Zhao, Xu Han, Tiejun Zhao, Wang Xu, Hailong Cao, Conghui Zhu

Abstract

Speculative decoding and quantization both accelerate memory-bound inference of large language models. Speculative decoding mitigates the memory-bandwidth bottleneck by verifying multiple tokens in a single forward pass, at the cost of extra computation. Quantization compresses weights and activations into lower bit-widths, and also reduces computation via low-bit matrix multiplications. To further leverage their strengths, we investigate the integration of these two techniques. Surprisingly, experiments applying the advanced speculative decoding method EAGLE-2 to various quantized models reveal that the memory benefits of 4-bit weight quantization are offset by the computational load of speculative decoding: verifying a tree-style draft incurs significantly more time overhead than a single-token forward pass on 4-bit weight-quantized models. This finding motivates our new speculative decoding design: a hierarchical framework that employs a small model as an intermediate stage to turn tree-style drafts into sequence drafts, preserving the memory-access benefits of the quantized target model. Experimental results show that our hierarchical approach achieves a 2.78$\times$ speedup across various tasks for the 4-bit weight Llama-3-70B model on an A100 GPU, outperforming EAGLE-2 by 1.31$\times$. Code is available at https://github.com/AI9Stars/SpecMQuant.
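The hierarchical pipeline the abstract describes can be illustrated with a toy sketch. Everything below is a hypothetical stand-in (the model functions are simple deterministic rules, not the authors' implementation): a drafter proposes a tree of candidates, a small intermediate model walks the tree and keeps one agreeing path (tree → sequence), and the expensive quantized target model then verifies only that single sequence, keeping its forward pass close to one-token memory cost.

```python
def drafter_tree(prefix):
    """Propose a small tree of candidate next tokens (branching factor 2)."""
    t = prefix[-1]
    return {t + 1: {t + 2: {}, t + 5: {}}, t + 3: {}}

def small_model_next(prefix):
    """Intermediate model's greedy next token (toy rule: previous + 1)."""
    return prefix[-1] + 1

def target_next(prefix):
    """Quantized target model's greedy next token (toy rule: previous + 1)."""
    return prefix[-1] + 1

def tree_to_sequence(prefix, tree):
    """Walk the draft tree, keeping only branches the small model agrees with."""
    seq, node, ctx = [], tree, list(prefix)
    while node:
        want = small_model_next(ctx)
        if want not in node:
            break
        seq.append(want)
        ctx.append(want)
        node = node[want]
    return seq

def verify_sequence(prefix, seq):
    """Target accepts the longest agreeing prefix, then emits one bonus token."""
    ctx, accepted = list(prefix), []
    for tok in seq:
        if target_next(ctx) != tok:
            break
        accepted.append(tok)
        ctx.append(tok)
    accepted.append(target_next(ctx))  # extra token from the same target pass
    return accepted

prefix = [0]
seq = tree_to_sequence(prefix, drafter_tree(prefix))  # tree flattened to [1, 2]
out = verify_sequence(prefix, seq)                    # target emits [1, 2, 3]
```

The point of the intermediate stage is that the target model never sees the draft tree at all, only a single sequence, so its per-step cost stays near that of a one-token forward pass.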

Topics & Keywords

Authors (7)

Yudi Zhang
Weilin Zhao
Xu Han
Tiejun Zhao
Wang Xu
Hailong Cao
Conghui Zhu

Citation Format

Zhang, Y., Zhao, W., Han, X., Zhao, T., Xu, W., Cao, H., & Zhu, C. (2025). Speculative Decoding Meets Quantization: Compatibility Evaluation and Hierarchical Framework Design. arXiv. https://arxiv.org/abs/2505.22179

Journal Information
Publication Year: 2025
Language: English (en)
Source Database: arXiv
Access: Open Access ✓