arXiv Open Access 2024

GliDe with a CaPE: A Low-Hassle Method to Accelerate Speculative Decoding

Cunxiao Du, Jing Jiang, Xu Yuanchen, Jiawei Wu, Sicheng Yu, +6 others

Abstract

Speculative decoding is a relatively new decoding framework that leverages small and efficient draft models to reduce the latency of LLMs. In this study, we introduce GliDe and CaPE, two low-hassle modifications to vanilla speculative decoding that further improve the decoding speed of a frozen LLM. Specifically, GliDe is a modified draft model architecture that reuses the cached keys and values from the target LLM, while CaPE is a proposal expansion method that uses the draft model's confidence scores to help select additional candidate tokens for verification. Extensive experiments on different benchmarks demonstrate that our proposed GliDe draft model significantly reduces the expected decoding latency. Additional evaluation using wall-clock time reveals that GliDe can accelerate Vicuna models by up to 2.17x, and that CaPE extends the speedup to 2.61x. We will release our code, data, and the trained draft models.
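
The confidence-based proposal expansion described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `pick_width` and `expand_proposals`, the thresholds (0.8, 0.5), and the candidate counts (1, 2, 4) are all hypothetical values chosen for illustration. The sketch only shows the general idea that a draft position with a low top-1 probability receives more candidate tokens for the target model to verify.

```python
from typing import List, Sequence


def pick_width(confidence: float,
               thresholds: Sequence[float] = (0.8, 0.5),
               widths: Sequence[int] = (1, 2, 4)) -> int:
    """Map the draft model's top-1 probability to a candidate count.

    The thresholds and widths are hypothetical, not taken from the paper.
    Higher confidence yields fewer candidates.
    """
    for threshold, width in zip(thresholds, widths):
        if confidence >= threshold:
            return width
    # Confidence is below every threshold: use the widest setting.
    return widths[-1]


def expand_proposals(step_probs: Sequence[Sequence[float]]) -> List[List[int]]:
    """For each draft step, return the token ids to send for verification.

    step_probs[i][t] is the draft model's probability of token t at step i.
    Tokens are ranked by descending probability and truncated to a
    confidence-dependent width.
    """
    proposals = []
    for probs in step_probs:
        ranked = sorted(range(len(probs)), key=lambda t: probs[t], reverse=True)
        width = pick_width(probs[ranked[0]])
        proposals.append(ranked[:width])
    return proposals


if __name__ == "__main__":
    # Toy distributions over a 4-token vocabulary for three draft steps.
    toy_probs = [
        [0.90, 0.05, 0.03, 0.02],  # confident: keep only the top-1 token
        [0.15, 0.40, 0.10, 0.35],  # uncertain: keep all four candidates
        [0.30, 0.05, 0.60, 0.05],  # in between: keep the top two
    ]
    print(expand_proposals(toy_probs))  # [[0], [1, 3, 0, 2], [2, 0]]
```

In a full speculative decoding loop, the target model would then score these candidates in a single forward pass and keep the tokens it agrees with; that verification step is omitted here because the abstract does not specify it.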

Topics & Keywords

cs.CL

Authors (11)

Cunxiao Du

Jing Jiang

Xu Yuanchen

Jiawei Wu

Sicheng Yu

Yongqi Li

Shenggui Li

Kai Xu

Liqiang Nie

Zhaopeng Tu

Yang You

Citation Format

Du, C., Jiang, J., Yuanchen, X., Wu, J., Yu, S., Li, Y. et al. (2024). GliDe with a CaPE: A Low-Hassle Method to Accelerate Speculative Decoding. https://arxiv.org/abs/2402.02082

Journal Information

Publication Year: 2024
Language: en
Source Database: arXiv
Access: Open Access ✓