Efficient Speculative Decoding for Llama at Scale: Challenges and Solutions
Abstract
Speculative decoding is a standard method for accelerating the inference speed of large language models. However, scaling it for production environments poses several engineering challenges, including efficiently implementing different operations (e.g., tree attention and multi-round speculative decoding) on GPUs. In this paper, we detail the training and inference optimization techniques that we have implemented to enable EAGLE-based speculative decoding at production scale for Llama models. With these changes, we achieve a new state-of-the-art inference latency for Llama models. For example, Llama 4 Maverick decodes at a speed of about 4 ms per token (with a batch size of one) on 8 NVIDIA H100 GPUs, which is 10% faster than the previously best known method. Furthermore, for EAGLE-based speculative decoding, our optimizations achieve speed-ups between 1.4x and 2.0x at large batch sizes in production.
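To make the abstract's core idea concrete, the sketch below illustrates the generic draft-and-verify loop of speculative decoding with toy stand-in "models" (simple deterministic functions). It is not the paper's EAGLE drafter or Llama implementation; the point is only that one verification pass of the expensive target model can accept several cheaply drafted tokens at once, so the target runs fewer times than the number of tokens produced.

```python
def draft_model(prefix, k):
    """Cheap drafter: proposes k tokens (toy rule: next = last + 1 mod 50)."""
    out, last = [], prefix[-1]
    for _ in range(k):
        last = (last + 1) % 50
        out.append(last)
    return out

def target_model(prefix):
    """Expensive target's 'true' next token (toy rule that mostly agrees
    with the drafter, but disagrees occasionally so some drafts are rejected)."""
    last = prefix[-1]
    return (last + 2) % 50 if last % 7 == 0 else (last + 1) % 50

def speculative_decode(prompt, num_tokens, k=4):
    """Generate num_tokens tokens, counting target forward passes.

    In a real system each loop iteration is a single batched target
    forward pass that scores all k drafted positions in parallel; here
    we emulate that with per-position calls but count it as one pass.
    """
    tokens = list(prompt)
    target_passes = 0
    while len(tokens) < len(prompt) + num_tokens:
        draft = draft_model(tokens, k)
        target_passes += 1  # one (conceptually batched) verification pass
        accepted = []
        for i in range(k):
            expected = target_model(tokens + accepted)
            if draft[i] == expected:
                accepted.append(draft[i])   # draft token verified
            else:
                accepted.append(expected)   # take the target's correction, stop
                break
        else:
            # All k drafts accepted: the same pass also yields one bonus token.
            accepted.append(target_model(tokens + accepted))
        tokens.extend(accepted)
    return tokens[: len(prompt) + num_tokens], target_passes
```

With a high acceptance rate, `target_passes` stays well below `num_tokens`, which is the source of the latency gains the abstract reports; the paper's contribution lies in making this loop (tree attention, multi-round drafting) efficient on GPUs at production scale.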
Topics & Keywords
Authors (38)
Bangsheng Tang
Carl Chengyan Fu
Fei Kou
Grigory Sizov
Haoci Zhang
Jason Park
Jiawen Liu
Jie You
Qirui Yang
Sachin Mehta
Shengyong Cai
Xiaodong Wang
Xingyu Liu
Yunlu Li
Yanjun Zhou
Wei Wei
Zhiwei Zhao
Zixi Qi
Adolfo Victoria
Aya Ibrahim
Bram Wasti
Changkyu Kim
Daniel Haziza
Fei Sun
Giancarlo Delfin
Emily Guo
Jialin Ouyang
Jaewon Lee
Jianyu Huang
Jeremy Reizenstein
Lu Fang
Quinn Zhu
Ria Verma
Vlad Mihailescu
Xingwen Guo
Yan Cui
Ye Hu
Yejin Lee
Quick Access
- Year Published
- 2025
- Language
- en
- Database Source
- arXiv
- Access
- Open Access ✓