Efficient Speculative Decoding for Llama at Scale: Challenges and Solutions
Abstract
Speculative decoding is a standard method for accelerating the inference speed of large language models. However, scaling it for production environments poses several engineering challenges, including efficiently implementing different operations (e.g., tree attention and multi-round speculative decoding) on GPUs. In this paper, we detail the training and inference optimization techniques that we have implemented to enable EAGLE-based speculative decoding at production scale for Llama models. With these changes, we achieve a new state-of-the-art inference latency for Llama models. For example, Llama 4 Maverick decodes at a speed of about 4 ms per token (with a batch size of one) on 8 NVIDIA H100 GPUs, which is 10% faster than the previously best known method. Furthermore, for EAGLE-based speculative decoding, our optimizations achieve speed-ups between 1.4x and 2.0x at large batch sizes in production.
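To make the abstract's core idea concrete, the sketch below illustrates the generic draft-and-verify loop of speculative decoding with toy stand-in "models" (simple deterministic functions). It is not the paper's EAGLE drafter or Llama implementation; the point is only that one verification pass of the expensive target model can accept several cheaply drafted tokens at once, so the target runs fewer times than the number of tokens produced.

```python
def draft_model(prefix, k):
    """Cheap drafter: proposes k tokens (toy rule: next = last + 1 mod 50)."""
    out, last = [], prefix[-1]
    for _ in range(k):
        last = (last + 1) % 50
        out.append(last)
    return out

def target_model(prefix):
    """Expensive target's 'true' next token (toy rule that mostly agrees
    with the drafter, but disagrees occasionally so some drafts are rejected)."""
    last = prefix[-1]
    return (last + 2) % 50 if last % 7 == 0 else (last + 1) % 50

def speculative_decode(prompt, num_tokens, k=4):
    """Generate num_tokens tokens, counting target forward passes.

    In a real system each loop iteration is a single batched target
    forward pass that scores all k drafted positions in parallel; here
    we emulate that with per-position calls but count it as one pass.
    """
    tokens = list(prompt)
    target_passes = 0
    while len(tokens) < len(prompt) + num_tokens:
        draft = draft_model(tokens, k)
        target_passes += 1  # one (conceptually batched) verification pass
        accepted = []
        for i in range(k):
            expected = target_model(tokens + accepted)
            if draft[i] == expected:
                accepted.append(draft[i])   # draft token verified
            else:
                accepted.append(expected)   # take the target's correction, stop
                break
        else:
            # All k drafts accepted: the same pass also yields one bonus token.
            accepted.append(target_model(tokens + accepted))
        tokens.extend(accepted)
    return tokens[: len(prompt) + num_tokens], target_passes
```

With a high acceptance rate, `target_passes` stays well below `num_tokens`, which is the source of the latency gains the abstract reports; the paper's contribution lies in making this loop (tree attention, multi-round drafting) efficient on GPUs at production scale.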
Topics & Keywords
Authors (38)
Bangsheng Tang
Carl Chengyan Fu
Fei Kou
Grigory Sizov
Haoci Zhang
Jason Park
Jiawen Liu
Jie You
Qirui Yang
Sachin Mehta
Shengyong Cai
Xiaodong Wang
Xingyu Liu
Yunlu Li
Yanjun Zhou
Wei Wei
Zhiwei Zhao
Zixi Qi
Adolfo Victoria
Aya Ibrahim
Bram Wasti
Changkyu Kim
Daniel Haziza
Fei Sun
Giancarlo Delfin
Emily Guo
Jialin Ouyang
Jaewon Lee
Jianyu Huang
Jeremy Reizenstein
Lu Fang
Quinn Zhu
Ria Verma
Vlad Mihailescu
Xingwen Guo
Yan Cui
Ye Hu
Yejin Lee
Quick Access
- Year Published
- 2025
- Language
- en
- Database Source
- arXiv
- Access
- Open Access ✓