arXiv Open Access 2025

FR-Spec: Accelerating Large-Vocabulary Language Models via Frequency-Ranked Speculative Sampling

Weilin Zhao, Tengyu Pan, Xu Han, Yudi Zhang, Ao Sun, +7 others

Abstract

Speculative sampling has emerged as an important technique for accelerating the auto-regressive generation process of large language models (LLMs) by utilizing a draft-then-verify mechanism to produce multiple tokens per forward pass. While state-of-the-art speculative sampling methods use only a single layer and a language modeling (LM) head as the draft model to achieve impressive layer compression, their efficiency gains are substantially reduced for large-vocabulary LLMs, such as Llama-3-8B with a vocabulary of 128k tokens. To address this, we present FR-Spec, a frequency-ranked speculative sampling framework that optimizes draft candidate selection through vocabulary space compression. By constraining the draft search to a frequency-prioritized token subset, our method reduces LM Head computation overhead by 75% while ensuring the equivalence of the final output distribution. Experiments across multiple datasets demonstrate an average of 1.12$\times$ speedup over the state-of-the-art speculative sampling method EAGLE-2. Code available at https://github.com/thunlp/FR-Spec.
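The core idea in the abstract, restricting the draft model's LM-head computation to a frequency-prioritized subset of the vocabulary, can be illustrated with a minimal numpy sketch. The sizes, the random weights, and the simulated frequency ranking below are illustrative assumptions, not the paper's actual setup; in FR-Spec the ranking would come from corpus token statistics and verification is handled by the full target model.

```python
import numpy as np

# Illustrative sizes: large vocabulary, small hidden dim for the sketch.
V, d = 128_000, 64
k = V // 4  # keep the top 25% of tokens -> ~75% less LM-head compute

rng = np.random.default_rng(0)
W = rng.standard_normal((V, d)).astype(np.float32)  # full LM-head weight
h = rng.standard_normal(d).astype(np.float32)       # draft hidden state

# Stand-in for an offline frequency ranking of token ids.
freq_rank = rng.permutation(V)
subset = np.sort(freq_rank[:k])  # ids of the k most frequent tokens

# Draft step: compute logits only over the frequency-ranked subset,
# a (k x d) matmul instead of the full (V x d) one.
logits_sub = W[subset] @ h
p = np.exp(logits_sub - logits_sub.max())
p /= p.sum()
draft_token = subset[rng.choice(k, p=p)]

# Verification still scores drafts with the full-vocabulary target model,
# which is why the final output distribution is unchanged.
full_logits = W @ h
```

Because the subset logits are exactly the corresponding rows of the full LM-head projection, the draft only narrows the candidate search; acceptance/rejection against the full model preserves the target distribution.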


Authors (12)

Weilin Zhao
Tengyu Pan
Xu Han
Yudi Zhang
Ao Sun
Yuxiang Huang
Kaihuo Zhang
Weilun Zhao
Yuxuan Li
Jianyong Wang
Zhiyuan Liu
Maosong Sun

Citation Format

Zhao, W., Pan, T., Han, X., Zhang, Y., Sun, A., Huang, Y. et al. (2025). FR-Spec: Accelerating Large-Vocabulary Language Models via Frequency-Ranked Speculative Sampling. https://arxiv.org/abs/2502.14856

Journal Information
Publication Year: 2025
Language: en
Source Database: arXiv
Access: Open Access ✓