arXiv Open Access 2025

FR-Spec: Accelerating Large-Vocabulary Language Models via Frequency-Ranked Speculative Sampling

Weilin Zhao, Tengyu Pan, Xu Han, Yudi Zhang, Ao Sun, +7 others

Abstract

Speculative sampling has emerged as an important technique for accelerating the auto-regressive generation process of large language models (LLMs) by utilizing a draft-then-verify mechanism to produce multiple tokens per forward pass. While state-of-the-art speculative sampling methods use only a single layer and a language modeling (LM) head as the draft model to achieve impressive layer compression, their efficiency gains are substantially reduced for large-vocabulary LLMs, such as Llama-3-8B with a vocabulary of 128k tokens. To address this, we present FR-Spec, a frequency-ranked speculative sampling framework that optimizes draft candidate selection through vocabulary space compression. By constraining the draft search to a frequency-prioritized token subset, our method reduces LM Head computation overhead by 75% while ensuring the equivalence of the final output distribution. Experiments across multiple datasets demonstrate an average of 1.12$\times$ speedup over the state-of-the-art speculative sampling method EAGLE-2. Code available at https://github.com/thunlp/FR-Spec.
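The core idea in the abstract, restricting the draft model's LM-head computation to a frequency-prioritized subset of the vocabulary, can be illustrated with a minimal numpy sketch. The sizes, the random weights, and the simulated frequency ranking below are illustrative assumptions, not the paper's actual setup; in FR-Spec the ranking would come from corpus token statistics and verification is handled by the full target model.

```python
import numpy as np

# Illustrative sizes: large vocabulary, small hidden dim for the sketch.
V, d = 128_000, 64
k = V // 4  # keep the top 25% of tokens -> ~75% less LM-head compute

rng = np.random.default_rng(0)
W = rng.standard_normal((V, d)).astype(np.float32)  # full LM-head weight
h = rng.standard_normal(d).astype(np.float32)       # draft hidden state

# Stand-in for an offline frequency ranking of token ids.
freq_rank = rng.permutation(V)
subset = np.sort(freq_rank[:k])  # ids of the k most frequent tokens

# Draft step: compute logits only over the frequency-ranked subset,
# a (k x d) matmul instead of the full (V x d) one.
logits_sub = W[subset] @ h
p = np.exp(logits_sub - logits_sub.max())
p /= p.sum()
draft_token = subset[rng.choice(k, p=p)]

# Verification still scores drafts with the full-vocabulary target model,
# which is why the final output distribution is unchanged.
full_logits = W @ h
```

Because the subset logits are exactly the corresponding rows of the full LM-head projection, the draft only narrows the candidate search; acceptance/rejection against the full model preserves the target distribution.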


Authors (12)

Weilin Zhao
Tengyu Pan
Xu Han
Yudi Zhang
Ao Sun
Yuxiang Huang
Kaihuo Zhang
Weilun Zhao
Yuxuan Li
Jianyong Wang
Zhiyuan Liu
Maosong Sun

Citation Format

Zhao, W., Pan, T., Han, X., Zhang, Y., Sun, A., Huang, Y. et al. (2025). FR-Spec: Accelerating Large-Vocabulary Language Models via Frequency-Ranked Speculative Sampling. https://arxiv.org/abs/2502.14856

Journal Information
Publication Year: 2025
Language: en
Source Database: arXiv
Access: Open Access ✓