arXiv Open Access 2025

DynaSpec: Context-aware Dynamic Speculative Sampling for Large-Vocabulary Language Models

Jinbin Zhang, Nasib Ullah, Erik Schultheis, Rohit Babbar

Abstract

Speculative decoding accelerates LLM inference by letting a small drafter propose multiple tokens, which a large target model verifies once per speculation step. As vocabularies scale past 10^5 tokens, verification cost in the target model is largely unchanged, but the drafter can become bottlenecked by its O(|V|d) output projection. Recent approaches (e.g., FR-Spec, VocabTrim) mitigate this by restricting drafting to a fixed, frequency-ranked shortlist; however, such static truncation is corpus-dependent and suppresses rare or domain-specific tokens, reducing acceptance and limiting speedups. We propose DynaSpec, a context-dependent dynamic shortlisting mechanism for large-vocabulary speculative decoding. DynaSpec trains lightweight meta-classifiers that route each context to a small set of coarse token clusters; the union of the top-selected clusters defines the drafter's shortlist, while the target model still verifies over the full vocabulary, preserving exactness. On the systems side, routing is overlapped with draft computation via parallel execution streams, reducing end-to-end overhead. Across standard speculative decoding benchmarks, DynaSpec consistently improves mean accepted length, recovering 98.4% of full-vocabulary performance for Llama-3-8B versus 93.6% for fixed-shortlist baselines, and achieves up to a 2.23x throughput gain, compared to 1.91x for static approaches, on the dataset with rare tokens.
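As a rough illustration of the shortlisting mechanism the abstract describes, the sketch below shows one plausible PyTorch realization: a lightweight linear meta-classifier scores coarse token clusters for the current context, the union of the top clusters becomes the drafter's shortlist, and the drafter's output projection is computed only over the shortlist's rows. Every name and shape here (build_shortlist, router, cluster_to_tokens, top_c, the contiguous cluster partition) is an assumption made for illustration, not the paper's actual code or API.

import torch

# Minimal sketch, assuming a PyTorch drafter with hidden size d, full
# vocabulary size V, and K coarse token clusters. All names below are
# hypothetical illustrations of the abstract's description.

def build_shortlist(hidden, router, cluster_to_tokens, top_c=4):
    """Route one context vector to its top-scoring coarse clusters and
    return the union of their token ids as the drafter's shortlist."""
    cluster_logits = router(hidden)                    # (K,) meta-classifier scores
    top_clusters = cluster_logits.topk(top_c).indices  # selected cluster ids
    shortlist = torch.cat([cluster_to_tokens[c] for c in top_clusters.tolist()])
    return torch.unique(shortlist)                     # |S| << |V| token ids

def drafter_logits_on_shortlist(hidden, out_proj_weight, shortlist):
    """Shrink the drafter's O(|V|d) output projection to O(|S|d) by
    gathering only the shortlist rows of the projection matrix."""
    w = out_proj_weight[shortlist]                     # (|S|, d)
    return hidden @ w.T                                # logits over the shortlist

# Toy sizes so the sketch runs end to end on CPU.
V, K, d = 128_000, 32, 64
hidden = torch.randn(d)                                # drafter's context vector
router = torch.nn.Linear(d, K)                         # lightweight meta-classifier
out_proj_weight = torch.randn(V, d)                    # drafter output projection
# Illustrative coarse partition of the vocabulary into K contiguous clusters.
cluster_to_tokens = [torch.arange(c * V // K, (c + 1) * V // K) for c in range(K)]

shortlist = build_shortlist(hidden, router, cluster_to_tokens)
logits = drafter_logits_on_shortlist(hidden, out_proj_weight, shortlist)
# The drafted token id is mapped back into the full vocabulary, so the target
# model still verifies over all of V and exactness is preserved.
draft_token = shortlist[logits.argmax()]

# On GPU, the routing above could be overlapped with the drafter's trunk
# forward pass on a separate CUDA stream, roughly:
#   route_stream = torch.cuda.Stream()
#   with torch.cuda.stream(route_stream):
#       shortlist = build_shortlist(hidden, router, cluster_to_tokens)
#   torch.cuda.current_stream().wait_stream(route_stream)

The paper's exact routing architecture and stream scheduling are not specified in the abstract; the point of the sketch is only that restricting the output projection to the shortlist removes the O(|V|d) drafter bottleneck while leaving the target model's full-vocabulary verification, and hence the output distribution, untouched.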

Topik & Kata Kunci

Authors (4)

Jinbin Zhang
Nasib Ullah
Erik Schultheis
Rohit Babbar

Citation Format

Zhang, J., Ullah, N., Schultheis, E., & Babbar, R. (2025). DynaSpec: Context-aware Dynamic Speculative Sampling for Large-Vocabulary Language Models. arXiv:2510.13847. https://arxiv.org/abs/2510.13847

Journal Information

Publication Year: 2025
Language: en
Source Database: arXiv
Access: Open Access ✓