arXiv Open Access 2025

SpecVLM: Fast Speculative Decoding in Vision-Language Models

Haiduo Huang, Fuwei Yang, Zhenhua Liu, Xuanwu Yin, Dong Li, +2 more

Abstract

Speculative decoding is a powerful way to accelerate autoregressive large language models (LLMs), but directly porting it to vision-language models (VLMs) faces unique systems constraints: the prefill stage is dominated by visual tokens whose count scales with image resolution and video length, inflating both compute and memory, especially the key-value (KV) cache. We study speculative decoding for VLMs and introduce SpecVLM, a practical system that (1) establishes a strong EAGLE-2-style baseline, EagleVLM, delivering 1.5--2.3x end-to-end speedups over full autoregressive inference, and (2) further accelerates VLM inference with an elastic visual compressor that adaptively selects among pruning, pooling, convolution, and resampler primitives to balance FLOPs/parameters and accuracy per input. To avoid costly offline distillation corpora, we propose an online-logit distillation protocol that trains the draft model with on-the-fly teacher logits and penultimate features using a combined cross-entropy and Smooth L1 objective, eliminating storage and preprocessing while remaining compute-efficient. This protocol reveals a training-time scaling effect: longer online training monotonically increases the draft model's average accepted length, improving speculative efficiency. Empirically, SpecVLM achieves additional acceleration, culminating in 2.5--2.9x end-to-end speedups within 5 epochs across LLaVA and MMMU, consistently over resolutions and task difficulties, while preserving the target model's output distribution (lossless decoding). Our code is available at https://github.com/haiduo/SpecVLM.
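The online-logit distillation objective described above combines a cross-entropy term on the teacher's logits with a Smooth L1 term on penultimate features. A minimal NumPy sketch of such a combined loss is below; the function and argument names (`distill_loss`, `feature_weight`) and the equal default weighting are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    z = x - x.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def smooth_l1(pred, target, beta=1.0):
    """Smooth L1 (Huber-style) loss, averaged over all elements."""
    d = np.abs(pred - target)
    return np.where(d < beta, 0.5 * d ** 2 / beta, d - 0.5 * beta).mean()

def distill_loss(draft_logits, teacher_logits,
                 draft_feat, teacher_feat, feature_weight=1.0):
    """Combined draft-training objective sketch:
    soft cross-entropy against the teacher's on-the-fly logits
    plus Smooth L1 regression on penultimate hidden features.
    Logits: (batch, seq, vocab); features: (batch, seq, hidden)."""
    p = softmax(teacher_logits)                      # teacher token distribution
    log_q = np.log(softmax(draft_logits) + 1e-12)    # draft log-probs
    ce = -(p * log_q).sum(axis=-1).mean()            # soft cross-entropy
    return ce + feature_weight * smooth_l1(draft_feat, teacher_feat)
```

Because the teacher's logits and features are consumed on the fly, no offline distillation corpus needs to be stored; the loss is computed per batch directly from the teacher's forward pass.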


Authors (7)

Haiduo Huang, Fuwei Yang, Zhenhua Liu, Xuanwu Yin, Dong Li, Pengju Ren, Emad Barsoum

Citation Format

Huang, H., Yang, F., Liu, Z., Yin, X., Li, D., Ren, P. et al. (2025). SpecVLM: Fast Speculative Decoding in Vision-Language Models. https://arxiv.org/abs/2509.11815

Journal Information
Publication Year
2025
Language
en
Source Database
arXiv
Access
Open Access ✓