arXiv Open Access 2025

SpecVLM: Fast Speculative Decoding in Vision-Language Models

Haiduo Huang, Fuwei Yang, Zhenhua Liu, Xuanwu Yin, Dong Li, +2 more

Abstract

Speculative decoding is a powerful way to accelerate autoregressive large language models (LLMs), but directly porting it to vision-language models (VLMs) faces unique systems constraints: the prefill stage is dominated by visual tokens whose count scales with image resolution and video length, inflating both compute and memory, especially the key-value (KV) cache. We study speculative decoding for VLMs and introduce SpecVLM, a practical system that (1) establishes a strong EAGLE-2-style baseline, EagleVLM, delivering 1.5--2.3x end-to-end speedups over full autoregressive inference, and (2) further accelerates VLM inference with an elastic visual compressor that adaptively selects among pruning, pooling, convolution, and resampler primitives to balance FLOPs/parameters and accuracy per input. To avoid costly offline distillation corpora, we propose an online-logit distillation protocol that trains the draft model with on-the-fly teacher logits and penultimate features using a combined cross-entropy and Smooth L1 objective, eliminating storage and preprocessing while remaining compute-efficient. This protocol reveals a training-time scaling effect: longer online training monotonically increases the draft model's average accepted length, improving speculative efficiency. Empirically, SpecVLM achieves additional acceleration, culminating in 2.5--2.9x end-to-end speedups within 5 epochs across LLaVA and MMMU, consistently over resolutions and task difficulties, while preserving the target model's output distribution (lossless decoding). Our code is available at https://github.com/haiduo/SpecVLM.
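The online-logit distillation objective described above combines a cross-entropy term on the teacher's logits with a Smooth L1 term on penultimate features. A minimal NumPy sketch of such a combined loss is below; the function and argument names (`distill_loss`, `feature_weight`) and the equal default weighting are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    z = x - x.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def smooth_l1(pred, target, beta=1.0):
    """Smooth L1 (Huber-style) loss, averaged over all elements."""
    d = np.abs(pred - target)
    return np.where(d < beta, 0.5 * d ** 2 / beta, d - 0.5 * beta).mean()

def distill_loss(draft_logits, teacher_logits,
                 draft_feat, teacher_feat, feature_weight=1.0):
    """Combined draft-training objective sketch:
    soft cross-entropy against the teacher's on-the-fly logits
    plus Smooth L1 regression on penultimate hidden features.
    Logits: (batch, seq, vocab); features: (batch, seq, hidden)."""
    p = softmax(teacher_logits)                      # teacher token distribution
    log_q = np.log(softmax(draft_logits) + 1e-12)    # draft log-probs
    ce = -(p * log_q).sum(axis=-1).mean()            # soft cross-entropy
    return ce + feature_weight * smooth_l1(draft_feat, teacher_feat)
```

Because the teacher's logits and features are consumed on the fly, no offline distillation corpus needs to be stored; the loss is computed per batch directly from the teacher's forward pass.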


Authors (7)

Haiduo Huang, Fuwei Yang, Zhenhua Liu, Xuanwu Yin, Dong Li, Pengju Ren, Emad Barsoum

Citation Format

Huang, H., Yang, F., Liu, Z., Yin, X., Li, D., Ren, P. et al. (2025). SpecVLM: Fast Speculative Decoding in Vision-Language Models. https://arxiv.org/abs/2509.11815

Journal Information
Publication Year
2025
Language
en
Source Database
arXiv
Access
Open Access ✓