arXiv Open Access 2025

AdaSpec: Adaptive Speculative Decoding for Fast, SLO-Aware Large Language Model Serving

Kaiyu Huang, Hao Wu, Zhubo Shi, Han Zou, Minchen Yu, and 1 other


Abstract

Cloud-based Large Language Model (LLM) services often face challenges in achieving low inference latency and meeting Service Level Objectives (SLOs) under dynamic request patterns. Speculative decoding, which exploits lightweight models for drafting and LLMs for verification, has emerged as a compelling technique to accelerate LLM inference. However, existing speculative decoding solutions often fail to adapt to fluctuating workloads and dynamic system environments, resulting in impaired performance and SLO violations. In this paper, we introduce AdaSpec, an efficient LLM inference system that dynamically adjusts speculative strategies according to real-time request loads and system configurations. AdaSpec proposes a theoretical model to analyze and predict the efficiency of speculative strategies across diverse scenarios. Additionally, it implements intelligent drafting and verification algorithms to maximize performance while ensuring high SLO attainment. Experimental results on real-world LLM service traces demonstrate that AdaSpec consistently meets SLOs and achieves substantial performance improvements, delivering up to 66% speedup compared to state-of-the-art speculative inference systems. The source code is publicly available at https://github.com/cerebellumking/AdaSpec
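To make concrete the idea of predicting how efficient a speculative strategy is, here is a minimal sketch. It is not AdaSpec's theoretical model, which the abstract does not specify. It uses the common textbook simplification in which each drafted token is accepted independently with probability `alpha`, and the draft model costs a fixed fraction `cost_ratio` of one target-model pass. The function names and the parameters `alpha`, `gamma`, `cost_ratio`, and `max_gamma` are all illustrative assumptions.

```python
def expected_tokens_per_step(alpha: float, gamma: int) -> float:
    """Expected tokens emitted per target-model verification pass.

    Assumption: each of the `gamma` drafted tokens is accepted independently
    with probability `alpha`, and the verification pass always contributes
    one token itself (a correction or a bonus token).
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    if gamma < 0:
        raise ValueError("gamma must be non-negative")
    if alpha == 1.0:
        # Every draft token is accepted, plus the bonus token.
        return float(gamma + 1)
    # Geometric series: 1 + alpha + alpha^2 + ... + alpha^gamma
    return (1.0 - alpha ** (gamma + 1)) / (1.0 - alpha)


def estimated_speedup(alpha: float, gamma: int, cost_ratio: float) -> float:
    """Estimated speedup over plain autoregressive decoding.

    Assumption: one step costs `gamma` draft passes (each `cost_ratio` of a
    target pass) plus one target pass, regardless of batch size or load.
    """
    return expected_tokens_per_step(alpha, gamma) / (gamma * cost_ratio + 1.0)


def best_draft_length(alpha: float, cost_ratio: float, max_gamma: int = 16) -> int:
    """Pick the draft length in [0, max_gamma] with the highest estimated speedup."""
    return max(
        range(max_gamma + 1),
        key=lambda g: estimated_speedup(alpha, g, cost_ratio),
    )


if __name__ == "__main__":
    for alpha in (0.5, 0.7, 0.9):
        g = best_draft_length(alpha, cost_ratio=0.05)
        print(alpha, g, round(estimated_speedup(alpha, g, 0.05), 3))
```

Even in this toy setting the best draft length shifts with the acceptance rate and the relative draft cost, which is the kind of dependence an adaptive system would need to track. A real serving system would also have to account for batching, queueing, and SLO deadlines, none of which this sketch models.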

Topics & Keywords

cs.CL

Authors (6)

Kaiyu Huang

Hao Wu

Zhubo Shi

Han Zou

Minchen Yu

Qingjiang Shi

Citation Format


Huang, K., Wu, H., Shi, Z., Zou, H., Yu, M., & Shi, Q. (2025). AdaSpec: Adaptive Speculative Decoding for Fast, SLO-Aware Large Language Model Serving. https://arxiv.org/abs/2503.05096


Journal Information

Publication Year: 2025
Language: en
Source Database: arXiv
Access: Open Access ✓