arXiv Open Access 2024

TurboSpec: Closed-loop Speculation Control System for Optimizing LLM Serving Goodput


Abstract

Large Language Model (LLM) serving systems batch concurrent user requests to achieve efficient serving. However, in real-world deployments, such inter-request parallelism from batching is often limited by external factors such as low request rates or memory constraints. Recent works turn to intra-request parallelism from speculative decoding as a solution to this problem. Unfortunately, the benefits of intra-request parallelism are often fragile, as speculative decoding incurs overhead and speculated tokens may be rejected. We observe that speculative decoding can degrade LLM serving performance if added naively, without tuning to the incoming requests and the speculation method. To alleviate the need for expert tuning and make speculative decoding more robust, we present TurboSpec, a speculation control system that automatically profiles the execution environment and uses a feedback-based algorithm to dynamically adjust the amount of intra-request parallelism in LLM serving. TurboSpec predicts "goodput" - the number of successfully generated tokens - to evaluate candidate amounts of intra-request parallelism and select the one with the highest goodput at runtime. We implement TurboSpec on vLLM, a real-world LLM serving system, and demonstrate its effectiveness across diverse workloads and hardware configurations, providing consistent performance improvements across all test scenarios.
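The feedback loop described in the abstract can be illustrated with a minimal sketch: pick the speculation length whose predicted goodput (accepted tokens per second) is highest, falling back to plain decoding when speculation would hurt. All function names and the cost model below are illustrative assumptions for exposition, not the paper's actual algorithm or profiler.

```python
# Hypothetical sketch of goodput-based speculation control in the spirit of
# TurboSpec. Names (predicted_goodput, choose_speculation_length) and the
# step_time cost model are invented for illustration.

def predicted_goodput(k, batch_size, accept_rate, step_time):
    """Predict goodput (accepted tokens/second) for speculation length k.

    With k speculated tokens and per-token acceptance rate a, the expected
    number of tokens committed per verification step is the standard
    speculative-decoding estimate (1 - a^(k+1)) / (1 - a); k = 0 commits
    exactly one token (plain decoding).
    """
    if accept_rate >= 1.0:
        accepted = k + 1
    else:
        accepted = (1 - accept_rate ** (k + 1)) / (1 - accept_rate)
    return batch_size * accepted / step_time(k, batch_size)

def choose_speculation_length(batch_size, accept_rate, step_time, k_max=8):
    """Pick the speculation length with the highest predicted goodput.

    Including k = 0 lets the controller disable speculation entirely when
    the overhead outweighs the gain (e.g., at large batch sizes).
    """
    return max(
        range(k_max + 1),
        key=lambda k: predicted_goodput(k, batch_size, accept_rate, step_time),
    )

if __name__ == "__main__":
    # Toy cost model: step time grows with both batch size and speculation
    # length (verification work scales with k * batch_size).
    step_time = lambda k, b: 0.02 + 0.002 * k * b
    print(choose_speculation_length(1, 0.8, step_time))   # small batch: speculate
    print(choose_speculation_length(32, 0.8, step_time))  # large batch: k = 0
```

Under this toy model the controller speculates aggressively at batch size 1 but disables speculation at batch size 32, matching the paper's observation that speculation helps only when inter-request parallelism is limited.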

Topics & Keywords

Authors (13)

Xiaoxuan Liu
Jongseok Park
Langxiang Hu
Woosuk Kwon
Zhuohan Li
Chen Zhang
Kuntai Du
Xiangxi Mo
Kaichao You
Alvin Cheung
Zhijie Deng
Ion Stoica
Hao Zhang

Citation Format

Liu, X., Park, J., Hu, L., Kwon, W., Li, Z., Zhang, C. et al. (2024). TurboSpec: Closed-loop Speculation Control System for Optimizing LLM Serving Goodput. https://arxiv.org/abs/2406.14066

Quick Access

Journal Information
Publication Year
2024
Language
en
Source Database
arXiv
Access
Open Access ✓