arXiv Open Access 2024

Accelerating Production LLMs with Combined Token/Embedding Speculators


Abstract

This technical report describes the design and training of novel speculative decoding draft models for accelerating the inference speeds of large language models in a production environment. By conditioning draft predictions on both context vectors and sampled tokens, we can train our speculators to efficiently predict high-quality n-grams, which the base model then accepts or rejects. This allows us to effectively predict multiple tokens per inference forward pass, accelerating wall-clock inference speeds of highly optimized base model implementations by a factor of 2-3x. We explore these initial results and describe next steps for further improvements.
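The accept/reject scheme the abstract describes can be sketched as follows. This is a minimal toy illustration, not the authors' implementation: `base_next` and `draft_ngram` are hypothetical stand-ins for the base model and the trained speculator, and verification is shown as greedy token matching rather than a batched forward pass.

```python
# Minimal sketch of speculative n-gram decoding with greedy verification.
# `base_next` and `draft_ngram` are toy stand-ins, NOT the paper's models.

def base_next(context):
    # Toy "base model": deterministically maps a context to a next token id.
    return (sum(context) * 31 + 7) % 100

def draft_ngram(context, n):
    # Toy "speculator": proposes n tokens; here it guesses correctly for
    # the first two positions and then degrades, to exercise rejection.
    out, ctx = [], list(context)
    for i in range(n):
        guess = base_next(ctx) if i < 2 else 0  # wrong after 2 tokens
        out.append(guess)
        ctx.append(guess)
    return out

def speculative_step(context, n=4):
    """One decode step: draft n tokens, verify them against the base
    model, and keep the longest correct prefix plus one base-model token."""
    draft = draft_ngram(context, n)
    accepted, ctx = [], list(context)
    for tok in draft:
        target = base_next(ctx)            # what the base model would emit
        if tok == target:
            accepted.append(tok)           # draft token accepted
            ctx.append(tok)
        else:
            accepted.append(target)        # reject: emit the correction, stop
            break
    else:
        accepted.append(base_next(ctx))    # all accepted: one bonus token
    return accepted

tokens = speculative_step([1, 2, 3])
```

Because rejected positions fall back to the base model's own prediction, the output is token-for-token identical to ordinary autoregressive decoding; the speedup comes from emitting up to n+1 tokens per base-model pass instead of one.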


Authors (7)

Davis Wertheimer
Joshua Rosenkranz
Thomas Parnell
Sahil Suneja
Pavithra Ranganathan
Raghu Ganti
Mudhakar Srivatsa

Citation Format

Wertheimer, D., Rosenkranz, J., Parnell, T., Suneja, S., Ranganathan, P., Ganti, R., & Srivatsa, M. (2024). Accelerating Production LLMs with Combined Token/Embedding Speculators. https://arxiv.org/abs/2404.19124

Journal Information
Publication Year: 2024
Language: en
Source Database: arXiv
Access: Open Access ✓