arXiv Open Access 2024

Accelerating Production LLMs with Combined Token/Embedding Speculators

Davis Wertheimer, Joshua Rosenkranz, Thomas Parnell, Sahil Suneja, Pavithra Ranganathan, and 2 others

Abstract

This technical report describes the design and training of novel speculative decoding draft models for accelerating the inference speed of large language models in a production environment. By conditioning draft predictions on both context vectors and sampled tokens, we can train our speculators to efficiently predict high-quality n-grams, which the base model then accepts or rejects. This allows us to effectively predict multiple tokens per inference forward pass, accelerating the wall-clock inference speed of highly optimized base model implementations by a factor of 2-3. We explore these initial results and describe next steps for further improvements.
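
The accept-or-reject step mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not state the exact acceptance rule, so this assumes the common greedy variant of speculative decoding: the base model scores the whole drafted n-gram in one forward pass, keeps the longest prefix that matches its own greedy choices, and then contributes one token of its own. The names `accept_draft`, `draft`, and `base_preds` are hypothetical and do not come from the report.

```python
from typing import List, Sequence


def accept_draft(draft: Sequence[int], base_preds: Sequence[int]) -> List[int]:
    """Greedy verification of a drafted n-gram (assumed rule, not from the report).

    draft[i]      : i-th token proposed by the speculator.
    base_preds[i] : base model's greedy token at the same position, given the
                    prompt plus draft[:i]. It has len(draft) + 1 entries, the
                    last one being the prediction after the full draft.
    """
    if len(base_preds) != len(draft) + 1:
        raise ValueError("base_preds must have exactly len(draft) + 1 entries")

    accepted: List[int] = []
    for i, token in enumerate(draft):
        if token != base_preds[i]:
            break  # first disagreement: reject this token and everything after it
        accepted.append(token)

    # The base model always supplies one token itself: the correction at the
    # first mismatch, or a bonus token when the whole draft was accepted.
    accepted.append(base_preds[len(accepted)])
    return accepted


if __name__ == "__main__":
    # Draft of three tokens; the base model agrees with the first two only.
    print(accept_draft([5, 7, 9], [5, 7, 2, 4]))
```

Under this assumed rule every base-model forward pass emits at least one token and at most `len(draft) + 1`, which is how a well-trained speculator yields multiple tokens per pass without changing the base model's greedy output.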

Topics & Keywords

cs.CL

Authors (7)

Davis Wertheimer

Joshua Rosenkranz

Thomas Parnell

Sahil Suneja

Pavithra Ranganathan

Raghu Ganti

Mudhakar Srivatsa

Citation Format

Wertheimer, D., Rosenkranz, J., Parnell, T., Suneja, S., Ranganathan, P., Ganti, R. et al. (2024). Accelerating Production LLMs with Combined Token/Embedding Speculators. https://arxiv.org/abs/2404.19124

Journal Information

Publication Year: 2024
Language: en
Database Source: arXiv
Access: Open Access ✓