arXiv Open Access 2024

Cambricon-LLM: A Chiplet-Based Hybrid Architecture for On-Device Inference of 70B LLM

Zhongkai Yu, Shengwen Liang, Tianyun Ma, Yunke Cai, Ziyuan Nan, +10 more

Abstract

Deploying advanced large language models on edge devices, such as smartphones and robots, is a growing trend that enhances user data privacy and network connectivity resilience while preserving intelligent capabilities. However, such a task exhibits single-batch computing with extremely low arithmetic intensity, which poses significant challenges of a huge memory footprint and high bandwidth demands on limited edge resources. To address these issues, we introduce Cambricon-LLM, a chiplet-based hybrid architecture that combines an NPU with a dedicated NAND flash chip to enable efficient on-device inference of 70B LLMs. This hybrid architecture exploits both the high computing capability of the NPU and the data capacity of the NAND flash chip, with the proposed hardware-tiling strategy minimizing data movement overhead between the two. Specifically, the NAND flash chip, enhanced by our in-flash computing and on-die ECC techniques, excels at precise lightweight on-die processing, while the NPU collaborates with the flash chip on matrix operations and handles special-function computations beyond the flash chip's on-die processing capabilities. Overall, Cambricon-LLM enables on-device inference of 70B LLMs at 3.44 token/s and 7B LLMs at 36.34 token/s, which is 22X to 45X faster than existing flash-offloading techniques, demonstrating the potential of deploying powerful LLMs on edge devices.
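To make the abstract's motivation concrete, the following is an illustrative back-of-envelope sketch (not from the paper; the precision choices and the ~50 GB/s mobile DRAM bandwidth figure are assumptions) showing why a 70B-parameter model overwhelms edge DRAM, and why single-batch decoding is bandwidth-bound: each generated token must stream through essentially all of the model weights.

```python
# Illustrative arithmetic only; parameter sizes and bandwidth are assumptions,
# not figures from the Cambricon-LLM paper.

def model_footprint_gb(params_billion: float, bytes_per_param: float) -> float:
    """Approximate weight storage in GB for a given parameter count and precision."""
    return params_billion * 1e9 * bytes_per_param / 1e9

def decode_tokens_per_s(weight_gb: float, bandwidth_gb_s: float) -> float:
    """Upper bound on single-batch decode speed when every token reads all weights."""
    return bandwidth_gb_s / weight_gb

# 70B weights at FP16 vs INT8 (hypothetical precision choices):
fp16_gb = model_footprint_gb(70, 2.0)  # 140 GB
int8_gb = model_footprint_gb(70, 1.0)  # 70 GB

# Even if 70 GB of INT8 weights somehow fit in device memory, an assumed
# ~50 GB/s LPDDR bandwidth would cap single-batch decoding below 1 token/s,
# which is why the paper offloads weights to NAND flash with on-die processing.
print(f"70B FP16: {fp16_gb:.0f} GB, 70B INT8: {int8_gb:.0f} GB")
print(f"bandwidth-bound decode @50 GB/s, INT8: "
      f"{decode_tokens_per_s(int8_gb, 50):.2f} token/s")
```

Against these bounds, the reported 3.44 token/s for a 70B model is the abstract's headline result: the hybrid NPU/flash design sidesteps both the capacity limit and much of the DRAM traffic.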


Authors (15)

Zhongkai Yu
Shengwen Liang
Tianyun Ma
Yunke Cai
Ziyuan Nan
Di Huang
Xinkai Song
Yifan Hao
Jie Zhang
Tian Zhi
Yongwei Zhao
Zidong Du
Xing Hu
Qi Guo
Tianshi Chen

Citation Format

Yu, Z., Liang, S., Ma, T., Cai, Y., Nan, Z., Huang, D. et al. (2024). Cambricon-LLM: A Chiplet-Based Hybrid Architecture for On-Device Inference of 70B LLM. https://arxiv.org/abs/2409.15654

Journal Information
Year Published
2024
Language
en
Database Source
arXiv
Access
Open Access ✓