arXiv Open Access 2024

Cambricon-LLM: A Chiplet-Based Hybrid Architecture for On-Device Inference of 70B LLM

Zhongkai Yu, Shengwen Liang, Tianyun Ma, Yunke Cai, Ziyuan Nan, +10 others

Abstract

Deploying advanced large language models (LLMs) on edge devices, such as smartphones and robots, is a growing trend that enhances user data privacy and network connectivity resilience while preserving intelligent capabilities. However, such a task involves single-batch computing with extremely low arithmetic intensity, which poses two significant challenges on limited edge resources: a huge memory footprint and high bandwidth demands. To address these issues, we introduce Cambricon-LLM, a chiplet-based hybrid architecture with an NPU and a dedicated NAND flash chip to enable efficient on-device inference of 70B LLMs. This hybrid architecture combines the high computing capability of the NPU with the data capacity of the NAND flash chip, and uses the proposed hardware-tiling strategy to minimize the data movement overhead between the NPU and the NAND flash chip. Specifically, the NAND flash chip, enhanced by our innovative in-flash computing and on-die ECC techniques, excels at performing precise lightweight on-die processing. Simultaneously, the NPU collaborates with the flash chip for matrix operations and handles special function computations beyond the flash's on-die processing capabilities. Overall, Cambricon-LLM enables on-device inference of 70B LLMs at a speed of 3.44 token/s and of 7B LLMs at a speed of 36.34 token/s, which is 22X to 45X faster than existing flash-offloading technologies, showing the potential of deploying powerful LLMs on edge devices.
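
The bandwidth claim in the abstract can be made concrete with a rough back-of-the-envelope sketch. It assumes that single-batch decoding reads every weight once per generated token and that weights are stored at 1 byte each; the abstract does not state the precision, so `BYTES_PER_WEIGHT` is an assumption, and the function names are hypothetical. KV-cache and activation traffic are ignored.

```python
# Rough estimate of arithmetic intensity and weight traffic for batch-1 decoding.
# ASSUMPTION: 1 byte per weight (e.g. 8-bit quantization); not stated in the abstract.
BYTES_PER_WEIGHT = 1.0


def arithmetic_intensity_gemv(bytes_per_weight: float) -> float:
    """FLOPs per byte of weight traffic for a batch-1 matrix-vector product.

    Each weight is read once and used for one multiply and one add (2 FLOPs).
    """
    return 2.0 / bytes_per_weight


def implied_weight_traffic_gb_s(params_billion: float,
                                tokens_per_s: float,
                                bytes_per_weight: float) -> float:
    """GB/s of weight reads if every weight is streamed once per token."""
    return params_billion * bytes_per_weight * tokens_per_s


if __name__ == "__main__":
    # Token rates are the figures reported in the abstract.
    for name, params_b, tok_s in [("70B", 70.0, 3.44), ("7B", 7.0, 36.34)]:
        traffic = implied_weight_traffic_gb_s(params_b, tok_s, BYTES_PER_WEIGHT)
        print(f"{name}: ~{traffic:.1f} GB/s of weight reads at {tok_s} token/s")
    print(f"FLOPs per weight byte: {arithmetic_intensity_gemv(BYTES_PER_WEIGHT):.1f}")
    print(f"7B/70B speed ratio: {36.34 / 3.44:.2f}")
```

Under these assumptions, the two reported token rates imply similar weight-read throughput, and the speed ratio between the 7B and 70B models is close to the 10X ratio in parameter count. Both observations are consistent with a workload limited by data movement rather than by compute.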

Topics & Keywords

cs.AR

Authors (15)

Zhongkai Yu

Shengwen Liang

Tianyun Ma

Yunke Cai

Ziyuan Nan

Di Huang

Xinkai Song

Yifan Hao

Jie Zhang

Tian Zhi

Yongwei Zhao

Zidong Du

Xing Hu

Qi Guo

Tianshi Chen

Citation Format

Yu, Z., Liang, S., Ma, T., Cai, Y., Nan, Z., Huang, D. et al. (2024). Cambricon-LLM: A Chiplet-Based Hybrid Architecture for On-Device Inference of 70B LLM. https://arxiv.org/abs/2409.15654

Journal Information

Year Published: 2024
Language: en
Source Database: arXiv
Access: Open Access ✓