
A Switch-Centric In-Network Architecture for Accelerating LLM Inference in Shared-Memory Network

Aojie Jiang, Kang Zhu, Zhiheng Zhang, Zhengxu Su, Juntao Liu, Yuan Du, Li Du

Abstract

In-network computing techniques, exemplified by NVLink SHARP (NVLS), offer a promising approach to addressing the communication bottlenecks in LLM inference by offloading collective operations such as All-Reduce to switches. However, the accelerator-centric architecture of NVLS suffers from two fundamental limitations: 1) it relies on GPU load instructions to trigger in-switch reduction, which means that the data reduced in the switch must be transferred back to the initiating GPU rather than being broadcast directly, thereby introducing unnecessary communication overhead; 2) due to its architectural constraints, NVLS cannot offload operators that are not decomposable into memory-semantic instructions, such as the in-network quantization (INQ) proposed in this work. As a result, All-Reduce in NVLS during inference still operates at 16-bit precision, leading to substantial bandwidth waste. To address these limitations, we propose SCIN, the first switch-centric in-network architecture for multi-accelerator shared-memory networks, enabling both low-latency and high-bandwidth All-Reduce. Specifically, we introduce an in-switch accelerator (ISA) capable of directly accessing the memory regions in attached accelerators for in-network processing, together with a co-designed communication fabric that enables such access with negligible protocol overhead. SCIN delivers lower All-Reduce latency than NVLS by eliminating redundant data movement. Moreover, SCIN enables INQ for All-Reduce, reducing its precision to 8 bits and nearly doubling bandwidth with negligible accuracy loss. We also present a multi-FPGA prototype of SCIN to validate its feasibility and effectiveness. Simulation results for an 8-GPU system show that our design accelerates All-Reduce by up to 8.7x for small messages and 3.8x for large messages, yielding up to 1.74x TTFT speedup and 1.34x TPOT speedup on LLaMA-2 models.
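The abstract does not spell out the INQ algorithm, but its bandwidth claim is easy to see in miniature. Below is a hypothetical NumPy sketch (not the authors' implementation; the function names, the block size of 128, and symmetric per-block scaling are all assumptions) of a quantized All-Reduce: each rank compresses its FP16 shard to an INT8 payload plus per-block scales, and a model "switch" dequantizes, sums, and broadcasts one result. Since the INT8 payload carries ~8 bits per element instead of 16, the on-wire volume is roughly halved, which is the source of the "nearly doubling bandwidth" claim.

```python
import numpy as np

def quantize_int8(x, block=128):
    """Block-wise symmetric quantization: FP16 -> (INT8 payload, FP16 scales).

    One scale per `block`-element chunk, so the wire format costs about
    8 bits/element plus a small per-block overhead -- roughly half the
    bytes of the 16-bit payload a non-quantized All-Reduce must carry.
    """
    x = x.astype(np.float32).reshape(-1, block)
    scale = np.abs(x).max(axis=1, keepdims=True) / 127.0
    scale = np.where(scale == 0, 1.0, scale)           # guard all-zero blocks
    q = np.clip(np.rint(x / scale), -127, 127).astype(np.int8)
    return q, scale.astype(np.float16)

def dequantize(q, scale):
    return (q.astype(np.float32) * scale.astype(np.float32)).ravel()

def inq_all_reduce(shards):
    """Toy switch-side All-Reduce (hypothetical): each rank sends an
    INT8-quantized shard; the 'switch' dequantizes, accumulates in FP32,
    and broadcasts a single reduced tensor to all ranks."""
    total = None
    for x in shards:                                   # one shard per rank
        q, s = quantize_int8(x)                        # rank side: compress
        part = dequantize(q, s)                        # switch side: expand
        total = part if total is None else total + part
    return total.astype(np.float16)                    # broadcast result

# Example: 8 "GPUs" each contribute a 4096-element FP16 tensor.
rng = np.random.default_rng(0)
shards = [rng.standard_normal(4096).astype(np.float16) for _ in range(8)]
exact = np.sum(np.stack(shards, 0), axis=0, dtype=np.float32)
approx = inq_all_reduce(shards)
print("max abs error:", np.abs(exact - approx.astype(np.float32)).max())
```

The sketch also shows why an accelerator-centric design cannot host this operator: quantize/dequantize is arithmetic on whole blocks, not something decomposable into the memory-semantic load instructions that trigger NVLS in-switch reduction.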


Authors (7)

Aojie Jiang
Kang Zhu
Zhiheng Zhang
Zhengxu Su
Juntao Liu
Yuan Du
Li Du

Citation Format

Jiang, A., Zhu, K., Zhang, Z., Su, Z., Liu, J., Du, Y., & Du, L. (2026). A Switch-Centric In-Network Architecture for Accelerating LLM Inference in Shared-Memory Network. arXiv. https://arxiv.org/abs/2603.28239

Journal Information
Publication Year: 2026
Language: en
Source Database: arXiv
Access: Open Access ✓