DOAJ Open Access 2025

Optimizing BFloat16 Deployment of Tiny Transformers on Ultra-Low Power Extreme Edge SoCs

Alberto Dequino, Luca Bompani, Luca Benini, Francesco Conti

Abstract

Transformers have emerged as the central backbone architecture for modern generative AI. However, most ML applications targeting low-power, low-cost SoCs (TinyML apps) do not employ Transformers, as these models are thought to be challenging to quantize and deploy on small devices. This work proposes a methodology to reduce Transformer dimensions with an extensive pruning search. We exploit the intrinsic redundancy of these models to fit them on resource-constrained devices with a well-controlled accuracy tradeoff. We then propose an optimized library to deploy the reduced models using BFloat16, with no accuracy loss, on Commercial Off-The-Shelf (COTS) RISC-V multi-core microcontrollers, enabling the execution of these models at the extreme edge without the need for complex and accuracy-critical quantization schemes. Our solution achieves up to 220× speedup with respect to a naïve C port of the Multi-Head Self-Attention PyTorch kernel: we reduced the memory footprints of MobileBERT and TinyViT by up to ∼94% and ∼57%, respectively, and we deployed a tinyLLAMA SLM on a microcontroller, achieving a throughput of 1219 tokens/s at an average power of just 57 mW.
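The 220× speedup quoted above is measured against a naïve C port of the Multi-Head Self-Attention kernel, with tensors kept in BFloat16 rather than quantized to integers. As an illustration only (the helper names `f32_to_bf16`, `bf16_to_f32`, and `naive_attention` are hypothetical and not taken from the paper's library), a minimal single-head sketch of such a baseline might look as follows:

```c
#include <stdint.h>
#include <string.h>
#include <math.h>

/* Hypothetical BFloat16 container: 1 sign, 8 exponent, 7 mantissa bits. */
typedef uint16_t bf16_t;

/* float32 -> bfloat16: keep the upper 16 bits, round-to-nearest-even
   on the dropped half (NaN handling omitted in this sketch). */
static bf16_t f32_to_bf16(float x) {
    uint32_t bits;
    memcpy(&bits, &x, sizeof(bits));
    uint32_t rounding = 0x7FFFu + ((bits >> 16) & 1u);
    return (bf16_t)((bits + rounding) >> 16);
}

/* bfloat16 -> float32: zero-fill the lower 16 bits. */
static float bf16_to_f32(bf16_t h) {
    uint32_t bits = (uint32_t)h << 16;
    float x;
    memcpy(&x, &bits, sizeof(x));
    return x;
}

/* Naive single-head attention, O = softmax(Q K^T / sqrt(d)) V.
   Q, K, V, O are [n x d] in BFloat16; accumulation is in float32.
   No tiling, no SIMD, no multi-core parallelism: the kind of
   straightforward baseline a speedup figure would be measured against. */
static void naive_attention(const bf16_t *Q, const bf16_t *K, const bf16_t *V,
                            bf16_t *O, int n, int d,
                            float *scores /* scratch buffer of n floats */) {
    const float scale = 1.0f / sqrtf((float)d);
    for (int i = 0; i < n; i++) {
        /* scores[j] = (Q[i] . K[j]) * scale, tracking the row maximum */
        float max_s = -INFINITY;
        for (int j = 0; j < n; j++) {
            float acc = 0.0f;
            for (int k = 0; k < d; k++)
                acc += bf16_to_f32(Q[i * d + k]) * bf16_to_f32(K[j * d + k]);
            scores[j] = acc * scale;
            if (scores[j] > max_s) max_s = scores[j];
        }
        /* numerically stable softmax over the row of scores */
        float sum = 0.0f;
        for (int j = 0; j < n; j++) {
            scores[j] = expf(scores[j] - max_s);
            sum += scores[j];
        }
        /* O[i] = softmax(scores) . V, stored back as BFloat16 */
        for (int k = 0; k < d; k++) {
            float acc = 0.0f;
            for (int j = 0; j < n; j++)
                acc += (scores[j] / sum) * bf16_to_f32(V[j * d + k]);
            O[i * d + k] = f32_to_bf16(acc);
        }
    }
}
```

The BFloat16 format keeps the full 8-bit exponent of float32, which is why such deployments can avoid the calibration and accuracy tuning that integer quantization schemes typically require.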

Authors (4)

Alberto Dequino

Luca Bompani

Luca Benini

Francesco Conti

Citation Format

Dequino, A., Bompani, L., Benini, L., & Conti, F. (2025). Optimizing BFloat16 Deployment of Tiny Transformers on Ultra-Low Power Extreme Edge SoCs. Journal of Low Power Electronics and Applications, 15(1), 8. https://doi.org/10.3390/jlpea15010008

Journal Information
Publication Year: 2025
Source Database: DOAJ
DOI: 10.3390/jlpea15010008
Access: Open Access ✓