arXiv Open Access 2026

Dynamic sparsity in tree-structured feed-forward layers at scale

Reza Sedghi Robin Schiewer Anand Subramoney David Kappel

Abstract

At typical context lengths, the feed-forward MLP block accounts for a large share of a transformer's compute budget, motivating sparse alternatives to dense MLP blocks. We study sparse, tree-structured feed-forward layers as drop-in replacements for MLP blocks in deep transformer architectures, enabling conditional computation via hard hierarchical routing without a separate router network. We demonstrate for the first time that this form of tree-structured conditional sparsity can be applied to autoregressive language modeling and downstream question answering, including zero- and few-shot settings, and that it scales beyond 1B parameters. Despite activating fewer than 5% of the feed-forward block's units per token, our models match dense baselines under controlled training and fine-tuning protocols. We further analyze training dynamics and identify an emergent auto-pruning effect: the interaction of hard routing with asymmetric nonlinearities progressively deactivates unused paths, partially converting dynamic routing into static structural sparsity. We show that simple architectural choices can modulate this behavior and recover balanced trees without auxiliary losses. Overall, our work demonstrates that tree-structured feed-forward layers provide a scalable and controllable mechanism for sparsifying large transformer models.
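
To make the routing idea in the abstract concrete, here is a minimal sketch of one way a tree-structured feed-forward layer with hard hierarchical routing could look. It is not the authors' implementation. The class name `TreeFeedForward`, the heap-ordered binary tree, the use of each unit's own pre-activation sign as the branching decision, and the choice of GELU as the asymmetric nonlinearity are all assumptions made for illustration; the abstract only states that routing is hard, hierarchical, and needs no separate router network.

```python
import math
import random


def gelu(a: float) -> float:
    """Exact GELU, used here as an example of an asymmetric nonlinearity."""
    return 0.5 * a * (1.0 + math.erf(a / math.sqrt(2.0)))


class TreeFeedForward:
    """Hypothetical binary-tree feed-forward layer with hard routing.

    Every tree node is one hidden unit with an input and an output weight
    vector. A token visits exactly one root-to-leaf path, so only
    depth + 1 of the 2**(depth + 1) - 1 units are evaluated.
    """

    def __init__(self, d_model: int, depth: int, seed: int = 0) -> None:
        rng = random.Random(seed)
        self.d_model = d_model
        self.depth = depth
        # Complete binary tree stored in heap order: children of n are 2n+1, 2n+2.
        self.n_nodes = 2 ** (depth + 1) - 1
        scale = 1.0 / math.sqrt(d_model)
        self.w_in = [[rng.gauss(0.0, scale) for _ in range(d_model)]
                     for _ in range(self.n_nodes)]
        self.w_out = [[rng.gauss(0.0, scale) for _ in range(d_model)]
                      for _ in range(self.n_nodes)]

    def forward(self, x):
        """Return (output vector, list of visited node indices) for one token."""
        out = [0.0] * self.d_model
        path = []
        node = 0
        for _ in range(self.depth + 1):
            # Pre-activation of the current unit.
            a = sum(w * xi for w, xi in zip(self.w_in[node], x))
            h = gelu(a)
            # Accumulate this unit's contribution to the layer output.
            for i, w in enumerate(self.w_out[node]):
                out[i] += h * w
            path.append(node)
            # Assumed routing rule: the sign of the pre-activation picks the
            # child, so no separate router network is needed.
            node = 2 * node + (1 if a <= 0.0 else 2)
        return out, path


layer = TreeFeedForward(d_model=16, depth=7)
token = [random.Random(1).gauss(0.0, 1.0) for _ in range(16)]
y, path = layer.forward(token)
active_fraction = len(path) / layer.n_nodes
print(path, f"{active_fraction:.3f}")
```

With the assumed depth of 7, a token evaluates 8 of 255 units, about 3.1%, which is in the range the abstract describes as fewer than 5% of units per token. Under this sketch, a subtree that no token is routed into is never evaluated and so receives no gradient, which is one plausible reading of how dynamic routing could harden into static structural sparsity.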

Topics & Keywords

cs.CL cs.AI cs.LG

Citation Format

Sedghi, R., Schiewer, R., Subramoney, A., & Kappel, D. (2026). Dynamic sparsity in tree-structured feed-forward layers at scale. https://arxiv.org/abs/2604.08565

Journal Information

Publication Year: 2026
Language: en
Source Database: arXiv
Access: Open Access ✓