
On Layer Normalization in the Transformer Architecture

Ruibin Xiong, Yunchang Yang, Di He, Kai Zheng, Shuxin Zheng, +5 more

Abstract

The Transformer is widely used in natural language processing tasks. To train a Transformer, however, one usually needs a carefully designed learning rate warm-up stage, which is shown to be crucial to the final performance but slows down optimization and brings more hyper-parameter tuning. In this paper, we first study theoretically why the learning rate warm-up stage is essential and show that the location of layer normalization matters. Specifically, we prove with mean field theory that at initialization, for the originally designed Post-LN Transformer, which places the layer normalization between the residual blocks, the expected gradients of the parameters near the output layer are large. Therefore, using a large learning rate on those gradients makes the training unstable. The warm-up stage is practically helpful for avoiding this problem. On the other hand, our theory also shows that if the layer normalization is put inside the residual blocks (recently proposed as the Pre-LN Transformer), the gradients are well-behaved at initialization. This motivates us to remove the warm-up stage for the training of Pre-LN Transformers. We show in our experiments that Pre-LN Transformers without the warm-up stage can reach comparable results with baselines while requiring significantly less training time and hyper-parameter tuning on a wide range of applications.
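To make the architectural distinction in the abstract concrete, the following is a minimal PyTorch sketch (not the authors' code) of the two residual-block layouts being compared; the names PostLNBlock, PreLNBlock, and sublayer are illustrative, with sublayer standing in for either the self-attention or the feed-forward sublayer.

```python
import torch.nn as nn

class PostLNBlock(nn.Module):
    """Post-LN (original Transformer): LayerNorm is applied after the residual add."""
    def __init__(self, d_model, sublayer):
        super().__init__()
        self.sublayer = sublayer
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x):
        # residual connection first, then normalization
        return self.norm(x + self.sublayer(x))

class PreLNBlock(nn.Module):
    """Pre-LN: LayerNorm is moved inside the residual branch, before the sublayer."""
    def __init__(self, d_model, sublayer):
        super().__init__()
        self.sublayer = sublayer
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x):
        # normalize first, then apply the sublayer and add the residual
        return x + self.sublayer(self.norm(x))
```

The paper's claim is that stacking layers of the Pre-LN form keeps gradients well-behaved at initialization, which is why the learning rate warm-up stage can be dropped for that variant.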

Authors (10)

Ruibin Xiong
Yunchang Yang
Di He
Kai Zheng
Shuxin Zheng
Chen Xing
Huishuai Zhang
Yanyan Lan
Liwei Wang
Tie-Yan Liu

Citation Format

Xiong, R., Yang, Y., He, D., Zheng, K., Zheng, S., Xing, C. et al. (2020). On Layer Normalization in the Transformer Architecture. https://www.semanticscholar.org/paper/b45d656ac8cc2e940609580cf291ee76ffcac20a

Journal Information
Publication Year: 2020
Language: English (en)
Total Citations: 1,348
Source Database: Semantic Scholar
Access: Open Access ✓