
Collectively encoding protein properties enriches protein language models

Jingmin An, Xiaogang Weng

Abstract

Natural language processing models pre-trained on a large natural language corpus can naturally transfer learned knowledge to the protein domain by fine-tuning on specific in-domain tasks. However, few studies have focused on enriching such protein language models by jointly learning protein properties from strongly correlated protein tasks. Here we carefully designed a multi-task learning (MTL) architecture, aiming to decipher implicit structural and evolutionary information from three sequence-level classification tasks: protein family, superfamily and fold. Given the contextual relevance shared between human language and protein sequences, we employed BERT, pre-trained on a large natural language corpus, as our backbone for handling protein sequences. More importantly, the knowledge encoded in the MTL stage transfers well to the more fine-grained downstream tasks of TAPE. Experiments on structure- and evolution-related applications demonstrate that our approach outperforms many state-of-the-art Transformer-based protein models, especially in remote homology detection.
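The abstract describes a shared BERT encoder trained jointly on three sequence-level classification tasks. As a rough illustration only, below is a minimal PyTorch sketch of such a multi-task setup, assuming a Hugging Face BertModel backbone; the class counts, head design, and equal loss weighting are hypothetical assumptions, not details taken from the paper.

```python
# Minimal sketch of the MTL idea from the abstract: one shared BERT encoder
# with three sequence-level classification heads (family, superfamily, fold).
# All sizes below are illustrative assumptions, not values from the paper.
import torch.nn as nn
from transformers import BertModel

class MultiTaskProteinBert(nn.Module):
    def __init__(self, backbone="bert-base-uncased",
                 n_family=1000, n_superfamily=500, n_fold=200):  # hypothetical class counts
        super().__init__()
        # Shared encoder, pre-trained on a natural language corpus.
        self.encoder = BertModel.from_pretrained(backbone)
        hidden = self.encoder.config.hidden_size
        # One linear classification head per task, all sharing the encoder.
        self.heads = nn.ModuleDict({
            "family": nn.Linear(hidden, n_family),
            "superfamily": nn.Linear(hidden, n_superfamily),
            "fold": nn.Linear(hidden, n_fold),
        })

    def forward(self, input_ids, attention_mask):
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        pooled = out.pooler_output  # sequence-level ([CLS]) representation
        return {name: head(pooled) for name, head in self.heads.items()}

def mtl_loss(logits, labels, weights=None):
    """Sum of per-task cross-entropy losses; equal weighting is an assumption."""
    weights = weights or {name: 1.0 for name in logits}
    ce = nn.CrossEntropyLoss()
    return sum(weights[n] * ce(logits[n], labels[n]) for n in logits)
```

In this reading, after joint training on the three classification tasks, the shared encoder weights would be carried over to the fine-grained TAPE downstream tasks, which is where the paper reports its gains.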

Authors (2)

Jingmin An

Xiaogang Weng

Citation Format

An, J., & Weng, X. (2022). Collectively encoding protein properties enriches protein language models. BMC Bioinformatics. https://doi.org/10.1186/s12859-022-05031-z

Journal Information
Publication Year
2022
Source Database
DOAJ
DOI
10.1186/s12859-022-05031-z
Access
Open Access ✓