arXiv Open Access 2025

From Sentences to Sequences: Rethinking Languages in Biological System

Ke Liu, Shuaike Shen, Hao Chen

Abstract

The paradigm of large language models in natural language processing (NLP) has also shown promise in modeling biological languages, including proteins, RNA, and DNA. Both the auto-regressive generation paradigm and evaluation metrics have been transferred from NLP to biological sequence modeling. However, the intrinsic structural correlations in natural and biological languages differ fundamentally. Therefore, we revisit the notion of language in biological systems to better understand how NLP successes can be effectively translated to biological domains. By treating the 3D structure of biomolecules as the semantic content of a sentence and accounting for the strong correlations between residues or bases, we highlight the importance of structural evaluation and demonstrate the applicability of the auto-regressive paradigm in biological language modeling. Code can be found at https://github.com/zjuKeLiu/RiFold.

Authors (3)

Ke Liu

Shuaike Shen

Hao Chen

Citation Format

Liu, K., Shen, S., Chen, H. (2025). From Sentences to Sequences: Rethinking Languages in Biological System. https://arxiv.org/abs/2507.00953

Journal Information
Publication Year
2025
Language
en
Source Database
arXiv
Access
Open Access ✓