Emergence of Biological Structural Discovery in General-Purpose Language Models
Abstract
Large language models (LLMs) are evolving into engines for scientific discovery, yet the assumption that biological understanding requires domain-specific pre-training remains unchallenged. Here, we report that general-purpose LLMs possess an emergent capability for biological structural discovery. First, we demonstrate that a small-scale GPT-2, fine-tuned solely on English paraphrasing, achieves ~84% zero-shot accuracy in protein homology detection, where network-based interpretability confirms a deep structural isomorphism between human language and the language of life. Scaling to massive models (e.g., Qwen-3) reveals a phase transition, achieving near-perfect accuracy (~100%) on standard tasks while maintaining 75% precision on specially constructed remote homology datasets. Chain-of-Thought interpretability reveals that these models transcend simple sequence alignment, leveraging implicit structural knowledge to perform reasoning akin to "mental folding." We formalize this cross-modal universality through the BioPAWS benchmark. Our work establishes a minimalist paradigm for AI for Science, proving that abstract logical structures distilled from human language constitute a powerful cognitive prior for decoding the complex syntax of biology.
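The core framing described above — treating a pair of protein sequences like a paraphrase pair and thresholding a pairwise score — can be sketched as follows. This is a minimal illustration only: the paper's actual scorer is a GPT-2 fine-tuned on English paraphrasing, and here a trivial k-mer Jaccard similarity stands in for that model. All function names, sequences, and the threshold value are hypothetical.

```python
# Hypothetical sketch of paraphrase-style homology detection.
# A k-mer Jaccard similarity is a stand-in for the fine-tuned
# model's paraphrase score; it is NOT the paper's method.

def kmer_set(seq: str, k: int = 3) -> set[str]:
    """Decompose a protein sequence into overlapping k-mers ('words')."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def pair_score(seq_a: str, seq_b: str, k: int = 3) -> float:
    """Stand-in for a paraphrase probability: Jaccard overlap of k-mers."""
    a, b = kmer_set(seq_a, k), kmer_set(seq_b, k)
    return len(a & b) / len(a | b) if a | b else 0.0

def is_homolog(seq_a: str, seq_b: str, threshold: float = 0.3) -> bool:
    """Zero-shot-style decision: score the pair, then apply a threshold."""
    return pair_score(seq_a, seq_b) >= threshold

if __name__ == "__main__":
    # Toy sequences: a near-identical pair vs. an unrelated pair.
    print(is_homolog("MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ",
                     "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVA"))
    print(is_homolog("MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ",
                     "GSSGSSGAAAPPPPPPGGGGSGSGSG"))
```

The design point the sketch preserves is the task shape, not the model: both a paraphrase classifier and this toy scorer consume a sequence pair and emit a single same/different judgment, which is what allows a paraphrase-tuned LLM to be evaluated zero-shot on homolog pairs.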
Author (1)
Liang Wang
Quick Access
PDF not directly available
Check the original source →
- Publication Year
- 2026
- Language
- en
- Source Database
- CrossRef
- DOI
- 10.21203/rs.3.rs-8507849/v1
- Access
- Open Access ✓