arXiv Open Access 2025

Gene42: Long-Range Genomic Foundation Model With Dense Attention

Kirill Vishniakov, Boulbaba Ben Amor, Engin Tekin, Nancy A. ElNaker, Karthik Viswanathan, and 10 others

Abstract

We introduce Gene42, a novel family of Genomic Foundation Models (GFMs) designed to handle context lengths of up to 192,000 base pairs (bp) at single-nucleotide resolution. Gene42 models use a decoder-only (LLaMA-style) architecture with a dense self-attention mechanism. Initially trained on fixed-length sequences of 4,096 bp, our models underwent continuous pretraining to extend the context length to 192,000 bp. This iterative extension allows comprehensive processing of large-scale genomic data and the capture of intricate patterns and dependencies within the human genome. Gene42 is the first dense attention model capable of handling such long context lengths in genomics, challenging state-space models that often rely on convolutional operators among other mechanisms. Our pretrained models exhibit notably low perplexity values and high reconstruction accuracy, highlighting their strong ability to model genomic data. Extensive experiments on various genomic benchmarks demonstrate state-of-the-art performance across multiple tasks, including biotype classification, regulatory region identification, chromatin profiling prediction, variant pathogenicity prediction, and species classification. The models are publicly available at huggingface.co/inceptionai.
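For readers unfamiliar with two terms in the abstract, the following is a minimal sketch of what "single-nucleotide resolution" and "perplexity" usually mean for an autoregressive sequence model. It is not the Gene42 tokenizer or evaluation code: the vocabulary `VOCAB` and the functions `tokenize` and `perplexity` are hypothetical names chosen for illustration, and the probabilities are made-up inputs rather than model outputs.

```python
import math

# Hypothetical vocabulary (an assumption, not the Gene42 vocabulary):
# one token per nucleotide, plus N for unknown or ambiguous bases.
VOCAB = {"A": 0, "C": 1, "G": 2, "T": 3, "N": 4}


def tokenize(sequence):
    """Map a DNA string to one token id per base (single-nucleotide resolution)."""
    return [VOCAB.get(base, VOCAB["N"]) for base in sequence.upper()]


def perplexity(target_probs):
    """Return exp of the mean negative log-probability of the true next tokens.

    target_probs holds, for each position, the probability a model assigned
    to the base that actually occurred there.
    """
    if not target_probs:
        raise ValueError("need at least one probability")
    mean_nll = -sum(math.log(p) for p in target_probs) / len(target_probs)
    return math.exp(mean_nll)


# A sequence of L bases becomes L tokens, so a 192,000 bp window is 192,000 tokens.
tokens = tokenize("ACGTNacgt")

# A model that guesses uniformly over A, C, G, T has perplexity of about 4;
# lower values mean the model predicts the next base better than chance.
uniform_ppl = perplexity([0.25] * 8)
```

Under this definition, a perplexity close to 1 would mean the model assigns nearly all probability mass to the correct next base, which is the sense in which the abstract describes low perplexity as evidence of modeling ability.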

Authors (15)

Kirill Vishniakov
Boulbaba Ben Amor
Engin Tekin
Nancy A. ElNaker
Karthik Viswanathan
Aleksandr Medvedev
Aahan Singh
Maryam Nadeem
Mohammad Amaan Sayeed
Praveenkumar Kanithi
Tiago Magalhaes
Natalia Vassilieva
Dwarikanath Mahapatra
Marco Pimentel
Shadab Khan

Citation Format

Vishniakov, K., Amor, B.B., Tekin, E., ElNaker, N.A., Viswanathan, K., Medvedev, A. et al. (2025). Gene42: Long-Range Genomic Foundation Model With Dense Attention. https://arxiv.org/abs/2503.16565

Journal Information
Publication Year: 2025
Language: en
Source Database: arXiv
Access: Open Access