arXiv Open Access 2024

MassSpecGym: A benchmark for the discovery and identification of molecules

Roman Bushuiev, Anton Bushuiev, Niek F. de Jonge, Adamo Young, Fleming Kretschmer, and 25 others

Abstract

The discovery and identification of molecules in biological and environmental samples is crucial for advancing biomedical and chemical sciences. Tandem mass spectrometry (MS/MS) is the leading technique for high-throughput elucidation of molecular structures. However, decoding a molecular structure from its mass spectrum is exceptionally challenging, even when performed by human experts. As a result, the vast majority of acquired MS/MS spectra remain uninterpreted, thereby limiting our understanding of the underlying (bio)chemical processes. Despite decades of progress in machine learning applications for predicting molecular structures from MS/MS spectra, the development of new methods is severely hindered by the lack of standard datasets and evaluation protocols. To address this problem, we propose MassSpecGym -- the first comprehensive benchmark for the discovery and identification of molecules from MS/MS data. Our benchmark comprises the largest publicly available collection of high-quality labeled MS/MS spectra and defines three MS/MS annotation challenges: de novo molecular structure generation, molecule retrieval, and spectrum simulation. It includes new evaluation metrics and a generalization-demanding data split, therefore standardizing the MS/MS annotation tasks and rendering the problem accessible to the broad machine learning community. MassSpecGym is publicly available at https://github.com/pluskal-lab/MassSpecGym.

Authors (30)

Roman Bushuiev
Anton Bushuiev
Niek F. de Jonge
Adamo Young
Fleming Kretschmer
Raman Samusevich
Janne Heirman
Fei Wang
Luke Zhang
Kai Dührkop
Marcus Ludwig
Nils A. Haupt
Apurva Kalia
Corinna Brungs
Robin Schmid
Russell Greiner
Bo Wang
David S. Wishart
Li-Ping Liu
Juho Rousu
Wout Bittremieux
Hannes Rost
Tytus D. Mak
Soha Hassoun
Florian Huber
Justin J. J. van der Hooft
Michael A. Stravs
Sebastian Böcker
Josef Sivic
Tomáš Pluskal

Citation Format

Bushuiev, R., Bushuiev, A., de Jonge, N. F., Young, A., Kretschmer, F., Samusevich, R., et al. (2024). MassSpecGym: A benchmark for the discovery and identification of molecules. https://arxiv.org/abs/2410.23326

Journal Information
Publication Year: 2024
Language: en
Source Database: arXiv
Access: Open Access