arXiv Open Access 2024

A Large Encoder-Decoder Family of Foundation Models For Chemical Language

Eduardo Soares, Victor Shirasuna, Emilio Vital Brazil, Renato Cerqueira, Dmitry Zubarev, Kristin Schmidt

Abstract

Large-scale pre-training methodologies for chemical language models represent a breakthrough in cheminformatics. These methods excel in tasks such as property prediction and molecule generation by learning contextualized representations of input tokens through self-supervised learning on large unlabeled corpora. Typically, this involves pre-training on unlabeled data followed by fine-tuning on specific tasks, reducing dependence on annotated datasets and broadening the model's understanding of chemical language representations. This paper introduces a family of large encoder-decoder chemical foundation models pre-trained on a curated dataset of 91 million SMILES samples sourced from PubChem, equivalent to 4 billion molecular tokens. The proposed foundation model supports a range of complex tasks, including quantum property prediction, and offers flexibility with two main variants (289M and $8\times289M$). Our experiments across multiple benchmark datasets validate the capability of the proposed models to provide state-of-the-art results on different tasks. We also provide a preliminary assessment of the compositionality of the embedding space as a prerequisite for reasoning tasks. We demonstrate that the resulting latent space is more separable than that of the state-of-the-art and exhibits few-shot learning capabilities.
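To make the pre-train-then-fine-tune pattern described in the abstract concrete, the sketch below shows the general shape of fine-tuning a SMILES encoder with a small regression head for property prediction. This is not the authors' released code: the toy character-level tokenizer, model dimensions, and training data are assumptions for illustration only, and the self-supervised pre-training step is omitted.

```python
# Illustrative sketch (not the paper's implementation): fine-tuning a SMILES
# encoder with a regression head for molecular property prediction.
import torch
import torch.nn as nn

# Toy character-level SMILES vocabulary; real chemical language models use
# learned sub-word or atom-level vocabularies (hypothetical simplification).
VOCAB = {ch: i + 1 for i, ch in enumerate("CNOPSFIBrcl()=#[]+-123456789")}

def tokenize(smiles: str, max_len: int = 64) -> torch.Tensor:
    """Map a SMILES string to a fixed-length sequence of token ids (0 = pad/unknown)."""
    ids = [VOCAB.get(ch, 0) for ch in smiles][:max_len]
    ids += [0] * (max_len - len(ids))
    return torch.tensor(ids)

class PropertyRegressor(nn.Module):
    """Transformer encoder (stand-in for a pre-trained one) plus a scalar regression head."""
    def __init__(self, vocab_size: int = 64, d_model: int = 128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(d_model, 1)  # predicts one scalar property

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        h = self.encoder(self.embed(token_ids))          # (batch, seq, d_model)
        return self.head(h.mean(dim=1)).squeeze(-1)      # mean-pool, then regress

# Tiny fine-tuning loop on made-up data, just to show the pattern.
model = PropertyRegressor()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
smiles_batch = torch.stack([tokenize("CCO"), tokenize("c1ccccc1")])
targets = torch.tensor([0.12, 0.85])  # hypothetical property values
for _ in range(3):
    loss = nn.functional.mse_loss(model(smiles_batch), targets)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

In practice the encoder weights would come from the self-supervised pre-training stage, and only the head (or the full stack with a small learning rate) would be updated during fine-tuning.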

Authors (6)

Eduardo Soares
Victor Shirasuna
Emilio Vital Brazil
Renato Cerqueira
Dmitry Zubarev
Kristin Schmidt

Citation Format

Soares, E., Shirasuna, V., Brazil, E.V., Cerqueira, R., Zubarev, D., Schmidt, K. (2024). A Large Encoder-Decoder Family of Foundation Models For Chemical Language. https://arxiv.org/abs/2407.20267

Journal Information

Publication Year: 2024
Language: en
Source Database: arXiv
Access: Open Access ✓