arXiv Open Access 2024

A Large Encoder-Decoder Family of Foundation Models For Chemical Language

Eduardo Soares, Victor Shirasuna, Emilio Vital Brazil, Renato Cerqueira, Dmitry Zubarev, Kristin Schmidt

Abstract

Large-scale pre-training methodologies for chemical language models represent a breakthrough in cheminformatics. These methods excel in tasks such as property prediction and molecule generation by learning contextualized representations of input tokens through self-supervised learning on large unlabeled corpora. Typically, this involves pre-training on unlabeled data followed by fine-tuning on specific tasks, reducing dependence on annotated datasets and broadening the model's understanding of chemical language representations. This paper introduces a family of large encoder-decoder chemical foundation models pre-trained on a curated dataset of 91 million SMILES samples sourced from PubChem, equivalent to 4 billion molecular tokens. The proposed foundation model supports a range of complex tasks, including quantum property prediction, and offers flexibility with two main variants (289M and $8\times289M$). Our experiments across multiple benchmark datasets validate the capability of the proposed models to provide state-of-the-art results on different tasks. We also provide a preliminary assessment of the compositionality of the embedding space as a prerequisite for reasoning tasks. We demonstrate that the resulting latent space is more separable than that of the state-of-the-art and exhibits few-shot learning capabilities.
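To make the pre-train-then-fine-tune pattern described in the abstract concrete, the sketch below shows the general shape of fine-tuning a SMILES encoder with a small regression head for property prediction. This is not the authors' released code: the toy character-level tokenizer, model dimensions, and training data are assumptions for illustration only, and the self-supervised pre-training step is omitted.

```python
# Illustrative sketch (not the paper's implementation): fine-tuning a SMILES
# encoder with a regression head for molecular property prediction.
import torch
import torch.nn as nn

# Toy character-level SMILES vocabulary; real chemical language models use
# learned sub-word or atom-level vocabularies (hypothetical simplification).
VOCAB = {ch: i + 1 for i, ch in enumerate("CNOPSFIBrcl()=#[]+-123456789")}

def tokenize(smiles: str, max_len: int = 64) -> torch.Tensor:
    """Map a SMILES string to a fixed-length sequence of token ids (0 = pad/unknown)."""
    ids = [VOCAB.get(ch, 0) for ch in smiles][:max_len]
    ids += [0] * (max_len - len(ids))
    return torch.tensor(ids)

class PropertyRegressor(nn.Module):
    """Transformer encoder (stand-in for a pre-trained one) plus a scalar regression head."""
    def __init__(self, vocab_size: int = 64, d_model: int = 128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(d_model, 1)  # predicts one scalar property

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        h = self.encoder(self.embed(token_ids))          # (batch, seq, d_model)
        return self.head(h.mean(dim=1)).squeeze(-1)      # mean-pool, then regress

# Tiny fine-tuning loop on made-up data, just to show the pattern.
model = PropertyRegressor()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
smiles_batch = torch.stack([tokenize("CCO"), tokenize("c1ccccc1")])
targets = torch.tensor([0.12, 0.85])  # hypothetical property values
for _ in range(3):
    loss = nn.functional.mse_loss(model(smiles_batch), targets)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

In practice the encoder weights would come from the self-supervised pre-training stage, and only the head (or the full stack with a small learning rate) would be updated during fine-tuning.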

Authors (6)

Eduardo Soares
Victor Shirasuna
Emilio Vital Brazil
Renato Cerqueira
Dmitry Zubarev
Kristin Schmidt

Citation Format

Soares, E., Shirasuna, V., Brazil, E.V., Cerqueira, R., Zubarev, D., Schmidt, K. (2024). A Large Encoder-Decoder Family of Foundation Models For Chemical Language. https://arxiv.org/abs/2407.20267

Journal Information

Publication Year: 2024
Language: en
Source Database: arXiv
Access: Open Access ✓