arXiv Open Access 2026

How to make the most of your masked language model for protein engineering

Calvin McCarter, Nick Bhattacharya, Sebastian W. Ober, Hunter Elliott

Abstract

A plethora of protein language models have been released in recent years. Yet comparatively little work has addressed how to best sample from them to optimize desired biological properties. We fill this gap by proposing a flexible, effective sampling method for masked language models (MLMs), and by systematically evaluating models and methods both in silico and in vitro on actual antibody therapeutics campaigns. Firstly, we propose sampling with stochastic beam search, exploiting the fact that MLMs are remarkably efficient at evaluating the pseudo-perplexity of the entire 1-edit neighborhood of a sequence. Reframing generation in terms of entire-sequence evaluation enables flexible guidance with multiple optimization objectives. Secondly, we report results from our extensive in vitro head-to-head evaluation for the antibody engineering setting. This reveals that choice of sampling method is at least as impactful as the model used, motivating future research into this under-explored area.
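The sampling idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the site-independent toy scorer `make_toy_model`, the helper names, the Gumbel-top-k selection rule, and the optional `guide` objective are all assumptions made for illustration. With a real MLM, the per-site log-probabilities would come from the model, and scoring a neighbor by swapping one site's term would be an approximation rather than exact.

```python
import numpy as np

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids
A = len(ALPHABET)

def make_toy_model(length, seed=0):
    """Hypothetical stand-in for an MLM: fixed per-site log-probabilities, shape (length, 20)."""
    rng = np.random.default_rng(seed)
    logits = rng.normal(size=(length, A))
    return logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))  # log-softmax per site

def sequence_score(seq, site_logp):
    """Sum of per-site log-probabilities (a pseudo-log-likelihood under the toy model)."""
    idx = [ALPHABET.index(c) for c in seq]
    return float(site_logp[np.arange(len(seq)), idx].sum())

def neighborhood_scores(seq, site_logp):
    """Score every single substitution at once.

    Entry [i, a] is the score of seq with position i replaced by ALPHABET[a].
    """
    idx = np.array([ALPHABET.index(c) for c in seq])
    current = site_logp[np.arange(len(seq)), idx]
    return current.sum() - current[:, None] + site_logp

def stochastic_beam_search(start, site_logp, beam_width=4, steps=3,
                           temperature=1.0, guide=None, guide_weight=0.0, seed=0):
    """Keep beam_width sequences per step, sampled without replacement via Gumbel-top-k."""
    rng = np.random.default_rng(seed)
    beam = [start]
    for _ in range(steps):
        candidates = {}  # sequence -> score, deduplicated across beam members
        for seq in beam:
            scores = neighborhood_scores(seq, site_logp)
            for i in range(len(seq)):
                for a in range(A):
                    if ALPHABET[a] == seq[i]:
                        continue  # skip the unmutated sequence
                    new = seq[:i] + ALPHABET[a] + seq[i + 1:]
                    score = float(scores[i, a])
                    if guide is not None:
                        # Extra objective evaluated on the whole sequence (assumed interface).
                        score += guide_weight * guide(new)
                    candidates[new] = score
        seqs = list(candidates)
        vals = np.array([candidates[s] for s in seqs])
        # Gumbel-top-k: perturb scaled scores with Gumbel noise, then keep the k largest.
        perturbed = vals / temperature + rng.gumbel(size=len(seqs))
        top = np.argsort(-perturbed)[:beam_width]
        beam = [seqs[j] for j in top]
    return beam

if __name__ == "__main__":
    model = make_toy_model(length=6, seed=1)
    for s in stochastic_beam_search("ACDEFG", model, beam_width=4, steps=3):
        print(s, round(sequence_score(s, model), 3))
```

Because each candidate is scored as a whole sequence, an additional objective can be folded in by passing a `guide` function and a `guide_weight`. Lowering `temperature` moves the selection toward deterministic top-k beam search.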

Topics & Keywords

cs.LG q-bio.QM

Citation Format

McCarter, C., Bhattacharya, N., Ober, S. W., & Elliott, H. (2026). How to make the most of your masked language model for protein engineering. arXiv. https://arxiv.org/abs/2603.10302

Journal Information

Publication Year: 2026
Language: en
Source Database: arXiv
Access: Open Access ✓