MMseqs2 enables sensitive protein sequence searching for the analysis of massive data sets
Abstract
Nature Biotechnology, Volume 35, Number 11, November 2017
... performance was to combine the double-match criterion with making k-mers as long as possible, which required finding similar and not just exact k-mers. This effectively bases our decision on up to 2 × 7 = 14 residues instead of just 2 × 3 in BLAST or 12 letters on a size-11 alphabet in DIAMOND. MMseqs2 is parallelized on three levels: time-critical parts are manually vectorized, queries can be distributed to multiple cores, and the target database can be split into chunks distributed to multiple servers. Because MMseqs2 needs no random memory access in its innermost loop, its runtime scales almost inversely with the number of cores used (Supplementary Fig. 2). MMseqs2 requires 13.4 GB plus 7 bytes per amino acid to store the database in memory, ...
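The memory figure quoted in the excerpt (13.4 GB plus 7 bytes per amino acid) can be turned into a rough estimate. The sketch below is illustrative only: the function and constant names are hypothetical, 1 GB is assumed to mean 10^9 bytes, and because the source sentence is cut off, any further conditions it may have stated are not reflected here.

```python
# Rough in-memory footprint estimate based on the figures quoted in the excerpt.
# Assumptions (not from the source): 1 GB = 10**9 bytes; names are hypothetical.

BASE_BYTES = 13.4e9        # fixed part quoted in the excerpt (13.4 GB)
BYTES_PER_RESIDUE = 7      # per-amino-acid part quoted in the excerpt


def estimate_memory_gb(total_residues: int) -> float:
    """Return the estimated memory in GB for a database with the given residue count."""
    if total_residues < 0:
        raise ValueError("total_residues must be non-negative")
    return (BASE_BYTES + BYTES_PER_RESIDUE * total_residues) / 1e9


if __name__ == "__main__":
    # Example: a hypothetical database of one billion residues.
    print(f"{estimate_memory_gb(1_000_000_000):.1f} GB")
```

Under these assumptions the per-residue term stays small next to the fixed 13.4 GB until the database reaches roughly two billion residues, at which point the two parts are about equal.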
Topics & Keywords
Authors (2)
Martin Steinegger
J. Söding
Quick Access
- Publication Year: 2017
- Language: en
- Total Citations: 2382
- Source Database: Semantic Scholar
- DOI: 10.1038/nbt.3988
- Access: Open Access ✓