Benchmarking web-based in-silico toxicity prediction tools using gold-standard datasets across five key endpoints
Abstract
In silico toxicity prediction tools have become indispensable in early drug development for assessing safety risks. However, their reported predictive performance is rarely evaluated against independent experimental datasets. In this study, we systematically benchmark four widely used free, web-based toxicity predictors (ProTox, pkCSM, ADMETLab, and vNN-ADMET) across five experimentally validated endpoints: hepatotoxicity, cardiotoxicity (hERG inhibition), nephrotoxicity, blood–brain barrier (BBB) permeability, and mutagenicity (AMES). The benchmark uses gold-standard datasets, including DILIrank, hERG Central, DIRIL, B3DB, and the ISSTOX Chemical Toxicity database. Tool-reported endpoint-specific performance metrics were first analyzed and then compared against externally benchmarked predictions generated on independent compound sets. Model performance was evaluated using accuracy, precision, recall, F1 score, specificity, and Matthews Correlation Coefficient (MCC). Our results reveal pronounced discrepancies between tool-reported and benchmarked performance across multiple endpoints, indicating limited generalizability of several models beyond their original training and validation domains. While ProTox and vNN-ADMET demonstrated strong reported performance across endpoints, only mutagenicity predictions remained consistently robust under benchmarking conditions (F1 > 0.89; MCC > 0.80). ProTox achieved the highest benchmarked performance for hepatotoxicity (F1 = 0.92; MCC = 0.84), whereas ADMETLab showed balanced, recall-driven performance for nephrotoxicity and cardiotoxicity but exhibited reduced specificity for hepatotoxicity, suggesting overprediction. In contrast, BBB permeability and nephrotoxicity emerged as the most challenging endpoints, with substantial performance degradation relative to reported metrics and consistently low MCC values across tools.
Overall, no single tool demonstrated uniform reliability across all toxicity endpoints when evaluated on independent datasets. These findings underscore the limitations of relying solely on tool-reported performance and highlight the necessity of endpoint-aware benchmarking in in-silico toxicity screening. This study provides actionable guidance for tool selection in early drug discovery and supports future development of ensemble and applicability-domain-aware models to improve predictive robustness and translational relevance.
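The six evaluation metrics named in the abstract can all be derived from a binary confusion matrix. The following is a minimal sketch, not the authors' evaluation code: the function name `binary_metrics`, the 1 = toxic / 0 = non-toxic label encoding, and the convention of returning 0.0 when a denominator is zero are all assumptions made for illustration, and the labels in the usage example are made up rather than taken from any of the benchmark datasets.

```python
import math


def binary_metrics(y_true, y_pred):
    """Compute the six abstract-listed metrics for binary labels.

    Assumes labels are encoded as 1 (toxic/positive) and 0 (non-toxic/negative).
    Hypothetical helper for illustration; undefined ratios are reported as 0.0.
    """
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")

    # Confusion-matrix counts.
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

    def safe_div(num, den):
        # Assumed convention: a ratio with a zero denominator is reported as 0.0.
        return num / den if den else 0.0

    precision = safe_div(tp, tp + fp)
    recall = safe_div(tp, tp + fn)  # also called sensitivity
    specificity = safe_div(tn, tn + fp)
    mcc_denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))

    return {
        "accuracy": safe_div(tp + tn, tp + tn + fp + fn),
        "precision": precision,
        "recall": recall,
        "f1": safe_div(2 * precision * recall, precision + recall),
        "specificity": specificity,
        "mcc": safe_div(tp * tn - fp * fn, mcc_denominator),
    }


# Illustrative labels only (not from any dataset in the study).
example_true = [1, 1, 1, 0, 0, 0, 1, 0]
example_pred = [1, 1, 0, 0, 0, 1, 1, 0]
example_metrics = binary_metrics(example_true, example_pred)
```

Unlike accuracy or F1, MCC uses all four cells of the confusion matrix, so a tool that overpredicts toxicity can keep high recall while its specificity and MCC drop. This is consistent with the abstract reporting MCC alongside F1 for each endpoint.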
Authors (7)
Anirudh R. Urs
Varshini Ganesan Selvi
Ananya Sreekumar
Ananya Sudarsan
Prutha V. Murthy
Manjunatha Reddy A H
Sumathra Manokaran
Quick Access
- Publication Year: 2026
- Source Database: DOAJ
- DOI: 10.1007/s42452-026-08308-7
- Access: Open Access