DOAJ Open Access 2025

Advancing preeclampsia prediction: a tailored machine learning pipeline integrating resampling and ensemble models for handling imbalanced medical data

Yinyao Ma Hanlin Lv Yanhua Ma Xiao Wang Longting Lv +2 lainnya

Abstrak

Abstract Background Constructing a predictive model is challenging in imbalanced medical dataset (such as preeclampsia), particularly when employing ensemble machine learning algorithms. Objective This study aims to develop a robust pipeline that enhances the predictive performance of ensemble machine learning models for the early prediction of preeclampsia in an imbalanced dataset. Methods Our research establishes a comprehensive pipeline optimized for early preeclampsia prediction in imbalanced medical datasets. We gathered electronic health records from pregnant women at the People’s Hospital of Guangxi from 2015 to 2020, with additional external validation using three public datasets. This extensive data collection facilitated the systematic assessment of various resampling techniques, varied minority-to-majority ratios, and ensemble machine learning algorithms through a structured evaluation process. We analyzed 4,608 combinations of model settings against performance metrics such as G-mean, MCC, AP, and AUC to determine the most effective configurations. Advanced statistical analyses including OLS regression, ANOVA, and Kruskal-Wallis tests were utilized to fine-tune these settings, enhancing model performance and robustness for clinical application. Results Our analysis confirmed the significant impact of systematic sequential optimization of variables on the predictive performance of our models. The most effective configuration utilized the Inverse Weighted Gaussian Mixture Model for resampling, combined with Gradient Boosting Decision Trees algorithm, and an optimized minority-to-majority ratio of 0.09, achieving a Geometric Mean of 0.6694 (95% confidence interval: 0.5855–0.7557). This configuration significantly outperformed the baseline across all evaluated metrics, demonstrating substantial improvements in model performance. Conclusions This study establishes a robust pipeline that significantly enhances the predictive performance of models for preeclampsia within imbalanced datasets. Our findings underscore the importance of a strategic approach to variable optimization in medical diagnostics, offering potential for broad application in various medical contexts where class imbalance is a concern.

Penulis (7)

Y

Yinyao Ma

H

Hanlin Lv

Y

Yanhua Ma

X

Xiao Wang

L

Longting Lv

X

Xuxia Liang

L

Lei Wang

Format Sitasi

Ma, Y., Lv, H., Ma, Y., Wang, X., Lv, L., Liang, X. et al. (2025). Advancing preeclampsia prediction: a tailored machine learning pipeline integrating resampling and ensemble models for handling imbalanced medical data. https://doi.org/10.1186/s13040-025-00440-1

Akses Cepat

PDF tidak tersedia langsung

Cek di sumber asli →
Lihat di Sumber doi.org/10.1186/s13040-025-00440-1
Informasi Jurnal
Tahun Terbit
2025
Sumber Database
DOAJ
DOI
10.1186/s13040-025-00440-1
Akses
Open Access ✓