Introduction: Atrial fibrillation (AF) is closely associated with metabolic dysfunction. The uric acid–to–albumin ratio (UAR), integrating oxidative stress, inflammation, and nutritional status, reflects cardiometabolic burden, but evidence linking UAR to AF prognosis remains limited.
Methods: We analyzed clinical data from 1,908 AF patients at West China Hospital, with external validation from the MIMIC database (n=1,366). Associations were assessed using Kaplan–Meier analyses, restricted cubic splines, and multivariable Cox proportional hazards models. Incremental prognostic value beyond the CHA2DS2-VASc score was evaluated in both cohorts. Exploratory machine learning and SHAP analyses were employed to assess the variable importance of UAR. Subgroup and sensitivity analyses were performed in the primary cohort, including additional cardiometabolic adjustment, analyses of cardiac mortality, competing risk models, and longer follow-up.
Results: Baseline characteristics differed across UAR quartiles, with high UAR associated with substantial burdens of metabolic comorbidities, heart failure, renal dysfunction, and elevated inflammatory and cardiac biomarkers. Mortality was higher in the highest UAR quartile (log-rank P<0.001). In the primary cohort, restricted cubic splines showed a J-shaped association between UAR and 1-year mortality (P for nonlinearity <0.001). In fully adjusted Cox models, UAR (per SD) predicted 1-year all-cause mortality in the primary cohort (HR 1.162, 95% CI 1.036–1.304) and in the MIMIC cohort (HR 1.137, 95% CI 1.092–1.185). Adding UAR to the CHA2DS2-VASc score improved discrimination (C-index 0.654 to 0.692; P = 0.001), reclassification (continuous NRI 0.178), calibration, and clinical net benefit, with consistent incremental performance in the MIMIC cohort. In both cohorts, SHAP analysis consistently identified UAR as one of the major contributors to mortality prediction. Findings were consistent across subgroups and sensitivity analyses.
Conclusion: UAR is an independent predictor of mortality in AF and captures cardiometabolic remodeling beyond conventional risk assessment. As a readily available biomarker, UAR may facilitate metabolically guided risk stratification and individualized management in AF populations.
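The survival comparison behind the log-rank result above rests on the Kaplan–Meier product-limit estimator. As a minimal illustrative sketch (pure Python, with toy follow-up times rather than the study's cohort data), the estimator can be written as:

```python
def kaplan_meier(times, events):
    """Product-limit survival estimate.

    times  : follow-up time per subject
    events : 1 = death observed, 0 = censored
    Returns a list of (event_time, survival_probability) steps.
    """
    data = sorted(zip(times, events))
    survival = 1.0
    curve = []
    for t in sorted({t for t, e in data if e}):
        at_risk = sum(1 for ti, _ in data if ti >= t)   # still under observation
        deaths = sum(1 for ti, e in data if ti == t and e)
        survival *= 1.0 - deaths / at_risk
        curve.append((t, survival))
    return curve

# Toy example: five subjects, three deaths, two censored
print(kaplan_meier([1, 2, 2, 3, 4], [1, 1, 0, 1, 0]))
```

Stratifying subjects by UAR quartile and running this estimator per stratum yields the curves that a log-rank test then compares.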
Diseases of the endocrine glands. Clinical endocrinology
Background: The incidence of ulcerative colitis (UC) is rapidly increasing worldwide, but existing therapeutics are limited. Neutrophil extracellular traps (NETs), which have been associated with the development of various autoimmune diseases, may serve as a novel therapeutic target for UC treatment.
Methods: Bioinformatics analysis was performed to investigate UC-related datasets downloaded from the GEO database, including GSE87466, GSE75214, and GSE206285. Differentially expressed genes (DEGs) related to NETs in UC patients and healthy controls were identified using the limma R package and WGCNA, followed by functional enrichment analysis. To identify potential diagnostic biomarkers, we applied the Least Absolute Shrinkage and Selection Operator (LASSO), the Support Vector Machine-Recursive Feature Elimination (SVM-RFE) model, and the Random Forest (RF) algorithm, and constructed Receiver Operating Characteristic (ROC) curves to evaluate accuracy. Additionally, immune infiltration analysis was conducted to identify immune cells potentially involved in the regulation of NETs. Finally, the expression of core genes in patients was validated using quantitative real-time PCR (qRT-PCR), and potential therapeutic drugs for UC were explored through drug target databases.
Results: Differential analysis of transcriptomic sequencing data from UC samples identified 29 DEGs related to NETs. Enrichment analysis showed that these genes primarily mediate UC-related damage through biological functions such as leukocyte activation, migration, immune receptor activity, and the IL-17 signaling pathway. The three machine learning algorithms identified core NETs-related genes in UC (IL1B, MMP9, and DYSF). According to ROC analysis, all three demonstrated excellent diagnostic efficacy. Immune infiltration analysis revealed that the expression of these core genes was closely associated with neutrophil infiltration and CD4+ memory T cell activation, and negatively associated with M2 macrophage infiltration. qRT-PCR showed that the core genes were significantly overexpressed in UC patients. Gevokizumab, canakinumab, and carboxylated glucosamine were predicted as potential therapeutic drugs for UC.
Conclusion: By combining three machine learning algorithms with bioinformatics, this research identified three hub genes that could serve as novel targets for the diagnosis and treatment of UC, which may provide valuable insights into the mechanism of NETs in UC and potential related therapies.
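The three-algorithm selection scheme described above (LASSO, SVM-RFE, and RF importance, intersected to obtain hub genes) can be sketched with scikit-learn on synthetic data; the matrix, parameter values, and the number of features kept per method are placeholders, not the study's settings:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic stand-in for an expression matrix: 200 samples x 20 "genes"
X, y = make_classification(n_samples=200, n_features=20, n_informative=5,
                           random_state=0)

k = 5  # candidate genes kept by each method (illustrative choice)

# 1) LASSO: L1-penalised logistic regression, rank genes by |coefficient|
lasso = LogisticRegression(penalty="l1", solver="liblinear", C=0.5,
                           random_state=0).fit(X, y)
lasso_set = set(np.argsort(-np.abs(lasso.coef_[0]))[:k])

# 2) SVM-RFE: recursively eliminate features with a linear SVM
rfe = RFE(SVC(kernel="linear"), n_features_to_select=k).fit(X, y)
rfe_set = set(np.flatnonzero(rfe.support_))

# 3) Random Forest: rank genes by impurity-based importance
rf = RandomForestClassifier(n_estimators=300, random_state=0).fit(X, y)
rf_set = set(np.argsort(-rf.feature_importances_)[:k])

# Consensus "hub genes" = intersection of the three selections
consensus = lasso_set & rfe_set & rf_set
print("consensus features:", sorted(consensus))

# ROC evaluation of a classifier restricted to the selected features
cols = sorted(consensus) or sorted(rf_set)   # fall back if intersection empty
Xtr, Xte, ytr, yte = train_test_split(X[:, cols], y, random_state=0)
clf = RandomForestClassifier(random_state=0).fit(Xtr, ytr)
auc = roc_auc_score(yte, clf.predict_proba(Xte)[:, 1])
print(f"AUC = {auc:.3f}")
```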
Ali Zaenal Abidin, I Ketut Agung Enriko, Aloysius Adya Pramudita
Energy audits play a pivotal role in improving energy efficiency and reducing carbon emissions in office buildings. However, conventional audits often suffer from fragmented insights, a lack of system-level monitoring, difficulty establishing an energy baseline, and insufficient incorporation of occupant behavior. To address these challenges, this study conducts a systematic literature review of recent applications of Internet of Things (IoT), machine learning (ML), and digital twin (DT) technologies in the energy audit domain. The review, guided by the PRISMA methodology, analyzes eleven selected studies published between 2022 and 2024, revealing that while ML dominates in predictive modeling, IoT and DT remain underutilized in delivering integrated efficiency recommendations. The analysis identifies three key engineering gaps: limited use of occupant behavior data, absence of continuous energy baseline modeling, and lack of systems capable of generating real-time efficiency recommendations. In response, this paper proposes a novel AIoT-based energy audit framework that combines real-time monitoring via IoT with ML-driven analytics and optimization, optionally supported by DT-based simulation. The proposed framework aims to enable continuous, system-level audits aligned with the ISO 50000 family of standards, offering practical pathways for building managers to diagnose inefficiencies and implement energy-saving actions. Validating the framework in real-world office environments, expanding the input variables, and developing integration strategies with building automation systems remain important future work toward intelligent and scalable energy audit solutions.
Abstract The use of limestone powder as a partial replacement for cement in concrete has gained significant attention due to its potential to enhance compressive strength and promote sustainability. This study investigates the mechanical behavior of limestone-modified concrete, focusing on strength development over various curing periods. Advanced machine learning techniques—Gradient Boosting (GB) and K-Nearest Neighbors (KNN)—are employed to optimize mix proportions and accurately predict compressive strength. The GB model achieved a high predictive accuracy with an R² value of 0.98, effectively capturing the complex nonlinear relationships between cement content, limestone dosage, and curing time. Meanwhile, the KNN model demonstrated strong performance with an R² of 0.965 by leveraging pattern similarities in experimental data. Both regression models align closely with experimental results, validating limestone’s positive impact on long-term concrete performance. This data-driven approach enhances mix design decisions, ensuring structural reliability and sustainability while reducing cement usage and its associated environmental footprint.
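A hedged sketch of the GB/KNN modeling strategy described above: the data-generating "strength law" below is invented purely for illustration, not the paper's experimental dataset, but the fitting and R² evaluation mirror the described workflow:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(0)
n = 400

# Hypothetical mix variables: cement (kg/m3), limestone replacement (%),
# curing age (days). The response is a made-up strength law, not real data.
cement = rng.uniform(250, 450, n)
limestone = rng.uniform(0, 30, n)
age = rng.uniform(3, 90, n)
strength = (0.08 * cement + 8.0 * np.log1p(age)
            - 0.02 * (limestone - 12) ** 2 + rng.normal(0, 1.5, n))

X = np.column_stack([cement, limestone, age])
Xtr, Xte, ytr, yte = train_test_split(X, strength, random_state=0)

gb = GradientBoostingRegressor(random_state=0).fit(Xtr, ytr)
knn = KNeighborsRegressor(n_neighbors=5).fit(Xtr, ytr)

r2_gb = r2_score(yte, gb.predict(Xte))
r2_knn = r2_score(yte, knn.predict(Xte))
print(f"GB R2 = {r2_gb:.3f}, KNN R2 = {r2_knn:.3f}")
```

GB fits the nonlinear cement/limestone/age interactions directly, while KNN leans on local pattern similarity, which is the same division of labor the abstract reports.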
Jorge da Paixão Marques Filho, Antônio José Teixeira Guerra, Carla Bernadete Madureira Cruz
et al.
Soil erosion is a global issue, with gully erosion recognized as one of the most important forms of land degradation. The purpose of this study is to compare the outcomes of four machine learning models, Classification and Regression Trees (CART), eXtreme Gradient Boosting (XGBoost), Random Forest (RF), and Support Vector Machine (SVM), used for mapping susceptibility to gully erosion. The controlling factors of gully erosion in the Piraí Drainage Basin, Paraíba do Sul Middle Valley, were analysed by image interpretation in Google Earth, and gully erosion samples (n = 159) were used for modelling and spatial prediction. The XGBoost and RF models achieved identical results for the area under the receiver operating characteristic curve (AUROC = 88.50%), followed by the SVM and CART models, respectively (AUROC = 86.17%; AUROC = 85.11%). In all models analysed, lineaments, land use and cover, slope, elevation, and rainfall predominated among the most important controlling factors, highlighting the need to understand the landscape. The XGBoost model, which produced fewer false negatives in spatial prediction, was considered more appropriate than the Random Forest model. Notably, the XGBoost model made it possible to test the study area's susceptibility hypothesis, identifying 9.47% of the Piraí Drainage Basin as susceptible to gully erosion. Furthermore, the methodology is replicable and rapidly applicable at different scales.
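The model-comparison step above (AUROC plus attention to false negatives) can be sketched with scikit-learn; here GradientBoostingClassifier stands in for XGBoost to keep the example dependency-free, and the terrain-factor table is synthetic:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.metrics import confusion_matrix, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the controlling-factor table (slope, elevation, ...)
X, y = make_classification(n_samples=600, n_features=8, n_informative=5,
                           weights=[0.7, 0.3], random_state=42)
Xtr, Xte, ytr, yte = train_test_split(X, y, stratify=y, random_state=42)

models = {
    "GB (XGBoost stand-in)": GradientBoostingClassifier(random_state=42),
    "RF": RandomForestClassifier(random_state=42),
}
results = {}
for name, model in models.items():
    model.fit(Xtr, ytr)
    proba = model.predict_proba(Xte)[:, 1]
    # False negatives: truly susceptible cells (y=1) predicted non-susceptible
    fn = confusion_matrix(yte, (proba >= 0.5).astype(int))[1, 0]
    results[name] = (roc_auc_score(yte, proba), fn)

for name, (auc, fn) in results.items():
    print(f"{name}: AUROC={auc:.3f}, false negatives={fn}")
```

As in the study, when AUROC ties, the false-negative count becomes the tie-breaker, since missing a susceptible area is the costlier error.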
Reshma Devi, Keith T. Butler, Gopalakrishnan Sai Gautam
Abstract A pathway to overcome limited data availability in materials science is to use the framework of transfer learning, where a pre-trained (PT) machine learning model (on a larger dataset) can be fine-tuned (FT) on a target (smaller) dataset. We systematically explore the effectiveness of various PT/FT strategies to learn and predict material properties and create generalizable models by PT on multiple properties (MPT) simultaneously. Specifically, we leverage graph neural networks (GNNs) to PT/FT on seven diverse curated materials datasets, with sizes ranging from 941 to 132,752. Besides identifying optimal PT/FT strategies and hyperparameters, we find our pair-wise PT-FT models to consistently outperform models trained from scratch on target datasets. Importantly, our MPT models outperform pair-wise models on several datasets and, more significantly, on a 2D material band gap dataset that is completely out-of-domain. Finally, we expect our PT/FT and MPT frameworks to accelerate materials design and discovery for various applications.
Materials of engineering and construction. Mechanics of materials, Computer software
Rachel L. Fitzjerrells, Nicholas J. Ollberding, Ashutosh K. Mangalam
The microbiome, a complex micro-ecosystem, helps the host with various vital physiological processes. Alterations of the microbiome (dysbiosis) have been linked with several diseases, and generally, differential abundance testing between the healthy and patient groups is performed to identify important bacteria. However, providing a singular species of bacteria to an individual as treatment has not been as successful as fecal microbiota transplant therapy, where the entire microbiome of a healthy individual is transferred. These observations suggest that a combination of bacteria might be crucial for the beneficial effects. Here we provide the framework to utilize topic modeling, an unsupervised machine learning approach, to identify a community of bacteria related to health or disease. Specifically, we used our previously published gut microbiome data of patients with multiple sclerosis (MS), a neurodegenerative disease linked to a dysbiotic gut microbiome. We identified communities of bacteria associated with MS, including genera previously discovered, but also others that would have been overlooked by differential abundance testing. This method can be a useful tool for analyzing the microbiome, and it should be considered along with the commonly utilized differential abundance tests to better understand the role of the gut microbiome in health and disease.
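Topic modeling of taxa counts is commonly done with latent Dirichlet allocation; the abstract does not specify its exact implementation, so the following is a hedged scikit-learn sketch on a synthetic count table with two planted "communities" of bacteria:

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

rng = np.random.default_rng(0)

# Synthetic taxa-count table: 60 samples x 12 taxa, drawn from two
# hypothetical bacterial "communities" (topics) plus background noise.
community_a = np.array([30, 25, 20, 1, 1, 1, 1, 1, 1, 1, 1, 1])
community_b = np.array([1, 1, 1, 30, 25, 20, 1, 1, 1, 1, 1, 1])
counts = np.vstack([
    rng.poisson(community_a, size=(30, 12)),
    rng.poisson(community_b, size=(30, 12)),
])

lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(counts)

# Per-sample topic mixtures (rows sum to 1) and per-topic taxa loadings
theta = lda.transform(counts)
loadings = lda.components_ / lda.components_.sum(axis=1, keepdims=True)
top_taxa = np.argsort(-loadings, axis=1)[:, :3]
print("top taxa per topic:", top_taxa)
```

Comparing the per-sample topic mixtures between patient and control groups is what surfaces disease-associated communities, rather than testing each taxon's abundance in isolation.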
Richard Boadu Antwi, Michael Kimollo, Samuel Yaw Takyi
et al.
Advancements in computer vision are rapidly revolutionizing the way traffic agencies gather roadway geometry data, leading to significant savings in both time and money. Utilizing aerial and satellite imagery for data collection proves to be more cost-effective, more accurate, and safer compared to traditional field observations, considering factors such as equipment cost, crew safety, and data collection efficiency. Consequently, there is a pressing need to develop more efficient methodologies for promptly, safely, and economically acquiring roadway geometry data. While image processing has previously been regarded as a time-consuming and error-prone approach for capturing these data, recent developments in computing power and image recognition techniques have opened up new avenues for accurately detecting and mapping various roadway features from a wide range of imagery data sources. This research introduces a novel approach combining image processing with a YOLO-based methodology to detect turning lane pavement markings from high-resolution aerial images, specifically focusing on Florida’s public roadways. Upon comparison with ground truth data from Leon County, Florida, the developed model achieved an average accuracy of 87% at a 25% confidence threshold for detected features. Implementation of the model in Leon County identified approximately 3026 left turn, 1210 right turn, and 200 center lane features automatically. This methodology holds paramount significance for transportation agencies in facilitating tasks such as identifying deteriorated markings, comparing turning lane positions with other roadway features like crosswalks, and analyzing intersection-related accidents. The extracted roadway geometry data can also be seamlessly integrated with crash and traffic data, providing crucial insights for policymakers and road users.
Summary: Tropical cyclone (TC) intensity change forecasting remains challenging due to the limited understanding of the interactions between TC changes and environmental parameters, and the high uncertainties resulting from climate change. This study proposed hybrid convolutional neural networks (hybrid-CNN), which effectively combine satellite-based spatial characteristics and numerical prediction model outputs, to forecast TC intensity with lead times of 24, 48, and 72 h. The models were validated against best track data by TC category and phase and compared with Korea Meteorological Administration (KMA)-based TC forecasts. The hybrid-CNN-based forecasts outperformed KMA-based forecasts, exhibiting up to 22%, 110%, and 7% improvement in skill scores for the 24-, 48-, and 72-h forecasts, respectively. For rapid intensification cases, the models exhibited improvements of 62%, 87%, and 50% over KMA-based forecasts for the three lead times. Moreover, explainable deep learning demonstrated the hybrid-CNN's potential for predicting TC intensity and contributing to the TC forecasting field.
In this work, a high-throughput computation (HTC) and machine learning (ML) combined method was applied to identify the determining factors of the detonation velocity (vd) and detonation pressure (pd) of energetic molecules and to screen potential high-energy molecules with acceptable stability in a high-throughput way. The HTC was performed on 1725 sample molecules extracted from a dataset of over 10<sup>6</sup> linear nitroaliphatics with 1- to 6-membered C backbones and three types of substituents, namely the single nitro group (-NO2), nitroamine (-NNO2), and nitrate ester (-ONO2). ML models were established based on the HTC results to screen high-energy molecules and to identify the determining factors of vd and pd. Compared with quantum chemistry calculation results, the absolute relative errors of vd and pd obtained using the ML models were less than 3.63% and 5%, respectively. Furthermore, eight molecules with high energy and acceptable stability were selected as potential candidates. This study shows the high efficiency of combining HTC and ML for high-throughput screening.
Janani Venkatraman Jagatha, Christoph Schneider, Tobias Sauter
Machine learning (ML) methods are widely used in particulate matter prediction modelling, especially with air quality sensor data. Despite their advantages, these methods' black-box nature obscures how a prediction is made. Major issues with these types of models include data quality and computational intensity. In this study, we employed feature selection methods using recursive feature elimination and global sensitivity analysis for a random-forest (RF)-based land-use regression model developed for the city of Berlin, Germany. Land-use-based predictors, including local climate zones, leaf area index, daily traffic volume, population density, building types, building heights, and street types, were used to create a baseline RF model. Five additional models, three using the recursive feature elimination method and two using a Sobol-based global sensitivity analysis (GSA), were implemented, and their performance was compared against that of the baseline RF model. The predictors that had a large effect on the prediction, as determined by both methods, are discussed. Through feature elimination, the number of predictors was reduced from 220 in the baseline model to eight in the parsimonious models without sacrificing model performance. A comparison of model metrics showed that the GSA-based parsimonious model (GSA_parsimonious) performs better than the baseline model, reducing the mean absolute error (MAE) from 8.69 µg/m<sup>3</sup> to 3.6 µg/m<sup>3</sup> and the root mean squared error (RMSE) from 9.86 µg/m<sup>3</sup> to 4.23 µg/m<sup>3</sup> when the trained model is applied to reference station data. The better performance of the GSA_parsimonious model is made possible by curtailing the uncertainties propagated through the model via the removal of multicollinear and redundant predictors.
The parsimonious model validated against reference stations was able to predict the PM<sub>2.5</sub> concentrations with an MAE of less than 5 µg/m<sup>3</sup> for 10 out of 12 locations. The GSA_parsimonious model performed best on all model metrics and improved the R<sup>2</sup> from 3% in the baseline model to 17%. However, the predictions exhibited a degree of uncertainty, making the model unreliable for regional-scale modelling. The GSA_parsimonious model can nevertheless be adapted to local scales to highlight the land-use parameters that are indicative of PM<sub>2.5</sub> concentrations in Berlin. Overall, population density, leaf area index, and traffic volume are the major predictors of PM<sub>2.5</sub>, while building type and local climate zones are less significant. Feature selection based on sensitivity analysis has a large impact on model performance. Optimising models through sensitivity analysis can enhance the interpretability of the model dynamics and potentially reduce computational costs and time when modelling larger areas.
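The predictor-reduction step (many predictors pared down to eight via recursive feature elimination around an RF model) can be sketched as follows; the dataset, feature counts, and hyperparameters are illustrative stand-ins, and the Sobol-based GSA variant would swap the elimination criterion for Sobol sensitivity indices:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import RFE
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 40 land-use predictors, only 6 of them informative
X, y = make_regression(n_samples=300, n_features=40, n_informative=6,
                       noise=5.0, random_state=1)
Xtr, Xte, ytr, yte = train_test_split(X, y, random_state=1)

baseline = RandomForestRegressor(n_estimators=200, random_state=1).fit(Xtr, ytr)
mae_baseline = mean_absolute_error(yte, baseline.predict(Xte))

# Recursive feature elimination down to 8 predictors, as in the study
rfe = RFE(RandomForestRegressor(n_estimators=100, random_state=1),
          n_features_to_select=8, step=4).fit(Xtr, ytr)
kept = np.flatnonzero(rfe.support_)

parsimonious = RandomForestRegressor(n_estimators=200, random_state=1)
parsimonious.fit(Xtr[:, kept], ytr)
mae_parsimonious = mean_absolute_error(yte, parsimonious.predict(Xte[:, kept]))

print(f"baseline MAE = {mae_baseline:.2f} with 40 predictors")
print(f"parsimonious MAE = {mae_parsimonious:.2f} with {kept.size} predictors")
```

Dropping redundant, multicollinear predictors is what lets the parsimonious model match or beat the baseline while being far cheaper to evaluate.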
Michele Miranda, Elena Sofia Ruzzetti, Andrea Santilli
et al.
Large Language Models (LLMs) represent a significant advancement in artificial intelligence, finding applications across various domains. However, their reliance on massive internet-sourced datasets for training brings notable privacy issues, which are exacerbated in critical domains (e.g., healthcare). Moreover, certain application-specific scenarios may require fine-tuning these models on private data. This survey critically examines the privacy threats associated with LLMs, emphasizing the potential for these models to memorize and inadvertently reveal sensitive information. We explore current threats by reviewing privacy attacks on LLMs and propose comprehensive solutions for integrating privacy mechanisms throughout the entire learning pipeline. These solutions range from anonymizing training datasets to implementing differential privacy during training or inference and machine unlearning after training. Our comprehensive review of existing literature highlights ongoing challenges, available tools, and future directions for preserving privacy in LLMs. This work aims to guide the development of more secure and trustworthy AI systems by providing a thorough understanding of privacy preservation methods and their effectiveness in mitigating risks.
The task of predicting long-term patient outcomes using supervised machine learning is challenging, in part because of the high variance of each patient's trajectory, which can cause the model to overfit the training data. Temporal difference (TD) learning, a common reinforcement learning technique, may reduce variance by generalising learning to the pattern of state transitions rather than terminal outcomes. However, in healthcare this method requires several strong assumptions about patient states, and there appears to be limited literature evaluating the performance of TD learning against traditional supervised learning methods for long-term health outcome prediction tasks. In this study, we define a framework for applying TD learning to real-time, irregularly sampled time series data using a Semi-Markov Reward Process. We evaluate the model framework on predicting intensive care mortality and show that TD learning under this framework can result in improved model robustness compared to standard supervised learning methods, and that this robustness is maintained even when validated on external datasets. This approach may offer a more reliable method for learning to predict patient outcomes from high-variance, irregular time series data.
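The core idea, a TD(0) update whose discount is raised to the power of the sojourn time between irregular observations, can be sketched in a few lines of Python on a toy deterministic chain; the state space, rewards, and sojourn times are invented for illustration, not taken from the paper's clinical model:

```python
# TD(0) value learning with sojourn-time discounting: for a transition
# s --(tau time units, reward r)--> s', the target is r + gamma**tau * V(s').
def td0_semi_markov(episodes, gamma=0.9, alpha=0.2, n_states=4):
    V = [0.0] * n_states          # state-value table; state 3 is terminal
    for _ in range(episodes):
        # One toy episode: 0 -> 1 -> 2 -> terminal, tau = 2 at each step,
        # reward 1.0 only on reaching the terminal state.
        for s, s_next, tau, r in [(0, 1, 2, 0.0), (1, 2, 2, 0.0), (2, 3, 2, 1.0)]:
            target = r + (gamma ** tau) * V[s_next]
            V[s] += alpha * (target - V[s])
    return V

V = td0_semi_markov(2000)
print([round(v, 4) for v in V])   # V[0] converges to gamma**4 = 0.6561
```

The update bootstraps from the next observed state rather than waiting for the terminal outcome, which is exactly the variance-reduction mechanism the study exploits; the `gamma**tau` term adapts the discount to irregular sampling intervals.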
In this chapter, we will discuss the use of Machine Learning methods for the identification and localization of cometary activity for Solar System objects in ground- and space-based wide-field all-sky surveys. We will begin the chapter by discussing the challenges of identifying known and unknown active, extended Solar System objects in the presence of stellar-type sources, and the application of classical pre-ML identification techniques and their limitations. We will then transition to the discussion of implementing ML techniques to address the challenge of extended object identification. We will finish with prospective future methods and the application to future surveys such as the Vera C. Rubin Observatory.
Ljubomir Buturovic, Michael Mayhew, Roland Luethy
et al.
We applied machine learning to the unmet medical need of rapid and accurate diagnosis and prognosis of acute infections and sepsis in emergency departments. Our solution consists of a Myrna (TM) Instrument and embedded TriVerity (TM) classifiers. The instrument measures the abundances of 29 messenger RNAs in a patient's blood, which are subsequently used as features for machine learning. The classifiers convert the input features into an intuitive test report comprising the separate likelihoods of (1) a bacterial infection, (2) a viral infection, and (3) severity (need for Intensive Care Unit-level care). In internal validation, the system achieved AUROC = 0.83 on the three-class disease diagnosis (bacterial, viral, or non-infected) and AUROC = 0.77 on binary prognosis of disease severity. The Myrna/TriVerity system was granted breakthrough device designation by the United States Food and Drug Administration (FDA). This engineering manuscript describes the standard and novel machine learning methods used to translate an academic research concept into a clinical product aimed at improving patient care, and discusses lessons learned.
Hepatocellular carcinoma (HCC) is a malignant tumor with high recurrence and metastasis rates and poor prognosis. Basement membrane is a ubiquitous extracellular matrix and is a key physical factor in cancer metastasis. Therefore, basement membrane-related genes may be new targets for the diagnosis and treatment of HCC. We systematically analyzed the expression pattern and prognostic value of basement membrane-related genes in HCC using the TCGA-HCC dataset, and constructed a new BMRGI based on WGCNA and machine learning. We used the HCC single-cell RNA-sequencing data in GSE146115 to describe the single-cell map of HCC, analyzed the interaction between different cell types, and explored the expression of model genes in different cell types. BMRGI can accurately predict the prognosis of HCC patients and was validated in the ICGC cohort. In addition, we also explored the underlying molecular mechanisms and tumor immune infiltration in different BMRGI subgroups, and confirmed the differences in response to immunotherapy in different BMRGI subgroups based on the TIDE algorithm. Then, we assessed the sensitivity of HCC patients to common drugs. In conclusion, our study provides a theoretical basis for the selection of immunotherapy and sensitive drugs in HCC patients. Finally, we also considered CTSA as the most critical basement membrane-related gene affecting HCC progression. In vitro experiments showed that the proliferation, migration and invasion abilities of HCC cells were significantly impaired when CTSA was knocked down.
This paper presents a detailed analysis of the linguistic characteristics connected to specific levels of suicide risk, providing insight into the impact of feature extraction techniques on the effectiveness of predictive models of suicide ideation. Prior research has largely focused on detecting suicide ideation from social media posts through feature extraction and machine learning, but rarely on multiclass classification of suicide risks or on how linguistic characteristics affect predictability. To address this gap, this paper proposes a machine learning framework capable of multiclass classification of suicide risks from social media posts, with an extended analysis of the linguistic characteristics that contribute to suicide risk detection. A total of 552 samples of a supervised dataset of Twitter posts were manually annotated for suicide risk modeling. Feature extraction combined term frequency-inverse document frequency (TF-IDF), Part-of-Speech (PoS) tagging, and the valence-aware dictionary for sentiment reasoning (VADER). Data training and modeling were conducted with the Random Forest technique. Testing on 138 samples, with scenarios of detection on real-time data for performance evaluation, yielded 86.23% accuracy, 86.71% precision, and 86.23% recall, an improvement attributable to the combination of feature extraction techniques rather than to the data modeling technique. An extended analysis of linguistic characteristics showed that a sentence's context is the main contributor to suicide risk classification accuracy, while grammatical tags and strong conclusive terms contributed little.
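A minimal scikit-learn sketch of the TF-IDF + Random Forest backbone described above, omitting the PoS and VADER features and using invented toy posts in place of the annotated Twitter dataset:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline

# Invented toy posts with illustrative binary labels (0 = low risk,
# 1 = high risk); the real study used 552 annotated tweets and a
# multiclass risk scale with richer features (PoS tags, VADER scores).
posts = [
    "had a great day with friends", "excited about the new project",
    "looking forward to the weekend", "enjoyed a long walk today",
    "i feel hopeless and want to disappear", "nothing matters anymore to me",
    "i cant see any reason to keep going", "everyone would be better off without me",
]
labels = [0, 0, 0, 0, 1, 1, 1, 1]

# TF-IDF turns each post into a weighted bag-of-words vector; the RF
# then learns which weighted terms separate the risk classes.
model = make_pipeline(TfidfVectorizer(), RandomForestClassifier(random_state=0))
model.fit(posts, labels)

pred = model.predict(["i see no reason to keep going"])
print("predicted class:", pred[0])
```

The study's improvement came from concatenating additional feature families (PoS tag counts, VADER sentiment scores) onto the TF-IDF matrix before fitting, which this sketch leaves out for brevity.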
Although supervised learning has been highly successful in improving the state-of-the-art in image-based computer vision, the margin of improvement has diminished significantly in recent years, indicating that a plateau is in sight. Meanwhile, the use of self-supervised learning (SSL) for natural language processing (NLP) has seen tremendous success during the past couple of years, with this new learning paradigm yielding powerful language models. Inspired by the excellent results obtained in NLP, self-supervised methods that rely on clustering, contrastive learning, distillation, and information maximization, which all fall under the banner of discriminative SSL, have experienced a swift uptake in computer vision. Shortly afterwards, generative SSL frameworks, mostly based on masked image modeling, complemented and surpassed the results obtained with discriminative SSL. Consequently, within a span of three years, over 100 unique general-purpose frameworks for generative and discriminative SSL, with a focus on imaging, were proposed. In this survey, we review a plethora of research efforts on image-oriented SSL, providing a historic view and paying attention to best practices as well as useful software packages. Along the way, we discuss pretext tasks for image-based SSL, as well as techniques commonly used in image-based SSL. Lastly, to aid researchers who aim to contribute to image-focused SSL, we outline a number of promising research directions.
Turbulent flows are of central importance across applications in science and engineering. For design and analysis, scientists and engineers run Computational Fluid Dynamics (CFD) simulations that rely on turbulence models. Turbulence models are limited approximations, introducing epistemic uncertainty into CFD results. Reliable design and analysis therefore require quantification of these uncertainties. The Eigenspace Perturbation Method (EPM) is the preeminent physics-based approach for turbulence model uncertainty quantification (UQ), but it often leads to overly conservative uncertainty bounds. In this study, we use Machine Learning (ML) models to moderate the EPM perturbations and introduce a physics-constrained machine learning framework for turbulence model UQ. We test this framework on multiple problems and show that it leads to improved calibration of the uncertainty estimates.
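The eigenvalue-perturbation step at the heart of EPM can be sketched with numpy: decompose the Reynolds-stress anisotropy tensor, nudge its eigenvalues a fraction `delta` toward a limiting turbulence state (here the one-component limit), and reassemble. In the framework above, an ML model would moderate `delta`; the tensor values here are purely illustrative:

```python
import numpy as np

def perturb_anisotropy(a, delta, target=np.array([2/3, -1/3, -1/3])):
    """Shift the eigenvalues of a symmetric, trace-free anisotropy tensor
    a fraction `delta` toward a limiting state (default: one-component)."""
    eigval, eigvec = np.linalg.eigh(a)
    order = np.argsort(eigval)[::-1]          # descending, to match target
    eigval, eigvec = eigval[order], eigvec[:, order]
    perturbed = (1 - delta) * eigval + delta * target
    # Reassemble in the original eigenvector basis; symmetry and the
    # zero trace are preserved because target itself sums to zero.
    return eigvec @ np.diag(perturbed) @ eigvec.T

# Illustrative anisotropy tensor (symmetric, zero trace)
a = np.array([[0.20, 0.05, 0.00],
              [0.05, -0.10, 0.02],
              [0.00, 0.02, -0.10]])
a_pert = perturb_anisotropy(a, delta=0.3)
print(np.round(a_pert, 4))
```

Sweeping the perturbation toward the limiting states brackets the model-form uncertainty; learning a spatially varying `delta` is what keeps those brackets from being overly conservative.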