Abstract: Data curation is a fundamental yet often underappreciated aspect of cheminformatics and computational drug discovery. Large public and proprietary databases now provide vast amounts of chemical structure, physicochemical, absorption, distribution, metabolism, excretion, and toxicity (ADMET), and bioactivity data. However, these resources contain structural inconsistencies, annotation errors, and heterogeneous experimental conditions that can limit model performance and reproducibility. This narrative review summarizes why and how data should be curated before use in cheminformatics workflows. We frame chemical data curation around two complementary pillars: structural curation and curation of experimental conditions. On the structural side, we review existing standardization and quantitative structure–activity relationship (QSAR)-ready workflows, including handling of salts and mixtures, parent–child policies, aromatization, tautomer handling, stereochemistry validation, and duplicate detection with conflict resolution. On the experimental side, we synthesize evidence that assay protocols, measurement methods, and reporting practices introduce substantial uncertainty and bias in physicochemical and ADMET endpoints as well as bioactivity data, and we outline practical strategies for assembling condition-aware datasets from the literature and public databases. Across case studies, we highlight how curated structure–condition pairs yield more accurate, robust, and reproducible models than raw, unfiltered collections. Rather than introducing a new predictive method or performing a formal statistical meta-analysis, we provide a structured narrative synthesis of current best practices, tools, and decision points for data curation in cheminformatics. This review offers practical, evidence-based guidance on the structural and experimental-condition curation required to build reliable cheminformatics models.
Scientific Contribution: This article does not introduce a new algorithm but provides a practice-oriented, structured synthesis of data curation in cheminformatics. We (i) formulate a two-pillar framework that treats structural curation and experimental-condition curation as equally important components of cheminformatics workflows; (ii) consolidate scattered best practices into concrete workflows, checklists, and decision maps for building “QSAR-ready” and condition-aware datasets; and (iii) integrate endpoint-specific case studies showing that rigorous curation materially improves predictive performance and reproducibility. We also identify open challenges and research directions for scaling and automating curation, including the use of workflow technologies and large language models, and for establishing community standards for condition metadata.
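The duplicate-detection and conflict-resolution step the review names can be sketched in pure Python. This is an illustrative sketch, not a workflow from the article: the structure keys stand in for pre-computed canonical identifiers (e.g. InChIKeys after standardization), and the one-log-unit conflict threshold is an assumed policy, not a recommendation from the source.

```python
from collections import defaultdict
from statistics import median

def deduplicate(records, max_spread=1.0):
    """Group replicate measurements by a pre-computed structure key and
    resolve conflicts.

    records: iterable of (structure_key, value) pairs.
    Returns (curated, conflicts): curated maps key -> median value;
    conflicts lists keys whose replicates disagree by more than
    max_spread (an assumed policy threshold, here in log units).
    """
    groups = defaultdict(list)
    for key, value in records:
        groups[key].append(value)

    curated, conflicts = {}, []
    for key, values in groups.items():
        if max(values) - min(values) > max_spread:
            conflicts.append(key)   # flag for manual review, do not average
        else:
            curated[key] = median(values)
    return curated, conflicts

# Hypothetical replicate pIC50 values for two structures:
records = [
    ("KEY-A", 7.1), ("KEY-A", 7.2), ("KEY-A", 9.0),  # discordant replicates
    ("KEY-B", 5.0), ("KEY-B", 5.1),                  # concordant replicates
]
curated, conflicts = deduplicate(records)
```

Flagging rather than silently averaging discordant replicates is the point: a 1.9-log-unit spread usually signals differing assay conditions, which is exactly what the review's condition-aware curation pillar addresses.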
Abstract: To systematically summarise and analyse published research assessing balance control deficiencies of children and adolescents with intellectual disability (ID) across various balance domains by comparing their performance with that of typically developing (TD) peers. This study followed PRISMA principles and conducted a systematic search in eight electronic databases in May 2025: PubMed, Web of Science Core Collection, Scopus, EBSCO, LILACS, Cochrane Library, PEDro and Embase. The methodological quality of included studies was assessed using the Scottish Intercollegiate Guidelines Network checklist, and the certainty of the evidence was assessed via the GRADE approach. The meta-analysis was performed via RevMan 5.4. Descriptive analysis was performed if the data were insufficient. Fifteen studies were included, eight of which were pooled for meta-analysis. Children and adolescents with ID had poorer balance control than their TD peers. Specific to the various balance domains, individuals with ID presented substantially more difficulties in sensory orientation (eyes open: SMD = 0.89, 95% CI: 0.69 to 1.09, p < 0.001; eyes closed: SMD = 0.44, 95% CI: 0.26 to 0.62, p < 0.001) and limits of stability (SMD = -0.91, 95% CI: -1.09 to -0.73, p < 0.001). They also encountered more hardships in gait stability (SMD = 0.52, 95% CI: 0.27 to 0.78, p < 0.001). For the domain of anticipatory postural adjustments, inconsistent results were obtained. Children and adolescents with ID have poorer balance control than their TD peers. More research is needed to comprehensively assess the various domains of balance in this target group, especially with longitudinal designs.
Abdullah Addas, Muhammad Nasir Khan
et al.
Introduction: The regional disparity in higher education access can only be addressed through strategies for sustainable development and diversification of the economy, as envisioned in Saudi Vision 2030. Currently, 70% of universities are concentrated in the Central and Eastern regions, leaving the Northern and Southern parts of the country with limited opportunities. Methods: The study created a framework with sensors and generative adversarial networks (GANs) that optimizes the distribution of medical universities, supporting equity in access to education and balanced regional development. The research applies an artificial intelligence (AI)-driven framework that combines sensor data with GAN-based models to perform real-time geographic and demographic data analyses on the placement of higher education institutions throughout Saudi Arabia. This framework analyzes multisensory data by examining strategic university placement impacts on regional economies, social mobility, and the environment. Scenario modeling was used to simulate potential outcomes due to changes in university distribution. Results: The findings indicated that areas with a higher density of universities experience up to 20% more job opportunities and higher GDP growth of up to 15%. The GAN-based simulations reveal that redistributing educational institutions to underrepresented regions could decrease environmental impacts by about 30% and enhance access. More specifically, strategic placement in underserved areas is associated with a reduction of approximately 10% in unemployment. Discussion: The research accentuates the need to include AI and sensor technology in developing educational infrastructure. The proposed framework can be used for continuous monitoring and dynamic adaptation of university strategies to align them with evolving economic and environmental objectives.
The study demonstrates the transformative potential of AI-enabled solutions to further equal access to education for sustainable regional development throughout Saudi Arabia.
The Constitution of the Republic of Serbia safeguards the right to privacy through several aspects, as does the European Convention on Human Rights and Fundamental Freedoms. The cornerstone in this area is the Law on Personal Data Protection from 2018. In line with the ultima ratio principle of criminal law, the Criminal Code protects these data when the most severe violations occur, pursuant to the Criminal Procedure Code. However, with the daily expansion of science, technology, and innovative means of communication and recording, this takes on a different, "digital" dimension. Naturally, this trend calls for certain adjustments in regulations, as well as in their interpretation and application. In this paper, the authors aim to highlight key provisions of the aforementioned regulations and their current and future interpretation within the context of digital society, with a particular focus on criminal law aspects. This complexity is further amplified by the development of artificial intelligence, which inherently relies on the use of vast amounts of data. The aim of this paper is to identify some of the critical elements in the protection of privacy rights, particularly those related to personal data most closely linked to individuals, and to raise the question of potential legislative amendments.
Preparing high-quality datasets required by various data-driven AI and machine learning models has become a cornerstone task in data-driven analysis. Conventional data discovery methods typically integrate datasets towards a single pre-defined quality measure that may lead to bias for downstream tasks. This paper introduces MODis, a framework that discovers datasets by optimizing multiple user-defined, model-performance measures. Given a set of data sources and a model, MODis selects and integrates data sources into a skyline dataset, over which the model is expected to have the desired performance in all the performance measures. We formulate MODis as a multi-goal finite state transducer, and derive three feasible algorithms to generate skyline datasets. Our first algorithm adopts a "reduce-from-universal" strategy, that starts with a universal schema and iteratively prunes unpromising data. Our second algorithm further reduces the cost with a bi-directional strategy that interleaves data augmentation and reduction. We also introduce a diversification algorithm to mitigate the bias in skyline datasets. We experimentally verify the efficiency and effectiveness of our skyline data discovery algorithms, and showcase their applications in optimizing data science pipelines.
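The skyline notion MODis builds on is the standard Pareto front over performance measures: a dataset stays on the skyline if no other candidate is at least as good on every measure and strictly better on at least one. A minimal stdlib sketch of that dominance test follows; the candidate names and scores are hypothetical, and MODis itself searches over data-source integrations, which this fragment does not model.

```python
def dominates(a, b, maximize):
    """True if score vector a dominates b: at least as good everywhere,
    strictly better somewhere. maximize[i] says whether measure i is
    better when larger (e.g. accuracy) or smaller (e.g. training time)."""
    at_least = all((x >= y if up else x <= y)
                   for x, y, up in zip(a, b, maximize))
    strictly = any((x > y if up else x < y)
                   for x, y, up in zip(a, b, maximize))
    return at_least and strictly

def skyline(candidates, maximize):
    """Keep every candidate not dominated by any other (the Pareto front)."""
    return {
        name for name, score in candidates.items()
        if not any(dominates(other, score, maximize)
                   for o_name, other in candidates.items() if o_name != name)
    }

# Hypothetical (accuracy, training-seconds) scores for three integrated datasets:
candidates = {"D1": (0.90, 120), "D2": (0.85, 60), "D3": (0.80, 200)}
front = skyline(candidates, maximize=(True, False))
```

Here D3 is dominated by D1 on both measures and is pruned, while D1 and D2 trade accuracy against cost and both survive, which is why a skyline, unlike a single pre-defined quality score, avoids committing to one measure up front.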
Existing large language model (LLM) agents for automating data science show promise, but they remain constrained by narrow task scopes, limited generalization across tasks and models, and over-reliance on state-of-the-art (SOTA) LLMs. We introduce DatawiseAgent, a notebook-centric LLM agent framework for adaptive and robust data science automation. Inspired by how human data scientists work in computational notebooks, DatawiseAgent introduces a unified interaction representation and a multi-stage architecture based on finite-state transducers (FSTs). This design enables flexible long-horizon planning, progressive solution development, and robust recovery from execution failures. Extensive experiments across diverse data science scenarios and models show that DatawiseAgent consistently achieves SOTA performance by surpassing strong baselines such as AutoGen and TaskWeaver, demonstrating superior effectiveness and adaptability. Further evaluations reveal graceful performance degradation under weaker or smaller models, underscoring its robustness and scalability.
Sai Sanjna Chintakunta, Nathalia Nascimento, Everton Guimaraes
In recent years, Large Language Models (LLMs) have emerged as transformative tools across numerous domains, impacting how professionals approach complex analytical tasks. This systematic mapping study comprehensively examines the application of LLMs throughout the Data Science lifecycle. By analyzing relevant papers from Scopus and IEEE databases, we identify and categorize the types of LLMs being applied, the specific stages and tasks of the data science process they address, and the methodological approaches used for their evaluation. Our analysis includes a detailed examination of evaluation metrics employed across studies and systematically documents both positive contributions and limitations of LLMs when applied to data science workflows. This mapping provides researchers and practitioners with a structured understanding of the current landscape, highlighting trends, gaps, and opportunities for future research in this rapidly evolving intersection of LLMs and data science.
Since 2010, Kaggle has been a platform where data scientists from around the world come together to compete, collaborate, and push the boundaries of Data Science. Over these 15 years, it has grown from a purely competition-focused site into a broader ecosystem with forums, notebooks, models, datasets, and more. With the release of the Kaggle Meta Code and Kaggle Meta Datasets, we now have a unique opportunity to explore these competitions, technologies, and real-world applications of Machine Learning and AI. And so in this study, we take a closer look at 15 years of data science on Kaggle - through metadata, shared code, community discussions, and the competitions themselves. We explore Kaggle's growth and its impact on the data science community, uncover hidden technological trends, analyze competition winners, examine how Kagglers approach problems in general, and more. We do this by analyzing millions of kernels and discussion threads to perform both longitudinal trend analysis and standard exploratory data analysis. Our findings show that Kaggle is a steadily growing platform with increasingly diverse use cases, and that Kagglers are quick to adapt to new trends and apply them to real-world challenges, while producing - on average - models with solid generalization capabilities. We also offer a snapshot of the platform as a whole, highlighting its history and technological evolution. Finally, this study is accompanied by a video (https://www.youtube.com/watch?v=YVOV9bIUNrM) and a Kaggle write-up (https://kaggle.com/competitions/meta-kaggle-hackathon/writeups/kaggle-chronicles-15-years-of-competitions-communi) for your convenience.
Background: Maternal and neonatal mortality is a global problem that is highly prevalent in low- and middle-income countries, including Ethiopia. Maternal delay in seeking institutional delivery services utilization plays a significant role in determining maternal and neonatal health outcomes. Although studies have been conducted on institutional delivery service utilization in Ethiopia, little is known about factors for delays in seeking care for institutional delivery services.
Objective: This study aimed to assess the delay in seeking institutional delivery services and associated factors among immediate postpartum mothers in public health facilities in Gondar, northwest Ethiopia.
Methods: A facility-based cross-sectional study was conducted from July 15 to September 10, 2022. A total of 391 participants were selected using systematic random sampling. Data were collected through face-to-face interviews using structured, pretested, and interviewer-administered questionnaires. Data were entered into EpiData version 4.6, and the analysis was conducted using the Statistical Package for the Social Sciences version 26. A multivariable logistic regression model was fitted, and the level of significance was set at p ≤ 0.05.
Result: The prevalence of delay in seeking institutional delivery was 49.10% (95% confidence interval [CI]: 44.13, 54.08). Rural residence (adjusted odds ratio [AOR] = 2.51; 95% CI: 1.43–4.41), no antenatal care visits (AOR = 2.87; 95% CI: 1.34–6.13), unplanned pregnancy (AOR = 2.98; 95% CI: 1.78–5.01), poor decision-making autonomy in maternity care services (AOR = 1.98; 95% CI: 1.15–3.40), and poor birth preparedness plan (AOR = 4.88; 95% CI: 2.79–8.53) were significantly associated with delays in seeking institutional delivery.
Conclusion: Delays in seeking institutional delivery services were high. Promoting women's decision-making power in their own health care, and arranging programs that improve maternal and child health service utilization, would help address these delays.
Gynecology and obstetrics, Public aspects of medicine
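Adjusted odds ratios like those reported above come from multivariable logistic regression, but the underlying arithmetic of an odds ratio and its Wald 95% confidence interval is easiest to see on a crude 2×2 table. The counts below are hypothetical and are not taken from this study.

```python
from math import exp, log, sqrt

def odds_ratio_ci(a, b, c, d, z=1.96):
    """Crude odds ratio and Wald 95% CI from a 2x2 table:
    a/b = exposed with/without the outcome, c/d = unexposed with/without.
    The CI is computed on the log scale, where log(OR) is ~normal."""
    or_ = (a * d) / (b * c)
    se = sqrt(1/a + 1/b + 1/c + 1/d)          # standard error of log(OR)
    lo, hi = exp(log(or_) - z * se), exp(log(or_) + z * se)
    return or_, lo, hi

# Hypothetical counts: 20/80 delayed vs. not among rural mothers,
# 10/90 among urban mothers.
or_, lo, hi = odds_ratio_ci(20, 80, 10, 90)
```

With these invented counts the crude OR is 2.25 but the interval crosses 1, illustrating why an association can be suggestive in raw counts yet only established after adjustment in a multivariable model, as in the study's AORs.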
In the rapidly evolving field of data science, efficiently navigating the expansive body of academic literature is crucial for informed decision-making and innovation. This paper presents an enhanced Retrieval-Augmented Generation (RAG) application, an artificial intelligence (AI)-based system designed to assist data scientists in accessing precise and contextually relevant academic resources. The AI-powered application integrates advanced techniques, including the GeneRation Of BIbliographic Data (GROBID) technique for extracting bibliographic information, fine-tuned embedding models, semantic chunking, and an abstract-first retrieval method, to significantly improve the relevance and accuracy of the retrieved information. This implementation of AI specifically addresses the challenge of academic literature navigation. A comprehensive evaluation using the Retrieval-Augmented Generation Assessment System (RAGAS) framework demonstrates substantial improvements in key metrics, particularly Context Relevance, underscoring the system's effectiveness in reducing information overload and enhancing decision-making processes. Our findings highlight the potential of this enhanced Retrieval-Augmented Generation system to transform academic exploration within data science, ultimately advancing the workflow of research and innovation in the field.
While the field of NL2SQL has made significant advancements in translating natural language instructions into executable SQL scripts for data querying and processing, achieving full automation within the broader data science pipeline - encompassing data querying, analysis, visualization, and reporting - remains a complex challenge. This study introduces SageCopilot, an advanced, industry-grade system that automates the data science pipeline by integrating Large Language Models (LLMs), Autonomous Agents (AutoAgents), and Language User Interfaces (LUIs). Specifically, SageCopilot incorporates a two-phase design: an online component refining users' inputs into executable scripts through In-Context Learning (ICL) and running the scripts for results reporting & visualization, and an offline component preparing the demonstrations requested by ICL in the online phase. A list of trending strategies such as Chain-of-Thought and prompt-tuning have been used to augment SageCopilot for enhanced performance. Through rigorous testing and comparative analysis against prompt-based solutions, SageCopilot has been empirically validated to achieve superior end-to-end performance in generating or executing scripts and offering results with visualization, backed by real-world datasets. Our in-depth ablation studies highlight the individual contributions of various components and strategies used by SageCopilot to the end-to-end correctness for data sciences.
Objective: To systematically evaluate the effect of moxibustion with seed-sized moxa-cone at Jing acupoints on limb spasticity after stroke, and to provide a scientific basis for the clinical application of moxibustion with seed-sized moxa-cone. Methods: Data were searched from the China National Knowledge Infrastructure (CNKI), Wanfang Data, VIP, CBMdisc, China Clinical Registry, PubMed, Embase, Web of Science and Cochrane Library. Randomized controlled trials (RCTs) of moxibustion with seed-sized moxa-cone at Jing acupoints for stroke patients with spastic hemiplegia were included, and the retrieval time was from inception to July 2022. The main outcome measures included the effective rate of spasticity improvement [number of cases (fully recovered + significant effect + some effect)/total number of cases ×100%], modified Ashworth scale (MAS) grading, clinical spasm index (CSI) score, Fugl-Meyer assessment scale (FMA) score, and Barthel index (BI) score. Two researchers independently conducted literature screening according to the inclusion and exclusion criteria, data extraction, and quality evaluation using the Cochrane risk of bias tool. RevMan 5.3 software was used for meta-analysis. Measurement data were expressed by mean difference (MD) and standardized mean difference (SMD) with 95% confidence interval (CI), and enumeration data were expressed by odds ratio (OR); SAS studio software was used to calculate the effect size (logrr) and standard error (selogrr) of ordinal data based on the cumulative odds model. The statistical heterogeneity was assessed by probability value (<italic>P</italic> value) and heterogeneity value (<italic>I</italic><sup>2</sup> value). If <italic>P</italic>≥0.1 and <italic>I</italic><sup>2</sup>≤50%, a fixed effects model would be used.
If <italic>P</italic><0.1 or <italic>I</italic><sup>2</sup>>50%, a random effects model would be used. Results: A total of 10 RCTs with 736 patients were included, with 368 cases in the control group and the experimental group respectively. Meta-analysis results showed that: (1) Spasticity improvement rate: compared with the control group, the effective rate of the experimental group was significantly higher [<italic>OR</italic>=3.96, 95% <italic>CI</italic> (1.25, 12.53), <italic>Z</italic>=2.34, <italic>P</italic>=0.02]. (2) MAS grade: compared with the control group, the grade of spasticity of the experimental group was significantly lower [<italic>RR</italic>=3.03, 95% <italic>CI</italic> (2.08, 4.41), <italic>Z</italic>=5.78, <italic>P</italic><0.000 01]. (3) CSI score: compared with the control group, the CSI score of the experimental group was significantly lower [<italic>MD</italic>= -0.96, 95% <italic>CI</italic> (-1.26, -0.66), <italic>Z</italic>=6.20, <italic>P</italic><0.000 01]. (4) FMA score: compared with the control group, the FMA score of the experimental group was significantly higher [<italic>MD</italic>=15.17, 95% <italic>CI</italic> (7.89, 22.46), <italic>Z</italic>=4.08, <italic>P</italic><0.000 1]. (5) BI score: compared with the control group, the BI score of the experimental group was significantly higher [<italic>MD</italic>=5.95, 95% <italic>CI</italic> (4.38, 7.53), <italic>Z</italic>=7.39, <italic>P</italic><0.000 01]. Conclusion: Moxibustion with seed-sized moxa-cone at Jing acupoints can effectively improve limb spasticity, reduce muscle tone, and improve the motor function of the limbs and quality of life of patients after stroke. However, due to the limitations of the included literature, more studies and larger samples are needed for further validation to provide more reliable clinical evidence.
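The fixed-vs-random decision rule quoted above (fixed effects when P≥0.1 and I²≤50%, random effects otherwise) can be sketched with inverse-variance pooling in pure Python. This is an illustrative sketch, not the RevMan implementation: the effect sizes below are hypothetical, the random-effects branch uses the common DerSimonian-Laird estimator as an assumption, and the heterogeneity P value (which needs a chi-square distribution) is omitted, so only the I² half of the rule is applied.

```python
from math import sqrt

def pool(effects, ses):
    """Inverse-variance pooling with an I^2-based model choice:
    fixed effects if I^2 <= 50%, else DerSimonian-Laird random effects."""
    k = len(effects)
    w = [1 / se**2 for se in ses]                      # fixed-effect weights
    fixed = sum(wi * y for wi, y in zip(w, effects)) / sum(w)
    q = sum(wi * (y - fixed)**2 for wi, y in zip(w, effects))  # Cochran's Q
    i2 = max(0.0, (q - (k - 1)) / q) * 100 if q > 0 else 0.0

    if i2 <= 50:
        return fixed, sqrt(1 / sum(w)), i2, "fixed"

    # Random effects: add between-study variance tau^2 to each study's variance.
    tau2 = max(0.0, (q - (k - 1)) / (sum(w) - sum(wi**2 for wi in w) / sum(w)))
    wr = [1 / (se**2 + tau2) for se in ses]
    pooled = sum(wi * y for wi, y in zip(wr, effects)) / sum(wr)
    return pooled, sqrt(1 / sum(wr)), i2, "random"

# Hypothetical SMDs and standard errors from three homogeneous trials:
est, se, i2, model = pool([0.50, 0.60, 0.55], [0.1, 0.1, 0.1])
```

With these invented, concordant trials Q falls below its degrees of freedom, I² is 0%, and the fixed-effects branch is taken; more dispersed effects would push I² past 50% and inflate the pooled standard error via tau².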
Modern vehicle communication development is a continuous process that requires cutting-edge security systems. Security is a main problem in the Vehicular Ad Hoc Network (VANET), and malicious node detection is one of the critical issues in the VANET environment. Vehicles are attacked by malicious nodes, in particular through distributed denial-of-service (DDoS) attacks. Several solutions have been proposed to overcome the issue, but none address it in a real-time scenario using machine learning. During a DDoS attack, multiple vehicles flood the targeted vehicle, so communication packets are not received and requests go unanswered. In this research, we address the problem of malicious node detection and propose a real-time malicious node detection system using machine learning. We propose a distributed multi-layer classifier and evaluate the results using OMNET++ and SUMO with machine learning classification using GBT, LR, MLPC, RF, and SVM models. A dataset of normal and attacking vehicles is used to apply the proposed model. The simulation results effectively enhance attack classification, with an accuracy of 99%. Under LR and SVM, the system achieved 94% and 97%, respectively. RF and GBT achieved better performance, with accuracy values of 98% and 97%, respectively. Since we adopted Amazon Web Services, the network's performance has improved, because training and testing time do not increase when more nodes are included in the network.
Enterprise data warehouses for research (EDW4R) are a critical component of National Institutes of Health Clinical and Translational Science Award (CTSA) hubs. EDW4R operations have unique needs that require specialized skills and collaborations across multiple domains, which limits the ability to apply existing models of information technology (IT) performance. Because of this uniqueness, we developed a new EDW4R maturity model based on a prior qualitative study of operational practices for supporting EDW4Rs at CTSA hubs. In a pilot study, respondents from fifteen CTSA hubs completed the novel EDW4R maturity index survey by rating 33 maturity statements across 6 categories using a 5-point Likert scale. Of the six categories, respondents rated workforce as most mature (4.17 [3.67–4.42]) and relationship with enterprise IT as the least mature (3.00 [2.80–3.80]). Our pilot of a novel maturity index provides a baseline quantitative measure of EDW4R functions across fifteen CTSA hubs. The maturity index may be useful to faculty and staff currently leading an EDW4R by creating opportunities to explore the index in local context and in comparison to other institutions.
Richard J. Ladle, Fernanda Alves-Martins, Ana C.M. Malhado
et al.
Predicting whether a species is likely to go extinct (or not) is one of the fundamental objectives of conservation biology, and extinction risk classifications have become an essential tool for conservation policy, planning and research. This sort of prediction is feasible because the extinction processes follow a familiar pattern of population decline, range collapse and fragmentation, and, finally, extirpation of sub-populations through a combination of genetic, demographic and environmental stochasticity. Though less well understood and rarely quantified, the way in which science and society respond to population decline, extirpation and species extinction can also have a profound influence, either negative or positive, on whether a species goes extinct. For example, species that are highly sought after by collectors and hobbyists can become more desirable and valuable as they become rarer, leading to increased demand and greater incentives for illegal trade – known as the anthropogenic Allee effect. Conversely, species that are strongly linked to cultural identity are more likely to benefit from sustainable management, high public support for conservation actions and fund-raising, and, by extension, may be partially safeguarded from extinction. More generally, human responses to impending extinctions are extremely complex, are highly dependent on cultural and socioeconomic context, and have typically been far less studied than the ecological and genetic aspects of extinction. Here, we identify and discuss biocultural aspects of extinction and outline how recent advances in our ability to measure and monitor cultural trends with big data are, despite their intrinsic limitations and biases, providing new opportunities for incorporating biocultural factors into extinction risk assessment.
The National Science Foundation-supported CONVERGE facility was established in 2018 as the first social science-led component of the Natural Hazards Engineering Research Infrastructure (NHERI). Headquartered at the Natural Hazards Center at the University of Colorado Boulder, CONVERGE identifies, trains, connects, and funds researchers across disciplines in the hazards and disaster field. This article provides an overview of one of our most widely used tools, the CONVERGE Training Modules. These free, interactive, online trainings are designed for students, early career professionals, and others who are new to hazards and disaster research and practice. Since July 2019, our team has released 10 modules that cover a range of foundational topics in hazards and disaster research, including Institutional Review Board procedures, conducting emotionally challenging research, cultural competence, collecting and sharing perishable data, social vulnerability, and disaster mental health. In addition, CONVERGE offers advanced trainings in specialized topics such as broader ethical considerations for hazards and disaster researchers, reciprocity, gender-based violence in fieldwork, and public health implications of hazards and disaster research. Between July 2019 and November 2022, 6,311 unique users registered for the modules, and these users logged 7,222 module completions. Of the module completions to date, the largest percentage of users completed only one (46.0%) of the available trainings, although a small group of “superusers”—whom we surveyed for this article—have completed all or almost all of the available modules. When asked why they planned to complete the modules at the time of registration, most users indicated that it was to fulfill a classroom or other educational requirement (51.2%), for personal interest/to learn more (9.0%), or to prepare for or to support research (7.1%) or practice-oriented activities (5.8%). 
In addition to providing more information regarding module users, this article details the development of the technology and discusses the impact and success of this tool for transferring knowledge and skills to the hazards and disaster research and practice community. We conclude with a discussion of future directions for this research-based educational intervention.
Engineering (General). Civil engineering (General), City planning
Data science pipelines commonly utilize dataframe and array operations for tasks such as data preprocessing, analysis, and machine learning. The most popular tools for these tasks are pandas and NumPy. However, these tools are limited to executing on a single node, making them unsuitable for processing large-scale data. Several systems have attempted to distribute data science applications to clusters while maintaining interfaces similar to single-node libraries, enabling data scientists to scale their workloads without significant effort. However, existing systems often struggle with processing large datasets due to Out-of-Memory (OOM) problems caused by poor data partitioning. To overcome these challenges, we develop Xorbits, a high-performance, scalable data science framework specifically designed to distribute data science workloads across clusters while retaining familiar APIs. The key differentiator of Xorbits is its ability to dynamically switch between graph construction and graph execution. Xorbits has been successfully deployed in production environments with up to 5k CPU cores. Its applications span various domains, including user behavior analysis and recommendation systems in the e-commerce sector, as well as credit assessment and risk management in the finance industry. Users can easily scale their data science workloads by simply changing the import line of their pandas and NumPy code. Our experiments demonstrate that Xorbits can effectively process very large datasets without encountering OOM or data-skewing problems. Over the fastest state-of-the-art solutions, Xorbits achieves an impressive 2.66× speedup on average. In terms of API coverage, Xorbits attains a compatibility rate of 96.7%, surpassing the fastest framework by a margin of 60 percentage points. Xorbits is available at https://github.com/xorbitsai/xorbits.
Kathleen Gregory, Laura Koesten, Regina Schuster
et al.
Vast amounts of (open) data are increasingly used to make arguments about crisis topics such as climate change and global pandemics. Data visualizations are central to bringing these viewpoints to broader publics. However, visualizations often conceal the many contexts involved in their production, ranging from decisions made in research labs about collecting and sharing data to choices made in editorial rooms about which data stories to tell. In this paper, we examine how data visualizations about climate change and COVID-19 are produced in popular science magazines, using Scientific American, an established English-language popular science magazine, as a case study. To do this, we apply the analytical concept of data journeys (Leonelli, 2020) in a mixed methods study that centers on interviews with Scientific American staff and is supplemented by a visualization analysis of selected charts. In particular, we discuss the affordances of working with open data, the role of collaborative data practices, and how the magazine works to counter misinformation and increase transparency. This work provides an empirical contribution by providing insight into the data (visualization) practices of science communicators and demonstrating how the concept of data journeys can be used as an analytical framework.