Results for "data science"

Showing 20 of ~44,698,278 results · from CrossRef, DOAJ, arXiv, Semantic Scholar

S2 Open Access 2021
clusterProfiler 4.0: A universal enrichment tool for interpreting omics data

Tianzhi Wu, Erqiang Hu, Shuangbing Xu et al.

Summary: Functional enrichment analysis is pivotal for interpreting high-throughput omics data in life science. It is crucial for this type of tool to use the latest annotation databases for as many organisms as possible. To meet these requirements, we present here an updated version of our popular Bioconductor package, clusterProfiler 4.0. This package has been enhanced considerably compared with its original version published 9 years ago. The new version provides a universal interface for functional enrichment analysis in thousands of organisms based on internally supported ontologies and pathways as well as annotation data provided by users or derived from online databases. It also extends the dplyr and ggplot2 packages to offer tidy interfaces for data operation and visualization. Other new features include gene set enrichment analysis and comparison of enrichment results from multiple gene lists. We anticipate that clusterProfiler 4.0 will be applied to a wide range of scenarios across diverse organisms.
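The core statistic behind over-representation analysis of the kind clusterProfiler performs is a hypergeometric tail test: how surprising is the overlap between a study gene list and an annotated category? A minimal illustrative sketch in Python (clusterProfiler itself is an R/Bioconductor package, and the gene counts below are invented toy numbers):

```python
from math import comb

def ora_pvalue(k, n, K, N):
    """Upper-tail hypergeometric p-value: probability of seeing at least
    k category genes among n study genes, when the category covers K of
    the N background genes."""
    denom = comb(N, n)
    return sum(comb(K, i) * comb(N - K, n - i)
               for i in range(k, min(n, K) + 1)) / denom

# Toy example: 40 of 200 study genes land in a 500-gene pathway
# drawn against a 20,000-gene background (expected overlap: ~5).
p = ora_pvalue(40, 200, 500, 20000)
```

An overlap of 40 where roughly 5 are expected by chance yields a vanishingly small p-value, which is what flags the pathway as enriched.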

10,866 citations · en · Computer Science, Medicine
S2 Open Access 2013
Data Science and its Relationship to Big Data and Data-Driven Decision Making

F. Provost, Tom Fawcett

Abstract: Companies have realized they need to hire data scientists, academic institutions are scrambling to put together data-science programs, and publications are touting data science as a hot—even “sexy”—career choice. However, there is confusion about what exactly data science is, and this confusion could lead to disillusionment as the concept diffuses into meaningless buzz. In this article, we argue that there are good reasons why it has been hard to pin down exactly what data science is. One reason is that data science is intricately intertwined with other important concepts also of growing importance, such as big data and data-driven decision making. Another reason is the natural tendency to associate what a practitioner does with the definition of the practitioner's field; this can result in overlooking the fundamentals of the field. We believe that trying to define the boundaries of data science precisely is not of the utmost importance. We can debate the boundaries of the field in an academic setting, but in order for data science to serve business effectively, it is important (i) to understand its relationships to other important related concepts, and (ii) to begin to identify the fundamental principles underlying data science. Once we embrace (ii), we can much better understand and explain exactly what data science has to offer. Furthermore, only once we embrace (ii) should we be comfortable calling it data science. In this article, we present a perspective that addresses all these concepts. We close by offering, as examples, a partial list of fundamental principles underlying data science.

1,380 citations · en · Medicine, Psychology
S2 Open Access 2016
Big Data and Data Science Methods for Management Research

G. George, E. Osinga, Dovev Lavie et al.

The recent advent of remote sensing, mobile technologies, novel transaction systems, and high-performance computing offers opportunities to understand trends, behaviors, and actions in a manner that has not been previously possible. Researchers can thus leverage “big data” that are generated from a plurality of sources including mobile transactions, wearable technologies, social media, ambient networks, and business transactions. An earlier Academy of Management Journal (AMJ) editorial explored the potential implications for data science in management research and highlighted questions for management scholarship as well as the attendant challenges of data sharing and privacy (George, Haas, & Pentland, 2014). This nascent field is evolving rapidly and at a speed that leaves scholars and practitioners alike attempting to make sense of the emergent opportunities that big data hold. With the promise of big data come questions about the analytical value and thus relevance of these data for theory development, including concerns over their context-specific relevance, reliability, and validity. To address this challenge, data science is emerging as an interdisciplinary field that combines statistics, data mining, machine learning, and analytics to understand and explain how we can generate analytical insights and prediction models from structured and unstructured big data. Data science emphasizes the systematic study of the organization, properties, and analysis of data and their role in inference, including our confidence in the inference (Dhar, 2013). Whereas the terms big data and data science are often used interchangeably, “big data” refer to large and varied data that can be collected and managed, whereas “data science” develops models that capture, visualize, and analyze the underlying patterns in the data. In this editorial, we address both the collection and handling of big data and the analytical tools provided by data science for management scholars.
At the current time, practitioners suggest that data science applications tackle the three core elements of big data: volume, velocity, and variety (McAfee & Brynjolfsson, 2012; Zikopoulos & Eaton, 2011). “Volume” represents the sheer size of the dataset due to the aggregation of a large number of variables and an even larger set of observations for each variable. “Velocity” reflects the speed at which these data are collected and analyzed, whether in real time or near real time from sensors, sales transactions, social media posts, and sentiment data for breaking news and social trends. “Variety” in big data comes from the plurality of structured and unstructured data sources such as text, videos, networks, and graphics, among others. The combinations of volume, velocity, and variety reveal the complex task of generating knowledge from big data, which often runs into millions of observations, and deriving theoretical contributions from such data. In this editorial, we provide a primer or a “starter kit” for potential data science applications in management research. We do so with a caveat that emerging fields outdate and improve upon methodologies while often supplanting them with new applications. Nevertheless, this primer can guide management scholars who wish to use data science techniques to reach better answers to existing questions or explore completely new research questions.

348 citations · en · Computer Science
DOAJ Open Access 2026
PERMEPSY: a multicentre, randomized, double-blind proof-of-concept trial of personalized metacognitive training for adults with psychosis — a study protocol

Maria Lamarca et al.

Background: While psychological interventions are effective at improving symptoms of psychosis, accessible, cost- and time-efficient treatments remain limited. Personalized medicine has emerged as a promising approach, tailoring interventions to individual needs. Metacognitive Training (MCT), with its established efficacy and adaptable format, is well-suited for personalization. The PERMEPSY project (Towards a Personalized Medicine Approach to Psychological Treatment for Psychosis) aims to deliver a tailored MCT intervention for individuals with psychosis.
Methods: PERMEPSY is an international study funded by ERAPerMed (JTC2022) involving five clinical partners (Spain, Chile, France, Germany, Poland) and one technological partner (Spain). The project involves a proof-of-concept clinical trial recruiting 51 participants from each center, for a total of 255 adult participants with psychosis, in a prospective study (Registration: NCT06603922, 19-09-2024). The trial will test the efficacy of a Machine Learning (ML)-derived platform at predicting clinical and functional outcomes from baseline scores and compare a personalized MCT (P-MCT) to a classical MCT based on the platform’s predictions.
Aims: PERMEPSY seeks to (1) develop and test the predictive power of an algorithm that could support decision-making, and (2) ascertain whether P-MCT is more effective than MCT at improving key symptoms and cognitive impairments associated with psychosis.
Results: A harmonized retrospective database enabled the development of a predictive ML algorithm, integrated into an innovative platform. This platform provides clinicians with the information needed to deliver P-MCT. Predictions include changes in positive symptoms (e.g., delusions), insight, self-esteem, and treatment adherence.
Discussion: By integrating diverse data types and innovative technology, PERMEPSY addresses the need for personalized, effective treatment in psychosis, aiming to reduce individual and systemic burdens while supporting clinicians in their decision-making.

DOAJ Open Access 2026
The importance and challenge of comparing stroke care, utilization and outcomes in Medicare Advantage and Fee-for-Service Medicare: a narrative review and vision for the future

Emily J. Bian, Priyanka Menon, Kathleen A. McManus et al.

Stroke prevalence is highest in adults ≥65 years, the majority of whom are Medicare beneficiaries. Fee-for-Service Medicare (FFS) incentivizes utilization by paying for each service. Medicare Advantage (MA) uses capitated payments to reduce overutilization. It is not clear whether stroke patients with FFS or MA receive different stroke preventive care and whether those differences are associated with differences in post-acute care utilization, cost, and clinical outcomes. We performed an empirical narrative review of published peer-reviewed studies in the PubMed, EMBASE and Web of Science databases comparing stroke preventive care between FFS and MA using the American Heart Association’s Life’s Essential 8 and American Heart Association/American Stroke Association national guidelines. We added atrial fibrillation (AF), post-acute care utilization and outcomes, including mortality. 7/1356 studies met inclusion criteria. Studies were heterogeneous in their design and settings. There was limited availability of clinical data. Within those limitations, published studies suggest that MA appears to allow for guideline-directed stroke preventive care for hyperlipidemia, smoking cessation and AF in specific study populations. Post-acute care utilization was generally lower in MA. Functional outcome improvements were similar but occurred in fewer days in MA, though the absence of acute stroke treatment data is notable. Mortality data were mixed. Given the importance of stroke in Medicare and the growth in MA enrollment, comparing the effectiveness of MA and FFS warrants further study among appropriately matched MA and FFS beneficiaries with stroke.

Public aspects of medicine
arXiv Open Access 2026
A Late-Fusion Multimodal AI Framework for Privacy-Preserving Deduplication in National Healthcare Data Environments

Mohammed Omer Shakeel Ahmed

Duplicate records pose significant challenges in customer relationship management (CRM) and healthcare, often leading to inaccuracies in analytics, impaired user experiences, and compliance risks. Traditional deduplication methods rely heavily on direct identifiers such as names, emails, or Social Security Numbers (SSNs), making them ineffective under strict privacy regulations like GDPR and HIPAA, where such personally identifiable information (PII) is restricted or masked. In this research, I propose a novel, scalable, multimodal AI framework for detecting duplicates without depending on sensitive information. This system leverages three distinct modalities: semantic embeddings derived from textual fields (names, cities) using pre-trained DistilBERT models, behavioral patterns extracted from user login timestamps, and device metadata encoded through categorical embeddings. These heterogeneous modalities are combined using a late fusion approach and clustered via DBSCAN, an unsupervised density-based algorithm. The proposed model is evaluated against a traditional string-matching baseline on a synthetic CRM dataset specifically designed to reflect privacy-preserving constraints. The multimodal framework demonstrated strong performance, achieving a high F1-score by effectively identifying duplicates despite variations and noise inherent in the data. This approach offers a privacy-compliant solution to entity resolution and supports secure digital infrastructure, enhances the reliability of public health analytics, and promotes ethical AI adoption across government and enterprise settings. It is well-suited for integration into national health data modernization efforts, aligning with broader goals of privacy-first innovation.
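The pipeline the abstract describes (per-modality features, late fusion, density-based clustering) can be sketched with scikit-learn. The random vectors below are stand-ins for the DistilBERT text embeddings, login-pattern features, and device-metadata embeddings, and the `eps` value is an illustrative guess rather than a tuned parameter:

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import normalize

# Toy stand-ins for the three modalities: 5 records, each duplicated
# with small perturbations (rows i and i+5 are the same entity).
rng = np.random.default_rng(0)
base = rng.normal(size=(5, 8))
text = np.vstack([base, base + rng.normal(scale=0.01, size=(5, 8))])
behav = np.vstack([base[:, :4], base[:, :4] + 0.01])
device = np.vstack([base[:, :2], base[:, :2]])

# Late fusion: L2-normalize each modality, then concatenate, so no
# single modality dominates the joint distance.
fused = np.hstack([normalize(m) for m in (text, behav, device)])

# Records sharing a cluster label (!= -1) are duplicate candidates.
labels = DBSCAN(eps=0.3, min_samples=2).fit_predict(fused)
```

Because DBSCAN needs no preset cluster count, it suits deduplication, where the number of duplicate groups is unknown in advance.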

DOAJ Open Access 2025
Revealing gene expression links between milk fat globules and mammary glands in rodents via transcriptomics

Hancai Jiang, Xiaoxian Xu, Xinhui Song et al.

Currently, most studies on lactation-related traits and gene expression rely on invasive techniques to obtain mammary tissue. These methods are not only difficult to perform but also limit the availability of samples. Therefore, this study aimed to utilize whole transcriptome sequencing to investigate the gene expression profiles of Golden hamsters (Gh, n = 5) and Kunming mice (Km, n = 5). It compared the transcriptome expression between milk fat globules (MFG) and the mammary gland (MG), identified candidate genes and pathways associated with lactation traits, and assessed the potential of MFG as an effective alternative to MG. The data showed that a total of 21,360 genes were identified in the Gh group, with 66.5% of the mRNAs showing no differential expression between MG and MFG. In the Km group, a total of 44,248 genes were identified, with non-differentially expressed genes (NDEGs) accounting for 58.8%. Additionally, the majority of ncRNA data consisted of NDEGs. In both groups, approximately 80% of miRNA data were NDEGs. Notably, the proportion of NDEGs in circRNA data approached 100%. Enrichment analysis revealed that NDEGs from both groups were significantly enriched in several pathways, including the MAPK signaling pathway, PI3K-Akt signaling pathway, JAK-STAT signaling pathway, and prolactin signaling pathway, all of which are closely associated with lactation traits and the lactation process. Furthermore, we identified various ncRNAs that regulate the expression of target genes either directly or indirectly, thereby influencing the lactation process. This study validates MFG as a reliable substitute for MG, with potential applications in improving dairy science. By identifying key genes and pathways, it provides new insights for optimizing genetic selection and breeding strategies. It also supports the improvement of dairy animal management practices.

Veterinary medicine
DOAJ Open Access 2025
Computational Adaptive Optics for HAR Hybrid Trench Array Topography Measurement by Utilizing Coherence Scanning Interferometry

Wenyou Qiao, Zhishan Gao, Qun Yuan et al.

High aspect ratio (HAR) sample-induced aberrations seriously affect the topography measurement for the bottom of the microstructure by coherence scanning interferometry (CSI). Previous research proposed an aberration-compensation method using deformable mirrors at the conjugate position of the pupil. However, it failed to compensate for the shift-variant aberrations introduced by the HAR hybrid trench array composed of multiple trenches with different parameters. Here, we propose a computational aberration correction method for measuring the topography of the HAR structure by the particle swarm optimization (PSO) algorithm, without constructing a database or requiring prior knowledge, and a phase filter in the spatial frequency domain is constructed to restore interference signals distorted by shift-variant aberrations. Since the aberrations of each sampling point are basically unchanged in the field of view corresponding to a single trench, each trench under test can be considered as a separate isoplanatic region. Therefore, a multi-channel aberration correction scheme utilizing the virtual phase filter based on isoplanatic region segmentation is established for hybrid trench array samples. The PSO algorithm is adopted to derive the optimal Zernike polynomial coefficients representing the filter, in which the interference fringe contrast is taken as the optimization criterion. Additionally, aberrations introduce phase distortion within the 3D transfer function (3D-TF), while the 3D-TF bandwidth remains unchanged. Accordingly, we set the non-zero part of the 3D-TF as a window function to preprocess the interferogram by filtering out the signals outside the window. Finally, experiments are performed on a single trench sample and two hybrid trench array samples with depths ranging from 100 to 300 μm and widths from 10 to 30 μm to verify the effectiveness and accuracy of the proposed method.
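The optimization loop itself is a generic particle swarm: particles are candidate Zernike coefficient vectors and the objective stands in for measured fringe contrast. The quadratic surrogate below is invented for illustration (real contrast evaluation requires the interferometric data), and the hyperparameters are conventional PSO defaults, not the paper's values:

```python
import numpy as np

rng = np.random.default_rng(1)

def contrast(coeffs):
    """Toy surrogate for fringe contrast: maximal when the candidate
    Zernike coefficients cancel a (here, known) aberration."""
    true_aberration = np.array([0.5, -0.3, 0.1])
    return -np.sum((coeffs - true_aberration) ** 2)

def pso(objective, dim, n_particles=30, iters=200, w=0.7, c1=1.5, c2=1.5):
    pos = rng.uniform(-1, 1, (n_particles, dim))
    vel = np.zeros((n_particles, dim))
    pbest, pbest_val = pos.copy(), np.array([objective(p) for p in pos])
    gbest = pbest[pbest_val.argmax()].copy()
    for _ in range(iters):
        r1, r2 = rng.random((2, n_particles, dim))
        # Standard velocity update: inertia + pull toward personal/global bests.
        vel = w * vel + c1 * r1 * (pbest - pos) + c2 * r2 * (gbest - pos)
        pos = pos + vel
        vals = np.array([objective(p) for p in pos])
        improved = vals > pbest_val
        pbest[improved], pbest_val[improved] = pos[improved], vals[improved]
        gbest = pbest[pbest_val.argmax()].copy()
    return gbest

best = pso(contrast, dim=3)
```

PSO needs only objective evaluations, no gradients, which is why it suits criteria like fringe contrast that are measured rather than differentiated.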

Chemical technology
DOAJ Open Access 2025
Enhancing the uptake of intermittent preventive treatment for malaria in pregnancy: a scoping review of interventions and gender-informed approaches

Irene A. Kretchy, Deborah Atobrah, David A. Adumbire et al.

Background: Malaria infection in pregnancy is a critical determinant of maternal and neonatal health outcomes in endemic regions. Intermittent preventive treatment of malaria in pregnancy (IPTp) using sulfadoxine-pyrimethamine has been recommended by the World Health Organization (WHO), but its uptake remains low because of factors such as gender norms and expectations. However, interventions to optimize IPTp uptake, especially in malaria-endemic regions, have resulted in a decline in malaria during pregnancy, maternal and neonatal mortality, low birth weight, and placental parasitaemia. This scoping review aimed to synthesize evidence on IPTp uptake, particularly emphasizing gender-related strategies. Methods: The modified version of Arksey and O'Malley's framework and the Preferred Reporting Items for Systematic Reviews and Meta-Analyses extension for Scoping Reviews (PRISMA-ScR) were adopted for this review. Documents were retrieved from the following electronic databases and search engines: Scopus, Web of Science, CINAHL Complete (EBSCO), PubMed, WHO, Global Index Medicus, and Google Scholar. The titles and abstracts of the publications were independently screened via Rayyan review management software, and the data were organized using the reach, effectiveness, adoption, implementation, and maintenance (RE-AIM) framework and a gender analysis matrix. Results: A total of 32 studies met the inclusion criteria. The most reported criterion was the effectiveness of the interventions, which demonstrated an increase in IPTp uptake after the intervention. The gender analysis framework revealed that involving both men and women in decision-making processes, empowering women, and promoting shared roles could improve the success of IPTp interventions. Conclusions: Interventions to increase IPTp uptake should be targeted at empowering women through education, increasing financial independence, and making decisions about their health.

Arctic medicine. Tropical medicine, Infectious and parasitic diseases
arXiv Open Access 2025
Machine Learning and Data-Driven Methods in Computational Surface and Interface Science

Lukas Hörmann, Wojciech G. Stark, Reinhard J. Maurer

Nanoscale design of surfaces and interfaces is essential for modern technologies like organic LEDs, batteries, fuel cells, superlubricating surfaces, and heterogeneous catalysis. However, these systems often exhibit complex surface reconstructions and polymorphism, with properties influenced by kinetic processes and dynamic behavior. A lack of accurate and scalable simulation tools has limited computational modeling of surfaces and interfaces. Recently, machine learning and data-driven methods have expanded the capabilities of theoretical modeling, enabling, for example, the routine use of machine-learned interatomic potentials to predict energies and forces across numerous structures. Despite these advances, significant challenges remain, including the scarcity of large, consistent datasets and the need for computational and data-efficient machine learning methods. Additionally, a major challenge lies in the lack of accurate reference data and electronic structure methods for interfaces. Density Functional Theory, while effective for bulk materials, is less reliable for surfaces, and too few accurate experimental studies on interface structure and stability exist. Here, we will sketch the current state of data-driven methods and machine learning in computational surface science and provide a perspective on how these methods will shape the field in the future.

en cond-mat.mtrl-sci, physics.comp-ph
arXiv Open Access 2025
The Vendiscope: An Algorithmic Microscope For Data Collections

Amey P. Pasarkar, Adji Bousso Dieng

The evolution of microscopy, beginning with its invention in the late 16th century, has continuously enhanced our ability to explore and understand the microscopic world, enabling increasingly detailed observations of structures and phenomena. In parallel, the rise of data-driven science has underscored the need for sophisticated methods to explore and understand the composition of complex data collections. This paper introduces the Vendiscope, the first algorithmic microscope designed to extend traditional microscopy to computational analysis. The Vendiscope leverages the Vendi scores, a family of differentiable diversity metrics rooted in ecology and quantum mechanics, and assigns weights to data points based on their contribution to the overall diversity of the collection. These weights enable high-resolution data analysis at scale. We demonstrate this across biology, materials science, and machine learning (ML). We analyzed the 250 million protein sequences in the protein universe, discovering that over 200 million are near-duplicates and that AlphaFold fails on proteins with Gene Ontology (GO) functions that contribute most to diversity. Applying the Vendiscope to the Materials Project database led to similar findings: more than 85% of the crystals with formation energy data are near-duplicates and ML models perform poorly on materials that enhance diversity. Additionally, the Vendiscope can be used to study phenomena such as memorization in generative models. We used the Vendiscope to identify memorized training samples from 13 different generative models and found that the best-performing ones often memorize the training samples that contribute least to diversity. Our findings demonstrate that the Vendiscope can serve as a powerful tool for data-driven science.
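The Vendi Score underlying these weights has a compact definition: the exponential of the entropy of the eigenvalues of the normalized similarity matrix, ranging from 1 (all items identical) to n (all items fully distinct). A minimal sketch of that quantity, with toy similarity matrices as input (the Vendiscope's per-point weights build on it, but their exact computation is not reproduced here):

```python
import numpy as np

def vendi_score(K):
    """Vendi Score of an n-item collection from its n x n similarity
    matrix K (symmetric, K[i, i] == 1): exp of the Shannon entropy of
    the eigenvalues of K / n."""
    n = K.shape[0]
    eigvals = np.linalg.eigvalsh(K / n)
    eigvals = eigvals[eigvals > 1e-12]  # drop zero modes; 0*log(0) := 0
    return float(np.exp(-np.sum(eigvals * np.log(eigvals))))

# Extremes: identical items score 1, fully distinct items score n.
all_same = vendi_score(np.ones((4, 4)))  # ~1.0
all_diff = vendi_score(np.eye(4))        # ~4.0
```

Unlike cluster counts, the score is differentiable in the similarity entries, which is what makes per-point diversity attribution possible.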

en cs.LG, cond-mat.mtrl-sci
arXiv Open Access 2024
Dataversifying Natural Sciences: Pioneering a Data Lake Architecture for Curated Data-Centric Experiments in Life & Earth Sciences

Genoveva Vargas-Solar, Jérôme Darmont, Alejandro Adorjan et al.

This vision paper introduces a pioneering data lake architecture designed to meet Life & Earth sciences' burgeoning data management needs. As the data landscape evolves, the imperative to navigate and maximize scientific opportunities has never been greater. Our vision paper outlines a strategic approach to unify and integrate diverse datasets, aiming to cultivate a collaborative space conducive to scientific discovery. The core of the design and construction of a data lake is the development of formal and semi-automatic tools, enabling the meticulous curation of quantitative and qualitative data from experiments. Our unique "research-in-the-loop" methodology ensures that scientists across various disciplines are integrally involved in the curation process, combining automated, mathematical, and manual tasks to address complex problems, from seismic detection to biodiversity studies. By fostering reproducibility and applicability of research, our approach enhances the integrity and impact of scientific experiments. This initiative is set to improve data management practices, strengthening the capacity of Life & Earth sciences to solve some of our time's most critical environmental and biological challenges.

en cs.DB
arXiv Open Access 2024
DSBench: How Far Are Data Science Agents from Becoming Data Science Experts?

Liqiang Jing, Zhehui Huang, Xiaoyang Wang et al.

Large Language Models (LLMs) and Large Vision-Language Models (LVLMs) have demonstrated impressive language/vision reasoning abilities, igniting the recent trend of building agents for targeted applications such as shopping assistants or AI software engineers. Recently, many data science benchmarks have been proposed to investigate their performance in the data science domain. However, existing data science benchmarks still fall short when compared to real-world data science applications due to their simplified settings. To bridge this gap, we introduce DSBench, a comprehensive benchmark designed to evaluate data science agents with realistic tasks. This benchmark includes 466 data analysis tasks and 74 data modeling tasks, sourced from Eloquence and Kaggle competitions. DSBench offers a realistic setting by encompassing long contexts, multimodal task backgrounds, reasoning with large data files and multi-table structures, and performing end-to-end data modeling tasks. Our evaluation of state-of-the-art LLMs, LVLMs, and agents shows that they struggle with most tasks, with the best agent solving only 34.12% of data analysis tasks and achieving a 34.74% Relative Performance Gap (RPG). These findings underscore the need for further advancements in developing more practical, intelligent, and autonomous data science agents.

en cs.AI, cs.CL
arXiv Open Access 2024
A Semantic Approach for Big Data Exploration in Industry 4.0

Idoia Berges, Víctor Julio Ramírez-Durán, Arantza Illarramendi

The growing trends in automation, the Internet of Things, big data, and cloud computing technologies have led to the fourth industrial revolution (Industry 4.0), where it is possible to visualize and identify patterns and insights, which results in a better understanding of the data and can improve the manufacturing process. However, the task of data exploration often proves difficult for manufacturing experts, because they might also be interested in analyzing data that does not appear in pre-designed visualizations, and therefore they must be assisted by Information Technology experts. In this paper, we present a proposal, materialized in a semantic-based visual query system developed for a real Industry 4.0 scenario, that allows domain experts to explore and visualize data in a friendly way. The main novelty of the system is its combined use of captured data that are first semantically annotated and a 2D customized digital representation of a machine that is also linked with semantic descriptions. Those descriptions are expressed using terms of an ontology that models, among other things, the sensors used to capture indicators about the performance of a machine in an Industry 4.0 scenario. Moreover, this semantic description makes it possible to formulate queries at a higher level of abstraction, provide customized graphical visualizations of the results based on the format and nature of the data, and download enriched data enabling further types of analysis.

en cs.AI, cs.DB

Page 11 of 2,234,914