Eleven grand challenges in single-cell data science
David Lähnemann, Johannes Köster, Ewa Szczurek
et al.
The recent boom in microfluidics and combinatorial indexing strategies, combined with low sequencing costs, has empowered single-cell sequencing technology. Thousands—or even millions—of cells analyzed in a single experiment amount to a data revolution in single-cell biology and pose unique data science problems. Here, we outline eleven challenges that will be central to bringing this emerging field of single-cell data science forward. For each challenge, we highlight motivating research questions, review prior work, and formulate open problems. This compendium is for established researchers, newcomers, and students alike, highlighting interesting and rewarding problems for the coming years.
1120 citations
en
Biology, Medicine
Nucleus segmentation across imaging experiments: the 2018 Data Science Bowl
Juan C. Caicedo, A. Goodman, Kyle W. Karhohs
et al.
Segmenting the nuclei of cells in microscopy images is often the first step in the quantitative analysis of imaging data for biological and biomedical applications. Many bioimage analysis tools can segment nuclei in images but need to be selected and configured for every experiment. The 2018 Data Science Bowl attracted 3,891 teams worldwide to make the first attempt to build a segmentation method that could be applied to any two-dimensional light microscopy image of stained nuclei across experiments, with no human interaction. Top participants in the challenge succeeded in this task, developing deep-learning-based models that identified cell nuclei across many image types and experimental conditions without the need to manually adjust segmentation parameters. This represents an important step toward configuration-free bioimage analysis software tools. The 2018 Data Science Bowl challenged competitors to develop an accurate tool for segmenting stained nuclei from diverse light microscopy images. The winners deployed innovative deep-learning strategies to realize configuration-free segmentation.
733 citations
en
Medicine, Computer Science
Machine Learning in Python: Main developments and technology trends in data science, machine learning, and artificial intelligence
S. Raschka, Joshua Patterson, Corey Nolet
Smarter applications are making better use of the insights gleaned from data, having an impact on every industry and research discipline. At the core of this revolution lie the tools and methods that are driving it, from processing the massive volumes of data generated each day to learning from them and taking useful action. Deep neural networks, along with advancements in classical machine learning and scalable general-purpose graphics processing unit (GPU) computing, have become critical components of artificial intelligence, enabling many of these astounding breakthroughs and lowering the barrier to adoption. Python continues to be the preferred language for scientific computing, data science, and machine learning, boosting both performance and productivity by enabling the use of low-level libraries and clean high-level APIs. This survey offers insight into the field of machine learning with Python, taking a tour through important topics to identify some of the core hardware and software paradigms that have enabled it. We cover widely used libraries and concepts, collected together for holistic comparison, with the goal of educating the reader and driving the field of Python machine learning forward.
656 citations
en
Computer Science, Mathematics
Cybersecurity data science: an overview from machine learning perspective
Iqbal H. Sarker, A. Kayes
et al.
In a computing context, cybersecurity has been undergoing massive shifts in technology and operations in recent years, and data science is driving the change. Extracting security incident patterns or insights from cybersecurity data and building corresponding data-driven models is the key to making a security system automated and intelligent. To understand and analyze actual phenomena with data, various scientific methods, machine learning techniques, processes, and systems are used; this is commonly known as data science. In this paper, we focus on and briefly discuss cybersecurity data science, where data are gathered from relevant cybersecurity sources and the analytics complement the latest data-driven patterns to provide more effective security solutions. The concept of cybersecurity data science makes the computing process more actionable and intelligent compared to traditional approaches in the cybersecurity domain. We then discuss and summarize a number of associated research issues and future directions. Furthermore, we provide a machine-learning-based multi-layered framework for cybersecurity modeling. Overall, our goal is not only to discuss cybersecurity data science and relevant methods but also to highlight its applicability to data-driven intelligent decision making for protecting systems from cyber-attacks.
588 citations
en
Computer Science
Foundations of Data Science
Avrim Blum, J. Hopcroft, R. Kannan
Computer science as an academic discipline began in the 1960s. Emphasis was on programming languages, compilers, operating systems, and the mathematical theory that supported these areas. Courses in theoretical computer science covered finite automata, regular expressions, context-free languages, and computability. In the 1970s, the study of algorithms was added as an important component of theory. The emphasis was on making computers useful. Today, a fundamental change is taking place and the focus is more on applications. There are many reasons for this change. The merging of computing and communications has played an important role. The enhanced ability to observe, collect, and store data in the natural sciences, in commerce, and in other fields calls for a change in our understanding of data and how to handle it in the modern setting. The emergence of the web and social networks as central aspects of daily life presents both opportunities and challenges for theory.
377 citations
en
Computer Science
Data resource profile: a nationally representative linked pregnancy cohort in Canada integrating clinical, social, and environmental data
Sabrina Chiodo, Sonia M. Grandi, Jessica Gronsbell
et al.
Introduction
Perinatal outcomes are shaped by clinical, social, and environmental factors, yet Canada lacks a nationally representative pregnancy cohort capturing these influences at the individual-level. This gap has limited the ability to address multifactorial drivers of maternal and fetal health. To fill this need, we established a linked cohort integrating survey, clinical, and contextual data to support equity-focused, precision public health research in maternal health.
Methods
We linked the Canadian Community Health Survey (CCHS; 2000--2017) to the Discharge Abstract Database (DAD) using Statistics Canada's Social Data Linkage Environment. Eligible participants were female (as defined by the binary CCHS sex variable), aged 15-49 years, with a hospital delivery within two years of their CCHS interview. We excluded multifetal gestations and retained only the first delivery per individual. Area-level and environmental exposures (e.g., neighbourhood inequity, pollution, greenspace, and neighbourhood walkability) were appended via residential postal codes using the Postal Code Conversion File Plus (PCCF+).
Results
The cohort includes 13,360 singleton births. Pre-pregnancy data include sociodemographics, health behaviours, chronic conditions, psychosocial factors, and reproductive history. Contextual measures capture neighbourhood marginalization, air pollution, greenness, and built environment characteristics. In the CCHS, individuals who reported being pregnant at interview and those who did not (but later delivered) had similar characteristics (SMDs < 0.1), except for age and marital status. Data quality is supported by Statistics Canada's survey protocols, CIHI's hospital validation processes, and standardised geocoding.
Conclusion
Approved researchers can recreate this dataset within Statistics Canada's Research Data Centres using reproducible R code, which will become openly available on GitHub. The cohort enables research across descriptive epidemiology, causal inference, predictive modelling, and health equity evaluation, supporting investigations into multilevel determinants of maternal health. Future work should prioritise national mother--child linkages to expand life course research.
Demography. Population. Vital events
A local meshless method for the one-dimensional Fisher’s equation
Jianjun Cao, Bailing An, Enran Hou
et al.
This study presents a novel local meshless approach for solving one-dimensional Fisher’s equation, combining a local scheme, Gaussian radial basis functions (G-RBF), and a collocation technique. The method leverages the Gaussian basis’s nonlinear fitting capability, the sparsity of the local scheme to avoid ill-conditioned matrices, and the simplicity of collocation. After time discretization using a finite difference scheme, the method constructs local approximations at each collocation point using G-RBFs over small subsets of neighboring nodes. Numerical experiments confirm its effectiveness in solving Fisher-type problems, with errors decreasing smoothly as collocation points increase and maintaining stable accuracy over time. The proposed method demonstrates computational efficiency, robustness, and potential for handling large-scale reaction-diffusion systems.
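The local G-RBF collocation idea described in this abstract can be sketched in a few lines. The following is a minimal illustration, not the authors' implementation: the stencil size, Gaussian shape parameter, grid, time step, and the test problem u_t = u_xx + r·u·(1 − u) with homogeneous Dirichlet boundaries are all assumptions chosen for demonstration.

```python
import numpy as np

# Minimal sketch of a local Gaussian-RBF collocation solver for the 1D Fisher
# equation u_t = u_xx + r*u*(1 - u). All parameter choices below are
# illustrative assumptions, not the settings from the paper.

def d2_weights(xc, xs, eps):
    """Weights w such that u''(xc) ~ w @ u(xs), via G-RBF interpolation on a stencil."""
    A = np.exp(-eps * (xs[:, None] - xs[None, :]) ** 2)        # local RBF matrix
    d = xc - xs
    rhs = (4 * eps**2 * d**2 - 2 * eps) * np.exp(-eps * d**2)  # phi'' at xc
    return np.linalg.solve(A, rhs)

def fisher_rbf(n=41, k=5, eps=400.0, r=1.0, dt=1e-4, steps=2000):
    x = np.linspace(0.0, 1.0, n)
    u = np.exp(-50 * (x - 0.5) ** 2)                 # initial bump
    stencils, weights = [], []
    for i in range(n):                               # precompute local stencils
        idx = np.argsort(np.abs(x - x[i]))[:k]       # k nearest neighbours
        stencils.append(idx)
        weights.append(d2_weights(x[i], x[idx], eps))
    for _ in range(steps):                           # explicit Euler in time
        uxx = np.array([weights[i] @ u[stencils[i]] for i in range(n)])
        u = u + dt * (uxx + r * u * (1 - u))
        u[0] = u[-1] = 0.0                           # homogeneous Dirichlet ends
    return x, u
```

Each node solves only a small k × k system, which is what keeps the discrete operator sparse and avoids the ill-conditioning of a full global RBF matrix, as the abstract notes.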
Raman Spectroscopy Pre-Trained Encoder: A Self-Supervised Learning Approach for Data-Efficient Domain-Independent Spectroscopy Analysis
Abhiraam Eranti, Yogesh Tewari, Rafael Palacios
et al.
Deep-learning methods have boosted the analytical power of Raman spectroscopy, yet they still require large, task-specific, labeled datasets and often fail to transfer across application domains. This study explores pre-trained encoders as a solution. Pre-trained encoders have significantly impacted Natural Language Processing and Computer Vision with their ability to learn transferable representations that can be applied to a variety of datasets, significantly reducing the time and data required to create capable models. The following work puts forward a new approach that brings these benefits to Raman spectroscopy. The proposed approach, RSPTE (Raman Spectroscopy Pre-Trained Encoder), is designed to learn generalizable spectral representations without labels. RSPTE employs a novel domain adaptation strategy using unsupervised Barlow Twins decorrelation objectives to learn fundamental spectral patterns from multi-domain Raman spectroscopy datasets containing samples from medicine, biology, and mineralogy. Transferability is demonstrated through evaluation on several models created by fine-tuning RSPTE for different application domains: medicine (detection of melanoma and COVID), biology (pathogen identification), and agriculture. As an example, using only 20% of the dataset, models trained with RSPTE achieve accuracies ranging from 50% to 86% (depending on the dataset used), whereas without RSPTE the range is 9% to 57%. Using the full dataset, accuracies with RSPTE range from 81% to 97%, and without pre-training from 51% to 97%. Current methods and state-of-the-art models in Raman spectroscopy are compared to RSPTE for context, and RSPTE exhibits competitive results, especially with less data. These results provide evidence that the proposed RSPTE model can effectively learn and transfer generalizable spectral features across different domains, achieving accurate results with less data and in less time (both data collection time and training time).
Electrical engineering. Electronics. Nuclear engineering
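The Barlow Twins decorrelation objective mentioned in the RSPTE abstract has a compact generic form. The sketch below is the standard formulation (batch-normalised embeddings of two views, cross-correlation matrix pushed toward the identity), not the paper's training code; the batch size, embedding dimension, and λ value are illustrative assumptions.

```python
import numpy as np

# Generic sketch of the Barlow Twins decorrelation objective (standard form,
# not the RSPTE training code; lam and the shapes used are assumptions).

def barlow_twins_loss(z1, z2, lam=5e-3):
    n = z1.shape[0]
    z1 = (z1 - z1.mean(0)) / (z1.std(0) + 1e-8)   # batch-normalise each dimension
    z2 = (z2 - z2.mean(0)) / (z2.std(0) + 1e-8)
    c = z1.T @ z2 / n                              # d x d cross-correlation matrix
    on_diag = ((1.0 - np.diag(c)) ** 2).sum()      # invariance: pull diagonal to 1
    off_diag = (c ** 2).sum() - (np.diag(c) ** 2).sum()  # redundancy reduction
    return on_diag + lam * off_diag
```

Driving the cross-correlation matrix toward the identity makes the two views' embeddings agree per dimension while decorrelating different dimensions, which is the label-free training signal the abstract exploits.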
A systematic review of data science and machine learning applications to the oil and gas industry
Z. Tariq, M. Aljawad, A. Hasan
et al.
This study offers a detailed review of the roles of data science and machine learning (ML) in different petroleum engineering and geosciences segments, such as petroleum exploration, reservoir characterization, oil well drilling, production, and well stimulation, emphasizing the newly emerging field of unconventional reservoirs. The future of data science and ML in the oil and gas industry, highlighting what is required from ML for better prediction, is also discussed. This study also provides a comprehensive comparison of different ML techniques used in the oil and gas industry. With the arrival of powerful computers, advanced ML algorithms, and extensive data generation from different industry tools, we see a bright future in developing solutions to the complex problems in the oil and gas industry that were previously beyond the grip of analytical solutions or numerical simulation. ML tools can incorporate every detail in the log data and all information connected to the target data. Despite their limitations, they are not constrained by the limiting assumptions of analytical solutions or by the particular data and/or processing-power requirements of numerical simulators. This detailed and comprehensive study can serve as an exclusive reference for ML applications in the industry. Based on the review conducted, it was found that ML techniques offer great potential for solving problems in almost all areas of the oil and gas industry involving prediction, classification, and clustering. With the generation of huge volumes of data in everyday oil and gas industry activities, machine learning and big data handling techniques are becoming a necessity for a more efficient industry.
The role of data science in healthcare advancements: applications, benefits, and future prospects
S. Subrahmanya, Dasharathraj K. Shetty, Vathsala Patil
et al.
Data science is an interdisciplinary field that extracts knowledge and insights from structured and unstructured data using scientific methods, data mining techniques, machine-learning algorithms, and big data. The healthcare industry generates large datasets of useful information on patient demography, treatment plans, results of medical examinations, insurance, etc. The data collected from Internet of Things (IoT) devices also attract the attention of data scientists. Data science helps process, manage, analyze, and assimilate the large quantities of fragmented, structured, and unstructured data created by healthcare systems. These data require effective management and analysis to yield factual results. The processes of data cleansing, data mining, data preparation, and data analysis used in healthcare applications are reviewed and discussed in the article. The article provides insight into the status and prospects of big data analytics in healthcare, highlights the advantages, describes the frameworks and techniques used, outlines the challenges currently faced, and discusses viable solutions. Data science and big data analytics can provide practical insights and aid strategic decision-making concerning the health system. They help build a comprehensive view of patients, consumers, and clinicians. Data-driven decision-making opens up new possibilities to boost healthcare quality.
Public–private partnership in pipelining science of acute care ecosystem: Insights from Taiwan's Presidential Hackathon
Chao‐Wen Chen, Yung‐Sung Yeh, Ta‐Chien Chan
et al.
Introduction
The acute care system faced significant challenges in managing healthcare emergencies due to a lack of coordination between emergency services and logistical support. This disorganization undermined collaboration and response efficiency.
Methods
Taiwan's Presidential Hackathon introduced an innovative approach to improving the trauma system by integrating digital pipeline science through public–private partnerships (PPPs). This initiative specifically addressed inefficiencies and complexities in the acute care ecosystem, brought to light by the catastrophic 2014 gas explosion in Kaohsiung City.
Results
The hackathon led to the development of a unified digital platform for emergency data management. This platform significantly enhanced communication, data sharing, and coordination across healthcare sectors, culminating in the implementation of a digital pre-hospital emergency care system across multiple administrative regions.
Conclusion
Our experience demonstrated the effectiveness of leveraging digital technologies, PPPs, and the hackathon model to revolutionize emergency healthcare management and response systems through cross-sector collaboration.
Medicine (General), Public aspects of medicine
Research software: A key (neglected) component of the digital research infrastructure ecosystem
Anelda van der Walt, Kim Martin, Sumir Panji
et al.
Science, Science (General)
LLM-Based Data Science Agents: A Survey of Capabilities, Challenges, and Future Directions
Mizanur Rahman, Amran Bhuiyan, Mohammed Saidul Islam
et al.
Recent advances in large language models (LLMs) have enabled a new class of AI agents that automate multiple stages of the data science workflow by integrating planning, tool use, and multimodal reasoning across text, code, tables, and visuals. This survey presents the first comprehensive, lifecycle-aligned taxonomy of data science agents, systematically analyzing and mapping forty-five systems onto the six stages of the end-to-end data science process: business understanding and data acquisition, exploratory analysis and visualization, feature engineering, model building and selection, interpretation and explanation, and deployment and monitoring. In addition to lifecycle coverage, we annotate each agent along five cross-cutting design dimensions: reasoning and planning style, modality integration, tool orchestration depth, learning and alignment methods, and trust, safety, and governance mechanisms. Beyond classification, we provide a critical synthesis of agent capabilities, highlight strengths and limitations at each stage, and review emerging benchmarks and evaluation practices. Our analysis identifies three key trends: most systems emphasize exploratory analysis, visualization, and modeling while neglecting business understanding, deployment, and monitoring; multimodal reasoning and tool orchestration remain unresolved challenges; and over 90% lack explicit trust and safety mechanisms. We conclude by outlining open challenges in alignment stability, explainability, governance, and robust evaluation frameworks, and propose future research directions to guide the development of robust, trustworthy, low-latency, transparent, and broadly accessible data science agents.
Biomaterialomics: Data Science-driven Pathways to develop fourth-Generation Biomaterials.
B. Basu, N. Gowtham, Yang Xiao
et al.
Conventional approaches to developing biomaterials and implants require intuitive tailoring of process variables, long development cycles, and high expenses. To meet biomedical and clinical demands, it is critical to accelerate the production of personalized implantable biomaterials and biomedical devices. Building on the Materials Genome Initiative, we define the concept 'biomaterialomics' as the integration of multi-omics data and high-dimensional analysis with artificial intelligence (AI) tools throughout the entire pipeline of biomaterials development. This data-science-driven approach is envisioned to bring together on a single platform the computational tools, databases, experimental methods, machine learning, and advanced manufacturing (e.g., 3D printing) needed to develop fourth-generation biomaterials and implants, whose clinical performance will be predicted using 'digital twins'. While analysing the key elements of the concept of 'biomaterialomics', significant emphasis has been placed on effectively utilizing high-throughput biocompatibility data together with multiscale physics-based models, E-platform/online databases of clinical studies, and data science approaches, including metadata management, AI/Machine Learning (ML) algorithms, and uncertainty predictions. Such an integrated formulation will allow one to adopt cross-disciplinary approaches to establish processing-structure-property (PSP) linkages. A few published studies from the lead author's research group serve as case studies to illustrate the formulation and relevance of the 'biomaterialomics' approach for three emerging research themes, i.e., patient-specific implants, additive manufacturing, and bioelectronic medicine. The increased adaptability of AI/ML tools in biomaterials science, along with the training of the new generation of researchers in data science concepts, is strongly recommended.
STATEMENT OF SIGNIFICANCE: The currently practiced strategy to develop new biomaterials and implants requires intuitive tailoring of manufacturing protocols, biocompatibility assessment, and clinical studies. This leading-opinion review paper emphasizes the need to integrate the concepts and algorithms of data science with biomaterials science. It also emphasizes the need to establish a mathematically rigorous cross-disciplinary framework that allows systematic quantitative exploration and curation of the critical biomaterials knowledge needed to objectively drive innovation efforts within a suitable uncertainty quantification framework, as embodied in the 'biomaterialomics' concept, which integrates multi-omics data and high-dimensional analysis with artificial intelligence (AI) tools such as machine learning. The formulation of this approach has been demonstrated for patient-specific implants, additive manufacturing, and bioelectronic medicine.
Cross-Cultural Biology Teaching Using Next-Generation Science Standards
Jason Jimenez, Denis Dyvee Errabo
This study explored the Next Generation Science Standards (NGSS) in cross-cultural biology teaching through collaborative lesson studies involving educators from the USA and the Philippines. We employed grounded theory and examined iterative feedback processes during lesson development to refine learning exemplars. Validation of the learning exemplars affirmed their alignment with both the NGSS and the Philippine science education frameworks, ensuring cultural relevance and educational rigor. Five key themes were identified as pivotal: retrieval of prior knowledge, fostering meaningful learning experiences, enhancing memory and retention, fostering active engagement, and cultivating critical thinking skills—integral for developing culturally responsive curricula. Moreover, students became independent learners: responsible for their own learning; reflective and critical thinkers; problem solvers; inquiry-oriented, creative, and collaborative communicators; modelers and data analysts; and persistent, adaptable, and self-directed. Implications include enhancing educational policies to support cultural diversity and integrating cross-cultural learning exemplars to improve global teaching practices. This study underscored the transformative potential of cross-cultural collaboration in advancing science education, fostering engaging learning environments, and preparing students for global citizenship.
Dribbling emotions and anxiety in women’s football: a scoping review
Elena-Andreea Trandafirescu, Vladimir Potop, Ilie Mihai
et al.
Background and Study Aim. Fear of failure, choking under pressure, financial disparities, and concerns related to body image and social pressure may intensify on-field challenges. These factors make the competitive environment particularly demanding for female soccer players. The aim of the current review is to map the available evidence on anxiety in women's football over the past decade and to identify interventions designed to manage anxiety among female soccer players.
Material and Methods. A scoping review was conducted following the PRISMA-ScR guidelines. An extensive search was carried out across four major databases for publications dated between 2014 and 2024. Two independent reviewers screened titles and abstracts, followed by full-text screening and data extraction. Any disagreements were resolved by a third researcher. The findings were tabulated and synthesized in a narrative format. The study was pre-registered on the Open Science Framework.
Results. The search yielded 2885 unique records. A total of 21 texts were reviewed in full, and the final sample included 14 studies. Some of the records included mentioned modifications to psychosocial interventions, such as Virtual Reality (VR) relaxation sessions, pre-exercise sporting massage, and psychological training programs (including psycho-neuromuscular theory and visualization techniques). These interventions led to a reduction in anxiety, though their effectiveness varied significantly across different approaches. Our results suggest that research on women's football may disproportionately focus on psychological interventions for managing anxiety. This highlights the need for broader investigations into other factors that influence athletes' performance and well-being.
Conclusions. This scoping review helps clarify the current landscape of anxiety research in women's soccer. It highlights both promising interventions and gaps in the research over the past decade. Although the body of evidence on anxiety interventions is small, it suggests that psychological interventions may be effective in reducing anxiety among female soccer players.
Special aspects of education, Sports
Quantum Control Design by Lyapunov Trajectory Tracking and Optimal Control
Hongli Yang, Guohui Yu, Ivan Ganchev Ivanov
In this paper, we investigate a Lyapunov trajectory tracking design method that incorporates a Schrödinger equation with a dipole subterm and polarizability. Our findings suggest that the proposed control law can overcome the limitations of certain existing control laws that do not converge. By integrating a quadratic performance index, we introduce an optimal control law, which we subsequently analyze for stability and optimality. We also simulate the spin-1/2 particle system to illustrate our results. These findings are further validated through numerical illustrations involving 3D and 5D systems as well as a spin-1/2 particle system.
Validation in Forensic Text Comparison: Issues and Opportunities
Shunichi Ishihara, Sonia Kulkarni, Michael Carne
et al.
It has been argued in forensic science that the empirical validation of a forensic inference system or methodology should be performed by replicating the conditions of the case under investigation and using data relevant to the case. This study demonstrates that this validation requirement is also critical in forensic text comparison (FTC); otherwise, the trier-of-fact may be misled in their final decision. Two sets of simulated experiments are performed, one fulfilling the validation requirement and the other overlooking it, using mismatch in topics as a case study. Likelihood ratios (LRs) are calculated via a Dirichlet-multinomial model, followed by logistic-regression calibration. The derived LRs are assessed by means of the log-likelihood-ratio cost and visualized using Tippett plots. Following the experimental results, this paper also describes some of the essential research required in FTC by highlighting central issues and challenges unique to textual evidence. Deliberations on these issues and challenges will contribute to making a scientifically defensible and demonstrably reliable FTC available.
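The calibration and assessment steps named in this abstract, logistic-regression calibration of scores and the log-likelihood-ratio cost (Cllr), have standard generic forms that can be sketched briefly. The snippet below substitutes synthetic Gaussian scores for the Dirichlet-multinomial scoring stage and assumes equal priors; it illustrates the metric, not the study's actual pipeline.

```python
import numpy as np

# Generic sketch: logistic-regression calibration of comparison scores and the
# log-likelihood-ratio cost (Cllr). Synthetic scores replace the study's
# Dirichlet-multinomial stage; equal priors are an assumption.

def fit_calibration(s_ss, s_ds, lr=0.1, iters=5000):
    """Fit w, b so that w*s + b is a calibrated log-odds for same-source."""
    s = np.concatenate([s_ss, s_ds])
    y = np.concatenate([np.ones(len(s_ss)), np.zeros(len(s_ds))])
    w, b = 0.0, 0.0
    for _ in range(iters):                       # plain gradient ascent on log-likelihood
        p = 1.0 / (1.0 + np.exp(-(w * s + b)))
        w += lr * ((y - p) * s).mean()
        b += lr * (y - p).mean()
    return w, b

def to_llr2(scores, w, b):
    """Calibrated log2 likelihood ratios (equal priors assumed)."""
    return (w * np.asarray(scores) + b) / np.log(2)

def cllr(llr_ss, llr_ds):
    """Log-likelihood-ratio cost over same-source and different-source log2-LRs."""
    ss = np.log2(1 + 2.0 ** (-np.asarray(llr_ss)))   # penalty for small same-source LRs
    ds = np.log2(1 + 2.0 ** (np.asarray(llr_ds)))    # penalty for large different-source LRs
    return 0.5 * (ss.mean() + ds.mean())
```

A well-performing system drives Cllr toward 0, while a system that always reports LR = 1 scores exactly 1, which is what makes the metric suitable for the validation comparisons the abstract describes.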
Using ChatGPT for Data Science Analyses
Ozan Evkaya, Miguel de Carvalho
As a result of recent advancements in generative AI, the field of data science is undergoing significant change. The way practitioners construct their data science workflows is now irreversibly shaped by these advancements, particularly by tools like OpenAI's Data Analysis plugin. While it offers powerful support as a quantitative co-pilot, its limitations demand careful consideration in empirical analysis. This paper assesses the potential of ChatGPT for data science analyses, illustrating its capabilities for data exploration and visualization, as well as for commonly used supervised and unsupervised modeling tasks. While we focus here on how the Data Analysis plugin can serve as a co-pilot for data science workflows, its broader potential for automation is implicit throughout.
AutoDS: Towards Human-Centered Automation of Data Science
Dakuo Wang, Josh Andres, Justin D. Weisz
et al.
Data science (DS) projects often follow a lifecycle that consists of laborious tasks for data scientists and domain experts (e.g., data exploration, model training, etc.). Only recently have machine learning (ML) researchers developed promising automation techniques to aid data workers in these tasks. This paper introduces AutoDS, an automated machine learning (AutoML) system that aims to leverage the latest ML automation techniques to support data science projects. Data workers only need to upload their dataset; the system can then automatically suggest ML configurations, preprocess data, select algorithms, and train the model. These suggestions are presented to the user via a web-based graphical user interface and a notebook-based programming user interface. Our goal is to offer a systematic investigation of user interaction with, and perceptions of, an AutoDS system in solving a data science task. We studied AutoDS with 30 professional data scientists, where one group used AutoDS and the other did not, to complete a data science project. As expected, AutoDS improves productivity; yet surprisingly, we find that the models produced by the AutoDS group are of higher quality and contain fewer errors, but receive lower human confidence scores. We reflect on these findings by presenting design implications for incorporating automation techniques into human work in the data science lifecycle.
94 citations
en
Computer Science