Theory-Guided Data Science: A New Paradigm for Scientific Discovery from Data
A. Karpatne, G. Atluri, James H. Faghmous
et al.
Data science models, although successful in a number of commercial domains, have had limited applicability in scientific problems involving complex physical phenomena. Theory-guided data science (TGDS) is an emerging paradigm that aims to leverage the wealth of scientific knowledge for improving the effectiveness of data science models in enabling scientific discovery. The overarching vision of TGDS is to introduce scientific consistency as an essential component for learning generalizable models. Further, by producing scientifically interpretable models, TGDS aims to advance our scientific understanding by discovering novel domain insights. Indeed, the paradigm of TGDS has started to gain prominence in a number of scientific disciplines such as turbulence modeling, material discovery, quantum chemistry, bio-medical science, bio-marker discovery, climate science, and hydrology. In this paper, we formally conceptualize the paradigm of TGDS and present a taxonomy of research themes in TGDS. We describe several approaches for integrating domain knowledge in different research themes using illustrative examples from different disciplines. We also highlight some of the promising avenues of novel research for realizing the full potential of theory-guided data science.
1219 citations
en
Computer Science, Mathematics
What Is Data Science
Longbing Cao
Human-AI Collaboration in Data Science
Dakuo Wang, Justin D. Weisz, Michael J. Muller
et al.
The rapid advancement of artificial intelligence (AI) is changing our lives in many ways. One application domain is data science. New techniques in automating the creation of AI, known as AutoAI or AutoML, aim to automate the work practices of data scientists. AutoAI systems are capable of autonomously ingesting and pre-processing data, engineering new features, and creating and scoring models based on target objectives (e.g., accuracy or run-time efficiency). Though AutoAI is not yet widely adopted, we are interested in understanding how it will impact the practice of data science. We conducted interviews with 20 data scientists who work at a large, multinational technology company and practice data science in various business settings. Our goal is to understand their current work practices and how these practices might change with AutoAI. Reactions were mixed: while informants expressed concerns about the trend of automating their jobs, they also strongly felt it was inevitable. Despite these concerns, they remained optimistic about their future job security due to a view that the future of data science work will be a collaboration between humans and AI systems, in which both automation and human expertise are indispensable.
331 citations
en
Computer Science, Psychology
How Data Science Workers Work with Data: Discovery, Capture, Curation, Design, Creation
Michael J. Muller, Ingrid Lange, Dakuo Wang
et al.
With the rise of big data, there has been an increasing need for practitioners in this space and an increasing opportunity for researchers to understand their workflows and design new tools to improve them. Data science is often described as data-driven, comprising unambiguous data and proceeding through regularized steps of analysis. However, this view focuses more on abstract processes, pipelines, and workflows, and less on how data science workers engage with the data. In this paper, we build on the work of other CSCW and HCI researchers in describing the ways that scientists, scholars, engineers, and others work with their data, through analyses of interviews with 21 data science professionals. We set five approaches to data along a dimension of interventions: Data as given; as captured; as curated; as designed; and as created. Data science workers develop an intuitive sense of their data and processes, and actively shape their data. We propose new ways to apply these interventions analytically, to make sense of the complex activities around data practices.
300 citations
en
Computer Science
Process Mining for Python (PM4Py): Bridging the Gap Between Process- and Data Science
A. Berti, S. V. Zelst, Wil M.P. van der Aalst
Process mining, i.e., a sub-field of data science focusing on the analysis of event data generated during the execution of (business) processes, has seen a tremendous change over the past two decades. It started off in the early 2000s with limited to no tool support; nowadays, several software tools exist, both open-source, e.g., ProM and Apromore, and commercial, e.g., Disco, Celonis, ProcessGold, etc. The commercial process mining tools provide limited support for implementing custom algorithms. Moreover, both commercial and open-source process mining tools are often only accessible through a graphical user interface, which hampers their usage in large-scale experimental settings. Initiatives such as RapidProM provide process mining support in the scientific workflow-based data science suite RapidMiner. However, these offer limited to no support for algorithmic customization. In light of these limitations, in this paper, we present a novel process mining library, i.e., Process Mining for Python (PM4Py), that aims to bridge this gap, providing integration with state-of-the-art data science libraries, e.g., pandas, numpy, scipy and scikit-learn. We provide a global overview of the architecture and functionality of PM4Py, accompanied by some representative examples of its usage.
243 citations
en
Computer Science
Supervised and Unsupervised Learning for Data Science
Shear banding and flow instabilities in wormlike micelles: Modelling and mechanisms – A review
Sudheesh Parathakkatt, Vaisakh Kizhuveetil, Gokul G. K.
et al.
Worm-like micelles (WLMs) are dynamic, self-assembling supramolecular structures that exhibit complex viscoelastic behaviour due to their ability to undergo reversible scission, fusion, branching, and sequence rearrangement. This review provides a comprehensive analysis of recent theoretical advances in modelling WLM rheology, from classical reptation–scission theories to modern stochastic simulations and multi-scale population-balance frameworks. A central challenge addressed is the rheological indistinguishability of competing models under linear conditions, which renders inverse modelling ill-posed and necessitates the integration of experimental data, such as cryogenic transmission electron microscopy (cryo-TEM), small-angle neutron scattering (SANS), and flow birefringence, to constrain theoretical predictions. The article further explores the limitations of conventional models in capturing nonlinear responses, including shear banding and extensional strain hardening, and emphasizes the need for spatially resolved, structurally informed constitutive equations. Emerging tools, including neural networks and hybrid modular frameworks, are identified as promising solutions for bridging microscopic rearrangement dynamics with macroscopic flow behaviour. Ultimately, the development of predictive, physically grounded WLM models will be essential for advancing applications in formulation science, smart materials, and industrial processing.
Materials of engineering and construction. Mechanics of materials, Chemical technology
Correction to: IGFM: An Enhanced Graph Similarity Computation Method with Fine‑Grained Analysis
Min Pei, Jianke Yu, Chen Chen
et al.
Information technology, Electronic computers. Computer science
Geospatial for Good: Empowering Citizens for Sustainable Urban and Rural Futures
S. K. Malick, V. Chavan, V. Chavan
et al.
Geospatial technologies are rapidly emerging as pivotal tools for advancing sustainable urban and rural development through citizen empowerment in India and worldwide. This study systematically reviews peer-reviewed and grey literature to examine their integration with global frameworks, such as the SDGs, Paris Agreement, and Sendai Framework, while aligning with Indian initiatives like NAPCC, Smart Cities, Digital India, SVAMITVA, AMRUT, and the National Geospatial Policy 2022, with emphasis on the citizen as a crucial feedback factor. Employing thematic mapping and comparative analysis between the Global North and South, we evaluate applications in urban planning, mobility, energy, resilience, and health, highlighting platforms like PPGIS, VGI, Bhuvan, and 'Know Your DIGIPIN' for participatory data collection and decision-making. Our analysis reveals regional disparities in India, with the southern zone leading in innovation (35% adoption) and the eastern region focussing on disaster management (15%), along with global successes in disaster relief, welfare targeting, and immunisation tracking. Quantitative impacts include India's geospatial market growth to ₹63,000 crores by 2025 and AMRUT 2.0's rapid water and sewerage coverage expansion in many major cities. However, persistent challenges include technical knowledge gaps in academia, insufficient institutional support for geospatial startups, and barriers like low digital literacy and language limitations that restrict broader participation. We recommend enhanced geospatial education, open data policies, vernacular interfaces, and inclusive citizen science frameworks to bridge these gaps, foster equitable participation, and realise geospatial intelligence's full potential for resilient, data-driven sustainability.
Technology, Engineering (General). Civil engineering (General)
Deep nested U-structure network with frequency attention for building semantic segmentation
Khaled Moghalles, Zaid Al-Huda, Dalal AL-Alimi
et al.
The automated segmentation of buildings from remotely sensed imagery has undergone extensive research and application across various industrial domains. Despite this, several challenges persist, including incomplete internal extraction, low accuracy in edge segmentation, and difficulties in predicting irregular targets. We have introduced a novel approach to address these issues: an end-to-end residual U-structure embedded within a U-Net, enhanced by a frequency attention module and a hybrid loss function. The novel residual U-structure is introduced to replace the encode-decode blocks of traditional U-Nets, and the hybrid loss function is utilized to guide segmentation for more complete and accurate segmentation masks. A frequency attention module is also implemented to emphasize essential features and minimize irrelevant ones. A comparison of the proposed framework with other baseline schemes was conducted on four benchmark data sets, and the experimental results demonstrate that our framework performs better segmentation than other baseline state-of-the-art schemes.
Association of Race With Risk of Incident Cardiovascular Disease, Coronary Heart Disease, Heart Failure, and Stroke
Michael J. Domanski, MD, Colin O. Wu, PhD, Xin Tian, PhD
et al.
Background: In prior studies of cumulative risk factor exposure, self-identified race was independently associated with incident cardiovascular disease (CVD). A recent study suggests clinical, demographic, and socioeconomic factors explain racial differences. We used propensity score matching to study race as an independent incident CVD risk factor. Objectives: The purpose of this study was to assess race as an independent risk factor for incident CVD. Methods: We analyzed CARDIA (Coronary Artery Risk Development in Young Adults) study data using propensity score matching of White and Black women, and, separately, White and Black men, with respect to known CVD risk factors. Results: Black men (n = 487), compared to White men (n = 487), had higher risk of CVD (HR: 2.30; 95% CI: 1.36-3.89; P = 0.0014), stroke (HR: 5.00; 95% CI: 1.45-17.3; P = 0.0047), and congestive heart failure (CHF) (HR: 3.60; 95% CI: 1.34-9.70; P = 0.0067). Black women (n = 640), compared to White women (n = 640), had higher CVD risk (HR: 2.36; 95% CI: 1.17-4.78; P = 0.014) and stroke risk (HR: 2.80; 95% CI: 1.01-7.77; P = 0.039) and borderline significantly higher CHF risk (HR: 3.50; 95% CI: 0.73-16.9; P = 0.096). Risk of coronary heart disease did not differ significantly by race in either sex. Multivariable analyses showed racial differences in the associations of multiple risk factors with incident CVD events independent of other known CVD risk factors. Conclusions: Propensity score matching analyses demonstrate that race is an independent risk factor for incident CVD and its components, CHF, and stroke. Multivariable analyses suggest racial differences in Black vs White risk factor impact as the possible cause. Reasons for these differences remain to be explored.
Diseases of the circulatory (Cardiovascular) system, Medical emergencies. Critical care. Intensive care. First aid
Data Jamboree: A Party of Open-Source Software Solving Real-World Data Science Problems
Lucy D'Agostino McGowan, Shannon Tass, Sam Tyner
et al.
The evolving focus in statistics and data science education highlights the growing importance of computing. This paper presents the Data Jamboree, a live event that combines computational methods with traditional statistical techniques to address real-world data science problems. Participants, ranging from novices to experienced users, followed workshop leaders in using open-source tools like Julia, Python, and R to perform tasks such as data cleaning, manipulation, and predictive modeling. The Jamboree showcased the educational benefits of working with open data, providing participants with practical, hands-on experience. We compared the tools in terms of efficiency, flexibility, and statistical power, with Julia excelling in performance, Python in versatility, and R in statistical analysis and visualization. The paper concludes with recommendations for designing similar events to encourage collaborative learning and critical thinking in data science.
Finding Related Tables in Data Lakes for Interactive Data Science
Yi Zhang, Z. Ives
Many modern data science applications build on data lakes, schema-agnostic repositories of data files and data products that offer limited organization and management capabilities. There is a need to build data lake search capabilities into data science environments, so scientists and analysts can find tables, schemas, workflows, and datasets useful to their task at hand. We develop search and management solutions for the Jupyter Notebook data science platform, to enable scientists to augment training data, find potential features to extract, clean data, and find joinable or linkable tables. Our core methods also generalize to other settings where computational tasks involve execution of programs or scripts.
139 citations
en
Medicine, Computer Science
Group-format, peer-facilitated mental health promotion interventions for students in higher education settings: a scoping review protocol
Carrie Brooke-Sumner, Yandisa Sikweyiya, Mercilene T Machisa
et al.
Introduction: Young people in higher education face various stressors that can make them vulnerable to mental ill-health. Mental health promotion in this group therefore has important potential benefits. Peer-facilitated and group-format interventions may be feasible and sustainable. The scoping review outlined in this protocol aims to map the literature on group-format, peer-facilitated, in-person interventions for mental health promotion for higher education students attending courses on campuses in high and low/middle-income countries. Methods and analysis: Relevant studies will be identified through conducting searches of electronic databases, including Medline, CINAHL, Scopus, ERIC and PsycINFO. Searches will be conducted using Boolean operators (AND, OR, NOT) and truncation functions appropriate for each database. We will include a grey literature search. We will include articles from student participants of any gender, and published in peer-reviewed journals between 2008 and 2023. We will include English-language studies and all study types including randomised controlled trials, pilot studies and descriptive studies of intervention development. A draft charting table has been developed, which includes the fields: author, publication date, country/countries, aims, population and sample size, demographics, methods, intervention type, comparisons, peer training, number of sessions/duration of intervention, outcomes and details of measures. Ethics and dissemination: No primary data will be collected from research participants to produce this review so ethics committee approval is not required. All data will be collated from published peer-reviewed studies already in the public domain. We will publish the review in an open-access, peer-reviewed journal accessible to researchers in low/middle-income countries. This protocol is registered on Open Science Framework (https://osf.io/agbfj/).
Palliative Care Practice and Associated Factors Among Nurses Working in Chronic Care Units of Tertiary Hospitals in Ethiopia: A Cross-Sectional Study
Gembe M, Tariku T, Tesfaye T
et al.
Maycas Gembe (1), Tizita Tariku (1), Temamen Tesfaye (2), Endalew Hailu Hailu (2). (1) Department of Nursing, College of Health Science, Mattu University, Mattu, Ethiopia; (2) School of Nursing, Faculty of Health Science, Jimma University, Jimma, Ethiopia. Correspondence: Maycas Gembe, Tel +251 936704900, Email MaycasDembelu27@gmail.com. Introduction: Palliative care is not yet widely available. Each year, approximately 58 million people worldwide require palliative care. In Africa, an estimated 9.7 million people require palliative care. In Ethiopia, the integration of palliative care into the country’s health system has taken several years. Previous studies on palliative care have emphasized nurses’ knowledge of palliative care, and some studies have attempted to assess the practice of palliative care in specific areas of the country. However, we studied the practice of palliative care at a national level. The purpose of the study is therefore to identify nurses’ palliative care practice level and the factors associated with palliative care practice. Methods: We employed a facility-based cross-sectional study design to assess palliative care practices. We considered tertiary hospitals with functional cancer centers. To conduct this study, we included nurses from three hospitals; we recruited 255 nurses from these hospitals. The dependent variable, nurses’ palliative care practice, was first measured on a continuous scale and then converted to a dichotomous outcome variable based on a 75% score cutoff value. Based on the multivariable logistic regression output, we considered variables with a p-value of < 0.05 as the factors significantly associated with good palliative care practice. Results: We collected data from 249 nurses, with a response rate of 98%. Based on our findings, 57% of nurses had good palliative care practices. We also identified that the type of healthcare facility, nurses’ knowledge of palliative care, nurses’ level of academic qualification, and nurses’ attitudes towards palliative care were factors associated with the level of palliative care practice. Conclusion: The observed level of palliative care practice should be improved. The Federal Ministry of Health, along with healthcare facilities in general and tertiary hospitals in particular, should strengthen good palliative care practice by providing up-to-date training for nurses. Keywords: palliative care, nurses, tertiary hospital, Ethiopia
Epidemiology and Ecology of Toscana Virus Infection and Its Global Risk Distribution
Xue-Geng Hong, Mei-Qi Zhang, Fang Tang
et al.
Toscana virus (TOSV), a member of the Phlebovirus genus transmitted by sandflies, is acknowledged for its capacity to cause neurological infections and is widely distributed across Mediterranean countries. The potential geographic distribution and risk to the human population remained obscure due to its neglected nature. We searched PubMed and Web of Science for articles published between 1 January 1971 and 30 June 2023 to extract data on TOSV detection in vectors, vertebrates and humans, clinical information of human patients, as well as the occurrence of two identified sandfly vectors for TOSV. We further predicted the global distribution of the two sandfly vectors, based on which the global risk of TOSV was projected, after incorporating the environmental, ecoclimatic, biological, and socioeconomic factors. A total of 1342 unique studies were retrieved, among which 389 met the selection criteria and were included for data extraction. TOSV infections were documented in 10 sandfly species and 14 species of vertebrates, as well as causing a total of 7571 human infections. The occurrence probabilities of two sandfly vectors have demonstrated the greatest contributions to the potential distribution of TOSV infection risk. This study provides a comprehensive overview of global TOSV distribution and potential risk zones. Future surveillance and intervention programs should prioritize high-risk areas based on updated quantitative analyses.
Factors influencing weekend travel destination choice: A study in Ho Chi Minh city, Vietnam
To Ngoc Thinh, Bui Phuong Linh, Tran Tuan Anh
et al.
Nowadays, weekend travel is gradually gaining people's attention due to societal impacts, with the desire to improve health, relax, rest, and entertain after days of exhausting work. Therefore, the development of weekend travel is a strategy of interest to managers and leaders, leading to intense competition among destinations. Although it has been long established worldwide, weekend travel in Vietnam has only recently gained popularity, primarily among young people. Therefore, researching the factors influencing the decision to choose weekend travel destinations is significant in developing strategies for this type of tourism. The research results show that the choice of weekend travel destinations by Ho Chi Minh City tourists is driven by various internal and external factors. Among them, internal motivations, income, convenience in the trip, the image of the destination, etc., are factors rated highly by tourists. There are significant differences in some internal and external factors according to age groups. Model testing and research hypotheses indicate that 66.4% of destination choices are influenced by the proposed factors in the model. Among them, the destination image has the most significant impact, followed by income, internal motivations, and distance. The remaining factors in the model have low or no impact on the satisfaction and commitment to return to weekend travel destinations for Ho Chi Minh City tourists.
Social Sciences, Management. Industrial management
CRS-Diff: Controllable Remote Sensing Image Generation with Diffusion Model
Datao Tang, Xiangyong Cao, Xingsong Hou
et al.
The emergence of generative models has revolutionized the field of remote sensing (RS) image generation. Despite generating high-quality images, existing methods are limited in relying mainly on text control conditions, and thus do not always generate images accurately and stably. In this paper, we propose CRS-Diff, a new RS generative framework specifically tailored for RS image generation, leveraging the inherent advantages of diffusion models while integrating more advanced control mechanisms. Specifically, CRS-Diff can simultaneously support text-condition, metadata-condition, and image-condition control inputs, thus enabling more precise control to refine the generation process. To effectively integrate multiple condition control information, we introduce a new conditional control mechanism to achieve multi-scale feature fusion, thus enhancing the guiding effect of control conditions. To our knowledge, CRS-Diff is the first multiple-condition controllable RS generative model. Experimental results in single-condition and multiple-condition cases have demonstrated the superior ability of our CRS-Diff to generate RS images both quantitatively and qualitatively compared with previous methods. Additionally, our CRS-Diff can serve as a data engine that generates high-quality training data for downstream tasks, e.g., road extraction. The code is available at https://github.com/Sonettoo/CRS-Diff.
Benchmarking Data Science Agents
Yuge Zhang, Qiyang Jiang, Xingyu Han
et al.
In the era of data-driven decision-making, the complexity of data analysis necessitates advanced expertise and tools of data science, presenting significant challenges even for specialists. Large Language Models (LLMs) have emerged as promising aids as data science agents, assisting humans in data analysis and processing. Yet their practical efficacy remains constrained by the varied demands of real-world applications and complicated analytical processes. In this paper, we introduce DSEval, a novel evaluation paradigm, as well as a series of innovative benchmarks tailored for assessing the performance of these agents throughout the entire data science lifecycle. Incorporating a novel bootstrapped annotation method, we streamline dataset preparation, improve the evaluation coverage, and expand benchmarking comprehensiveness. Our findings uncover prevalent obstacles and provide critical insights to inform future advancements in the field.
The Categorical Data Map: A Multidimensional Scaling-Based Approach
Frederik L. Dennig, Lucas Joos, Patrick Paetzold
et al.
Categorical data does not have an intrinsic definition of distance or order, and therefore, established visualization techniques for categorical data only allow for a set-based or frequency-based analysis, e.g., through Euler diagrams or Parallel Sets, and do not support a similarity-based analysis. We present a novel dimensionality reduction-based visualization for categorical data, which is based on defining the distance of two data items as the number of varying attributes. Our technique enables users to pre-attentively detect groups of similar data items and observe the properties of the projection, such as attributes strongly influencing the embedding. Our prototype visually encodes data properties in an enhanced scatterplot-like visualization, encoding attributes in the background to show the distribution of categories. In addition, we propose two graph-based measures to quantify the plot's visual quality, which rank attributes according to their contribution to cluster cohesion. To demonstrate the capabilities of our similarity-based approach, we compare it to Euler diagrams and Parallel Sets regarding visual scalability and show its benefits through an expert study with five data scientists analyzing the Titanic and Mushroom datasets with up to 23 attributes and 8124 category combinations. Our results indicate that the Categorical Data Map offers an effective analysis method, especially for large datasets with a high number of category combinations.
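The distance used in the Categorical Data Map abstract above, the number of attributes on which two data items differ, can be illustrated with a minimal sketch. The function names and the toy records are hypothetical and chosen only for illustration; this is not the authors' implementation, and it covers only the distance step, not the projection or the visual encoding.

```python
from itertools import combinations


def attribute_distance(a, b):
    """Count the attributes on which two equally long records differ."""
    if len(a) != len(b):
        raise ValueError("records must have the same number of attributes")
    return sum(x != y for x, y in zip(a, b))


def distance_matrix(records):
    """Build the symmetric pairwise distance matrix for a list of records."""
    n = len(records)
    dist = [[0] * n for _ in range(n)]
    for i, j in combinations(range(n), 2):
        d = attribute_distance(records[i], records[j])
        dist[i][j] = dist[j][i] = d
    return dist


# Hypothetical toy records (class, sex, survived); not data from the paper.
records = [
    ("first", "female", "yes"),
    ("first", "male", "yes"),
    ("third", "male", "no"),
]
matrix = distance_matrix(records)
for row in matrix:
    print(row)
```

A matrix of this kind could then be handed to any multidimensional scaling routine that accepts precomputed dissimilarities in order to obtain a two-dimensional layout.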