D. Donoho
Results for "data science"
Showing 20 of ~44,680,174 results · from DOAJ, CrossRef, Semantic Scholar
A. Berti, S. V. Zelst, Wil M.P. van der Aalst
Process mining, a sub-field of data science focusing on the analysis of event data generated during the execution of (business) processes, has seen tremendous change over the past two decades. Starting in the early 2000s with little to no tool support, the field now offers several software tools, both open-source (e.g., ProM and Apromore) and commercial (e.g., Disco, Celonis, ProcessGold). The commercial process mining tools provide limited support for implementing custom algorithms. Moreover, both commercial and open-source process mining tools are often accessible only through a graphical user interface, which hampers their usage in large-scale experimental settings. Initiatives such as RapidProM provide process mining support in the scientific workflow-based data science suite RapidMiner, but offer little to no support for algorithmic customization. In light of the above, this paper presents a novel process mining library, Process Mining for Python (PM4Py), that aims to bridge this gap by providing integration with state-of-the-art data science libraries, e.g., pandas, numpy, scipy and scikit-learn. We provide a global overview of the architecture and functionality of PM4Py, accompanied by representative examples of its usage.
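The analysis the abstract describes can be illustrated with a small, library-free sketch: counting the directly-follows pairs that underpin process discovery from a toy event log. The event data and field layout here are invented for illustration; in a real PM4Py workflow the log would live in a pandas DataFrame or an XES file and be handled through PM4Py's own API.

```python
from collections import Counter

# Hypothetical event log: (case_id, activity, timestamp) tuples.
events = [
    ("c1", "register", "2023-01-01T09:00"),
    ("c1", "review",   "2023-01-01T10:00"),
    ("c1", "pay",      "2023-01-01T11:00"),
    ("c2", "register", "2023-01-02T09:30"),
    ("c2", "pay",      "2023-01-02T09:45"),
]

def directly_follows(events):
    """Count how often one activity immediately follows another within a case."""
    by_case = {}
    # Order events by case, then by timestamp, so traces come out in order.
    for case, act, ts in sorted(events, key=lambda e: (e[0], e[2])):
        by_case.setdefault(case, []).append(act)
    pairs = Counter()
    for acts in by_case.values():
        pairs.update(zip(acts, acts[1:]))
    return dict(pairs)

print(directly_follows(events))
# {('register', 'review'): 1, ('review', 'pay'): 1, ('register', 'pay'): 1}
```

The directly-follows relation is the starting point for many discovery algorithms (e.g., the Alpha miner), which is why large-scale, scriptable access matters for experimentation.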
Dominik Kret, Wiktoria Szlachta, Daria Twardowska et al.
Myocardial bridging (MB) of the coronary arteries has long been considered a benign anatomical variant with no significant clinical relevance. However, the development of modern imaging and functional diagnostic methods indicates that its impact on myocardial perfusion may become apparent under conditions of increased hemodynamic load, especially during physical exertion. In the context of sports science, this issue is particularly important, as even subtle coronary flow disturbances can modulate the heart's response to training load and affect exercise tolerance. The aim of this study was to present the current state of knowledge on the anatomy, pathophysiology, clinical picture, diagnosis, and therapeutic management of myocardial bridges, with particular emphasis on their functional significance under conditions of stress typical of physical activity. The article is a narrative review of the literature based on a critical analysis of anatomical, observational, and imaging studies, as well as works devoted to functional assessment and treatment strategies. Available data indicate that MBs are a common structure, but heterogeneous in terms of morphology and hemodynamic significance. Although they remain asymptomatic in most people, in selected cases they may be associated with myocardial ischemia, arrhythmias, and reduced exercise tolerance. Myocardial bridges require individualized clinical and functional assessment, and their presence should not be automatically considered insignificant, especially in the context of exercise capacity assessment.
Zeynep Engin, P. Treleaven
The data science technologies of artificial intelligence (AI), the Internet of Things (IoT), big data and behavioral/predictive analytics, and blockchain are poised to revolutionize government and create a new generation of GovTech start-ups. The impact of the ‘smartification’ of public services and the national infrastructure will be far more significant than in any other sector, given government's function and importance to every institution and individual. Potential GovTech systems include Chatbots and intelligent assistants for public engagement, Robo-advisors to support civil servants, real-time management of the national infrastructure using IoT and blockchain, automated compliance/regulation, public records securely stored in blockchain distributed ledgers, online judicial and dispute resolution systems, and laws/statutes encoded as blockchain smart contracts. Government is potentially the major ‘client’ and also ‘public champion’ for these new data technologies. This review paper uses our simple taxonomy of government services to provide an overview of data science automation being deployed by governments worldwide. The goal of this review paper is to encourage the Computer Science community to engage with government to develop these new systems to transform public services and support the work of civil servants.
Jianqing Fan, Runze Li, Cun-Hui Zhang et al.
Lana Pacheco Franco‐Gedda, Karina Rodrigues, Matias Noll et al.
ABSTRACT Background and Aims Vitamin D deficiency is a major public health issue, with varying individual responses to supplementation. Genetic factors, especially single-nucleotide polymorphisms (SNPs) in Vitamin D metabolism genes, likely play a key role. This protocol proposes a systematic review to explore how genetic variability affects serum 25-hydroxyvitamin D [25(OH)D] levels after supplementation. Methods This protocol adheres to the Preferred Reporting Items for Systematic Review and Meta-Analysis Protocols (PRISMA-P). The literature search will be conducted across MEDLINE, Scopus, Web of Science, and Embase, without restrictions on publication date or language. Study selection will be guided by the Population, Exposure, Comparator, Outcomes, Study Design (PECOS) framework, focusing on randomized clinical trials that report pre- and post-supplementation serum 25(OH)D levels alongside genotype data. Inclusion criteria comprise adults and elderly individuals, of both sexes and any ethnicity, who received Vitamin D supplementation and have SNP data, while exclusion criteria rule out studies with confounding factors such as pre-existing conditions or use of medications affecting Vitamin D status. Data extraction will include study characteristics, participant demographics, intervention details, SNPs, and serum 25(OH)D data. Inter-rater reliability will be assessed using Cohen's kappa coefficient. A descriptive synthesis will summarize the findings, and if feasible, a meta-analysis will be conducted. The primary outcome will be changes in serum 25(OH)D concentrations. Heterogeneity among studies will be quantified using the I² statistic. The methodological quality of studies will be assessed using the Joanna Briggs Institute checklist, and the overall certainty of evidence will be evaluated using the Grading of Recommendations, Assessment, Development, and Evaluations (GRADE) approach.
Conclusion By identifying genetic subgroups with differential responses to Vitamin D supplementation, the findings are expected to contribute to the development of personalized supplementation strategies. These insights may enhance health interventions by optimizing supplementation protocols based on genetic predispositions, ultimately improving health outcomes. Trial Registration This protocol has been registered with the International Prospective Register of Systematic Reviews (PROSPERO), ID: CRD42023449836.
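The I² statistic mentioned in the protocol is computed from Cochran's Q under a fixed-effect model. A minimal sketch, with made-up effect sizes and variances rather than data from any actual trial:

```python
def i_squared(effects, variances):
    """Cochran's Q and the I² heterogeneity statistic (fixed-effect weights)."""
    w = [1.0 / v for v in variances]                       # inverse-variance weights
    pooled = sum(wi * yi for wi, yi in zip(w, effects)) / sum(w)
    q = sum(wi * (yi - pooled) ** 2 for wi, yi in zip(w, effects))
    df = len(effects) - 1
    i2 = max(0.0, (q - df) / q) * 100 if q > 0 else 0.0    # percent of variation
    return q, i2

# Illustrative per-study changes in serum 25(OH)D (nmol/L) and their variances.
q, i2 = i_squared([20.0, 25.0, 40.0], [4.0, 4.0, 9.0])
print(round(q, 2), round(i2, 1))
```

I² expresses the share of between-study variation not attributable to chance; values above roughly 75% are conventionally read as considerable heterogeneity.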
J. J. Carrera-Hernández
<p>This work presents Mexico's High Resolution Climate Database (MexHiResClimDB), which is a newly developed gridded, high-resolution climate dataset comprising daily, monthly and yearly precipitation and temperature (<span class="inline-formula"><i>T</i><sub>min</sub></span>, <span class="inline-formula"><i>T</i><sub>max</sub></span>, <span class="inline-formula"><i>T</i><sub>avg</sub></span>). This new database provides the largest temporal coverage of the aforementioned climate variables at the highest spatial resolution (20 arcsec, or 560 m on Mexico's CCL projection) when compared to the other currently available gridded datasets for Mexico, and its development has allowed for the analysis of the country's climate extremes for the 1951–2020 period. By comparing the spatial distribution of precipitation from the MexHiResClimDB with other gridded data (Daymet, L15, CHIRPS and PERSIANN CDR), it was found that the precipitation provided by this new dataset adequately represents the spatial variation of extreme precipitation events, in particular for the precipitation that occurred during 15–16 September 2013, caused by the presence of Tropical Storm Manuel in the Pacific Ocean and Hurricane Ingrid (Cat 1) in the Gulf of Mexico. Using data from 61 days retrieved from Automated Weather Stations located throughout Mexico – and corresponding to the two months with the largest precipitation in Mexico – it was found that precipitation data from MexHiResClimDB has the lowest MAE (8.7 mm), compared to those of L15 (9.5 mm), Daymet (10.1 mm) and CHIRPS (11.7 mm). For <span class="inline-formula"><i>T</i><sub>min</sub></span> and <span class="inline-formula"><i>T</i><sub>max</sub></span>, the lowest MAE was obtained with MexHiResClimDB (1.7 and 1.8 °C, respectively), followed by Daymet (2.0 °C for both temperatures) and L15 (2.4 and 2.5 °C). 
With this new database an analysis of the extreme events of precipitation and temperature in Mexico for the 1951–2020 period was undertaken: the wettest year was 1958, the wettest day 26 September 1970, and September of 2013 the wettest month. It was also found that eight out of the ten days with the highest <span class="inline-formula"><i>T</i><sub>min</sub></span> occurred in 2020, the two months with the highest <span class="inline-formula"><i>T</i><sub>min</sub></span> were July and August of 2020 and that the six years with the highest <span class="inline-formula"><i>T</i><sub>min</sub></span> were 2015–2020. When <span class="inline-formula"><i>T</i><sub>max</sub></span> was analysed, it was found that the hottest day was 15 June 1998, while June of 1998 was the hottest month and 2020 the hottest year, and that the four hottest years occurred between 2011–2020. Nationwide (and considering 1961–1990 as the baseline period), <span class="inline-formula"><i>T</i><sub>min</sub></span>, <span class="inline-formula"><i>T</i><sub>avg</sub></span> and <span class="inline-formula"><i>T</i><sub>max</sub></span> have increased, with their anomalies drastically increasing in recent years and reaching values above 1.0 °C in 2020. At the same time, precipitation has also decreased in recent years – which combined with the increase in temperature will have severe impacts on water availability. This new database provides a tool to quantify – in detail – the spatio-temporal variability of climate throughout Mexico.</p> <p>The MexHiResClimDB entire dataset is available on Figshare (<a href="https://doi.org/10.6084/m9.figshare.c.7689428.v2">https://doi.org/10.6084/m9.figshare.c.7689428.v2</a>, <span class="cit" id="xref_altparen.1"><a href="#bib1.bibx16">Carrera-Hernández</a>, <a href="#bib1.bibx16">2025</a><a href="#bib1.bibx16">a</a></span>).</p>
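The MAE figures used to rank the gridded products above follow the usual definition: the mean absolute difference between station observations and the co-located gridded estimates. A minimal sketch with invented values (not the paper's station data):

```python
def mae(observed, estimated):
    """Mean absolute error between station observations and gridded estimates."""
    assert len(observed) == len(estimated)
    return sum(abs(o - e) for o, e in zip(observed, estimated)) / len(observed)

# Illustrative daily precipitation (mm): station values vs. two gridded products.
station = [12.0, 0.0, 35.5, 7.2]
grid_a  = [10.5, 0.4, 30.0, 8.0]
grid_b  = [15.0, 2.0, 25.0, 3.0]
print(mae(station, grid_a), mae(station, grid_b))  # lower MAE = closer fit
```

Because MAE weights all errors linearly, it is less dominated by single extreme events than RMSE, which suits the comparison of precipitation products across many stations.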
Fiona Preston-Whyte , Toshka Barnardo , Danica Marlin et al.
Data gaps limit solutions and policy development for environmental issues. Citizen science offers a possible way to reduce data gaps at limited cost while enhancing environmental education (EE). While highly effective at the latter, citizen science campaigns rarely produce reliable, comparable, and meaningful data. This often results from fragmented awareness, varying data collection methods, and little training prior to data collection. This article explores how Massive Open Online Courses (MOOCs) can be used to train citizen scientists, increase the value of citizen science data, and ensure that resources invested in citizen science initiatives are used more efficiently. We use a beach macrolitter monitoring course developed by Sustainable Seas Trust (SST) (an NGO/NPO) and GRID-Arendal (a research foundation) as a case study in Africa, since the marine litter issue has widespread public support, and beaches are pleasant locations that attract potential citizen scientists. Beach macrolitter surveys use everyday equipment, and the monitoring methods are simple if individuals are supported with appropriate training. This is especially relevant in Africa, where plastic pollution is forecast to increase faster than in other regions, and resources for research can be limited. This article offers a modified problem-solution model (mPSM) perspective, considering the challenges and solutions of MOOC development by two organisations working in the same space with limited resources. Challenges to inclusivity for online training in Africa included language barriers and limited technological access. Using Africa as a case study, we show that by combining professional abilities, inclusive digital education can be achieved using data-light MOOCs, offline engagement and other inclusive strategies to overcome the challenges of m- (mobile) and e- (electronic) learning. 
This kind of EE can be a powerful tool in developing reliable data while enhancing citizens’ agency in working towards Sustainable Development Goals (SDGs).
Bin Yu
Significance Predictability, computability, and stability (PCS) are three core principles of data science. They embed the scientific principles of prediction and replication in data-driven decision making while recognizing the central role of computation. Based on these principles, we propose the PCS framework, including workflow and documentation (in R Markdown or Jupyter Notebook). The PCS framework aims at responsible, reliable, reproducible, and transparent analysis across fields of science, social science, engineering, business, and government. It can be used as a recommendation system for scientific hypothesis generation and experimental design. In particular, we propose (basic) PCS inference for reliability measures on data results, extending statistical inference to a much broader scope as current data science practice entails. Building and expanding on principles of statistics, machine learning, and scientific inquiry, we propose the predictability, computability, and stability (PCS) framework for veridical data science. Our framework, composed of both a workflow and documentation, aims to provide responsible, reliable, reproducible, and transparent results across the data science life cycle. The PCS workflow uses predictability as a reality check and considers the importance of computation in data collection/storage and algorithm design. It augments predictability and computability with an overarching stability principle. Stability expands on statistical uncertainty considerations to assess how human judgment calls impact data results through data and model/algorithm perturbations. As part of the PCS workflow, we develop PCS inference procedures, namely PCS perturbation intervals and PCS hypothesis testing, to investigate the stability of data results relative to problem formulation, data cleaning, modeling decisions, and interpretations. We illustrate PCS inference through neuroscience and genomics projects of our own and others. 
Moreover, we demonstrate its favorable performance over existing methods in terms of receiver operating characteristic (ROC) curves in high-dimensional, sparse linear model simulations, including a wide range of misspecified models. Finally, we propose PCS documentation based on R Markdown or Jupyter Notebook, with publicly available, reproducible codes and narratives to back up human choices made throughout an analysis. The PCS workflow and documentation are demonstrated in a genomics case study available on Zenodo.
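The stability principle described above, perturb the data, redo the analysis, and report how much the result moves, can be illustrated with a small stdlib-only sketch. This is our simplified illustration of the idea (a bootstrap perturbation of a least-squares slope), not the paper's exact PCS inference procedure:

```python
import random
import statistics

random.seed(0)

def slope(xs, ys):
    """Least-squares slope of y on x."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = sum((x - mx) ** 2 for x in xs)
    return num / den

# Illustrative data with a roughly linear trend (true slope 2, noisy).
xs = list(range(20))
ys = [2.0 * x + random.gauss(0, 3) for x in xs]

# Stability check: refit the slope under resampling perturbations of the data
# and summarize the spread of the resulting estimates.
slopes = []
for _ in range(200):
    idx = [random.randrange(len(xs)) for _ in range(len(xs))]
    slopes.append(slope([xs[i] for i in idx], [ys[i] for i in idx]))
slopes.sort()
lo, hi = slopes[4], slopes[-5]  # ~95% perturbation interval
print(f"slope ≈ {slope(xs, ys):.2f}, perturbation interval [{lo:.2f}, {hi:.2f}]")
```

A result whose perturbation interval is narrow relative to its size is, in PCS terms, stable to this class of data perturbations; the framework extends the same idea to model and judgment-call perturbations.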
J. Pei
Data are invaluable. How can we assess the value of data objectively, systematically and quantitatively? Pricing data, or information goods in general, has been studied and practiced in dispersed areas and principles, such as economics, marketing, electronic commerce, data management, data mining and machine learning. In this article, we present a unified, interdisciplinary and comprehensive overview of this important direction. We examine various motivations behind data pricing, understand the economics of data pricing and review the development and evolution of pricing models according to a series of fundamental principles. We discuss both digital products and data products. We also consider a series of challenges and directions for future work.
James F. Steiner, Edward Nathan, Kun Hu et al.
We present the first X-ray spectropolarimetric results for Cygnus X-1 in its soft state from a campaign of five Imaging X-ray Polarimetry Explorer (IXPE) observations conducted during 2023 May–June. Companion multiwavelength data during the campaign are likewise shown. The 2–8 keV X-rays exhibit a net polarization degree PD = 1.99% ± 0.13% (68% confidence). The polarization signal is found to increase with energy across IXPE's 2–8 keV bandpass. The polarized X-rays exhibit an energy-independent polarization angle of PA = −25.7° ± 1.8° east of north (68% confidence). This is consistent with being aligned with Cyg X-1's au-scale compact radio jet and its parsec-scale radio lobes. In comparison to earlier hard-state observations, the soft state exhibits a factor of 2 lower polarization degree but a similar trend with energy and a similar (also energy-independent) position angle. When scaling by the natural unit of the disk temperature, we find the appearance of a consistent trend line in the polarization degree between the soft and hard states. Our favored polarimetric model indicates that Cyg X-1's spin is likely high (a_* ≳ 0.96). The substantial X-ray polarization in Cyg X-1's soft state is most readily explained as resulting from a large portion of X-rays emitted from the disk returning to and reflecting off the disk surface, generating a high polarization degree and a polarization direction parallel to the black hole spin axis and radio jet. In IXPE's bandpass, the polarization signal is dominated by the returning reflection emission. This constitutes polarimetric evidence for strong gravitational lensing of X-rays close to the black hole.
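The quoted PD and PA follow from the measured Stokes parameters via the standard conversion formulas. A minimal sketch of that conversion (the formulas are standard, but the Stokes values below are invented for illustration, and this is not IXPE's actual analysis pipeline):

```python
import math

def polarization(i, q, u):
    """Polarization degree (fraction) and angle (degrees) from Stokes I, Q, U."""
    pd = math.hypot(q, u) / i                    # PD = sqrt(Q² + U²) / I
    pa = 0.5 * math.degrees(math.atan2(u, q))    # PA = ½ · atan2(U, Q)
    return pd, pa

# Illustrative normalized Stokes values, not the measured Cyg X-1 ones.
pd, pa = polarization(1.0, 0.012, -0.016)
print(f"PD = {100 * pd:.2f}%, PA = {pa:.1f} deg")
```

The factor of ½ in the angle reflects that polarization is a pseudo-vector: PA is defined modulo 180°, not 360°.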
Juana Sanchez
Luke Durell, J. Thad Scott, Amanda S. Hering
Li X, Jiang H, Liang X
Xuefeng Li,1,2 Hui Jiang,1 Xiaoyu Liang1
1School of Engineering Science, University of Chinese Academy of Sciences, Beijing, 100049, People's Republic of China; 2School of Public Policy and Management, University of Chinese Academy of Sciences, Beijing, 100049, People's Republic of China
Correspondence: Hui Jiang, School of Engineering Science, University of Chinese Academy of Sciences, 19 Yuquan Road, Shijingshan District, Beijing, People's Republic of China, Email huijiang@ucas.ac.cn
Purpose: Based on the Chinese context, this study uses the severe acute respiratory syndrome (SARS) and coronavirus disease 2019 (COVID-19) outbreaks as examples to identify the risk factors that lead to outbreaks of major emerging infectious diseases, and puts forward risk governance strategies to improve China's biosecurity risk prevention and control capabilities.
Material and Methods: This study combines grounded theory and WSR methodology, and uses the NVivo 12.0 qualitative analysis software to identify the risk factors that led to the outbreak of major emerging infectious diseases. The research data were sourced from 168 publicly available official documents, which are highly authoritative and reliable.
Results: This study identified 10 categories of physical (Wuli) risk factors, 6 categories of logical (Shili) risk factors, and 8 categories of human (Renli) risk factors that contributed to the outbreak of major emerging infectious diseases. These risk factors were distributed across the early stages of the outbreak and have different mechanisms of action at the macro and micro levels.
Conclusion: This study identified the risk factors that lead to outbreaks of major emerging infectious diseases, and uncovered the mechanism of the outbreak at the macro and micro levels. At the macro level, Wuli risk factors are the front-end antecedents that lead to the outbreak of the crisis, Renli factors are the intermediate regulatory factors, and Shili risk factors are the back-end posterior factors. 
At the micro level, there are risk coupling, risk superposition, and risk resonance interactions among the various risk factors, leading to the outbreak of the crisis. Based on these interactive relationships, this study proposes risk governance strategies that can help policymakers deal with similar crises in the future.
Keywords: major emerging infectious diseases, risk factors, SARS, COVID-19, grounded theory, WSR methodology
Arisa Putri Saubari, I Gde Wawan Sudatha
The lack of creative and interactive learning media contributes to students' low learning outcomes. This study aims to develop interactive learning multimedia based on problem-based learning models for science content in class V. This is development research following the ADDIE model. The subjects of this study were one subject matter expert, one instructional design expert, one instructional media expert, three students for the individual test, and nine students for the small group test. Data were collected using the questionnaire method and the test method; the instrument used was a questionnaire. The data analysis techniques are quantitative descriptive analysis and inferential statistical analysis (t-test). The product validity tests showed that the subject matter expert test obtained a percentage of 98.6% (very good), the learning design expert 92% (very good), and the learning media expert 95% (very good). Individual trials obtained a percentage of 91.3% (very good), and small group trials 90% (very good). In addition, interactive multimedia based on problem-based learning in science content showed an effect on student learning outcomes. It was concluded that the multimedia could significantly increase the learning outcomes of fifth-grade elementary school students.
A. C. Davison
Ruichen Rong, Yonglong Wei, Lin Li et al.
Summary: Background: Tissues such as the liver lobule, kidney nephron, and intestinal gland exhibit intricate patterns of zonated gene expression corresponding to distinct cell types and functions. To quantitatively understand zonation, it is important to measure cellular or genetic features as a function of position along a zonal axis. While it is possible to manually count, characterize, and locate features in relation to the zonal axis, it is labor-intensive and difficult to do manually while maintaining precision and accuracy. Methods: We addressed this challenge by developing a deep-learning-based quantification method called the “Tissue Positioning System” (TPS), which can automatically analyze zonation in the liver lobule as a model system. Findings: By using algorithms that identified vessels, classified vessels, and segmented zones based on the relative position along the portal vein to central vein axis, TPS was able to spatially quantify gene expression in mice with zone specific reporters. Interpretation: TPS could discern expression differences between zonal reporter strains, ages, and disease states. TPS could also reveal the zonal distribution of cells previously thought to be positioned randomly. The design principles of TPS could be generalized to other tissues to explore the biology of zonation. Funding: CPRIT (RP190208, RP220614, RP230330) and NIH (P30CA142543, R01AA028791, R01CA251928, R01DK1253961, R01GM140012, 1R01GM141519, 1R01DE030656, 1U01CA249245). The Pollack Foundation, Simmons Comprehensive Cancer Center Cancer & Obesity Translational Pilot Award, and the Emerging Leader Award from the Mark Foundation For Cancer Research (#21-003-ELA).
Archie Campbell, Robin Flaig, Cathie Sudlow
Objectives We started a family-based genetic epidemiology study in 2006-11 which recruited ~24,000 adult volunteers from ~7000 families across Scotland, with consent for follow-up through medical record linkage and re-contact. In 2022-23 we are recruiting another 20,000, with consent extended to administrative records and the age range now extended to 12+. Methods Original volunteers completed a demographic, health and lifestyle questionnaire, provided biological samples, and underwent detailed clinical assessment. The samples, phenotype and genotype data form a resource for research on the genetics of conditions of public health importance. This has become a longitudinal dataset by linkage to routine NHS hospital, maternity, lab test, prescriptions, dentistry, mortality, imaging, cancer screening, GP data records, Covid-19 testing and vaccinations, as well as follow-up questionnaires. The new wave of recruitment is all online and can be done on a smartphone, with DNA from saliva collected by post. Teenagers aged 12-15 can join with parental consent. Results GWAS has been done on quantitative traits and biomarkers, with DNA methylation data and proteomics available for most of the cohort. Our “CovidLife” surveys collected data on effects of the pandemic. Researchers can find prevalent and incident disease cases and controls, to test research hypotheses on a stratified population. They can also do targeted recruitment of participants to new studies, including recall by genotype. We have established and validated E-HR linkage with the NHS Scotland CHI Register, overcoming technical and governance issues in the process. We contribute to major international consortia, with collaborators from institutions worldwide, both academic and commercial. Recruits are asked to give consent to linkage to other administrative data, and reuse of samples from routine NHS tests for medical research. 
Conclusion We plan to extend the linkage process to include other administrative data from national datasets as and when approvals are obtained. New types of data can also be collected by online questionnaires. The Research Tissue Bank resources are available to academic and commercial researchers through a managed access process.
Guangquan Xu, Wenyu Dong, Jun Xing et al.
Cryptojacking is a type of resource embezzlement attack in which an attacker secretly executes a cryptocurrency mining program on the target host to gain profits. It has been common since 2017 and, for a time, was the greatest threat to network security. To better demonstrate the attack capability of, and the harm caused by, cryptojacking, this paper proposes a new covert browser-based mining attack model named Delay-CJ, which was deployed in a simulation environment for evaluation. Building on the general framework of cryptojacking, Delay-CJ adds hybrid evasion detection techniques and, in the prototype implementation, applies a delayed execution strategy specifically for video websites. The results show that the existing detection methods used for testing may become invalid as a result of this model. In view of this, to achieve a more general and robust detection scheme, we built a cryptojacking detection system named CJDetector, which is based on cryptojacking process features. Specifically, it identifies malicious mining by monitoring CPU usage and analyzing function call information. This system not only effectively detects the attack in our example but also has universal applicability. The recognition accuracy of CJDetector reaches 99.33%. Finally, we tested web pages from the Alexa top 50K websites to investigate cryptojacking activity on the real network. We found that although cryptojacking is indeed on the decline, it remains a network security threat that cannot be ignored.
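The CPU-usage signal that CJDetector monitors can be illustrated with a small stdlib-only sketch. This is our simplification, using simulated per-second samples rather than the paper's process-level instrumentation: a persistent miner keeps CPU usage high for a sustained window, whereas benign pages spike and fall back.

```python
from collections import deque

def sustained_high_cpu(samples, threshold=80.0, window=5):
    """True if CPU usage stays above `threshold` for `window` consecutive samples."""
    recent = deque(maxlen=window)
    for pct in samples:
        recent.append(pct)
        if len(recent) == window and min(recent) > threshold:
            return True
    return False

# Simulated per-second CPU percentages: a tab with benign spikes vs. a miner.
benign = [10, 95, 20, 15, 90, 12, 8, 25, 11, 9]
miner  = [85, 92, 97, 99, 96, 98, 95, 97, 99, 96]
print(sustained_high_cpu(benign), sustained_high_cpu(miner))  # False True
```

A threshold-and-window rule alone is easy to evade (e.g., by the delayed or throttled execution Delay-CJ uses), which is why the paper pairs the CPU signal with function call analysis.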
Page 5 of 2,234,009