Results for "data science"

S2 Open Access 2022

Galactica: A Large Language Model for Science

Ross Taylor, Marcin Kardas, Guillem Cucurull et al.

Information overload is a major obstacle to scientific progress. The explosive growth in scientific literature and data has made it ever harder to discover useful insights in a large mass of information. Today scientific knowledge is accessed through search engines, but they are unable to organize scientific knowledge alone. In this paper we introduce Galactica: a large language model that can store, combine and reason about scientific knowledge. We train on a large scientific corpus of papers, reference material, knowledge bases and many other sources. We outperform existing models on a range of scientific tasks. On technical knowledge probes such as LaTeX equations, Galactica outperforms the latest GPT-3 by 68.2% versus 49.0%. Galactica also performs well on reasoning, outperforming Chinchilla on mathematical MMLU by 41.3% to 35.7%, and PaLM 540B on MATH with a score of 20.4% versus 8.8%. It also sets a new state-of-the-art on downstream tasks such as PubMedQA and MedMCQA dev of 77.6% and 52.9%. And despite not being trained on a general corpus, Galactica outperforms BLOOM and OPT-175B on BIG-bench. We believe these results demonstrate the potential for language models as a new interface for science. We open source the model for the benefit of the scientific community.

989 citations en Computer Science, Mathematics

S2 Open Access 2021

Tabular Data: Deep Learning is Not All You Need

Ravid Shwartz-Ziv, Amitai Armon

A key element in solving real-life data science problems is selecting the types of models to use. Tree ensemble models (such as XGBoost) are usually recommended for classification and regression problems with tabular data. However, several deep learning models for tabular data have recently been proposed, claiming to outperform XGBoost for some use cases. This paper explores whether these deep models should be a recommended option for tabular data by rigorously comparing the new deep models to XGBoost on various datasets. In addition to systematically comparing their performance, we consider the tuning and computation they require. Our study shows that XGBoost outperforms these deep models across the datasets, including the datasets used in the papers that proposed the deep models. We also demonstrate that XGBoost requires much less tuning. On the positive side, we show that an ensemble of deep models and XGBoost performs better on these datasets than XGBoost alone.

1873 citations en Computer Science

S2 Open Access 2022

AVONET: morphological, ecological and geographical data for all birds.

J. Tobias, C. Sheard, A. Pigot et al.

Functional traits offer a rich quantitative framework for developing and testing theories in evolutionary biology, ecology and ecosystem science. However, the potential of functional traits to drive theoretical advances and refine models of global change can only be fully realised when species-level information is complete. Here we present the AVONET dataset containing comprehensive functional trait data for all birds, including six ecological variables, 11 continuous morphological traits, and information on range size and location. Raw morphological measurements are presented from 90,020 individuals of 11,009 extant bird species sampled from 181 countries. These data are also summarised as species averages in three taxonomic formats, allowing integration with a global phylogeny, geographical range maps, IUCN Red List data and the eBird citizen science database. The AVONET dataset provides the most detailed picture of continuous trait variation for any major radiation of organisms, offering a global template for testing hypotheses and exploring the evolutionary origins, structure and functioning of biodiversity.

836 citations en Medicine

S2 Open Access 2015

The journal coverage of Web of Science and Scopus: a comparative analysis

P. Mongeon, Adèle Paul-Hus

Bibliometric methods are used in multiple fields for a variety of purposes, notably for research evaluation. Most bibliometric analyses have in common their data sources: Thomson Reuters’ Web of Science (WoS) and Elsevier’s Scopus. The objective of this research is to describe the journal coverage of those two databases and to assess whether certain fields, publishing countries and languages are over- or underrepresented. To do this we compared the coverage of active scholarly journals in WoS (13,605 journals) and Scopus (20,346 journals) with Ulrich’s extensive periodical directory (63,013 journals). Results indicate that the use of either WoS or Scopus for research evaluation may introduce biases that favor Natural Sciences and Engineering as well as Biomedical Research to the detriment of Social Sciences and Arts and Humanities. Similarly, English-language journals are overrepresented to the detriment of other languages. While both databases share these biases, their coverage differs substantially. As a consequence, the results of bibliometric analyses may vary depending on the database used. These results imply that in the context of comparative research evaluation, WoS and Scopus should be used with caution, especially when comparing different fields, institutions, countries or languages. The bibliometric community should continue its efforts to develop methods and indicators that include scientific output that is not covered in WoS or Scopus, such as field-specific and national citation indexes.

4003 citations en Computer Science, Political Science

S2 Open Access 2010

Creating the CIPRES Science Gateway for inference of large phylogenetic trees

Mark A. Miller, W. Pfeiffer, Terri Schwartz

10747 citations en

S2 Open Access 2010

Bayesian data analysis.

J. Kruschke

11320 citations en Computer Science, Medicine

S2 Open Access 2003

AIRS/AMSU/HSB on the Aqua mission: design, science objectives, data products, and processing systems

H. Aumann, M. Chahine, C. Gautier et al.

1534 citations en Environmental Science, Computer Science

S2 Open Access 2018

NOMAD: The FAIR concept for big data-driven materials science

C. Draxl, M. Scheffler

Data are a crucial raw material of this century. The amount of data that have been created in materials science thus far and that continues to be created every day is immense. Without a proper infrastructure that allows for collecting and sharing data, the envisioned success of big data-driven materials science will be hampered. For the field of computational materials science, the NOMAD (Novel Materials Discovery) Center of Excellence (CoE) has changed the scientific culture toward comprehensive and findable, accessible, interoperable, and reusable (FAIR) data, opening new avenues for mining materials science big data. Novel data-analytics concepts and tools turn data into knowledge and help in the prediction of new materials and in the identification of new properties of already known materials.

376 citations en Physics, Materials Science

S2 Open Access 2019

Engagement in science through citizen science: Moving beyond data collection

T. Phillips, H. Ballard, B. Lewenstein et al.

To date, most studies of citizen science engagement focus on quantifiable measures related to the contribution of data or other output measures. Few studies have attempted to qualitatively characterize citizen science engagement across multiple projects and from the perspective of the participants. Building on pertinent literature and sociocultural learning theories, this study operationalizes engagement in citizen science through an analysis of interviews of 72 participants from six different environmentally based projects. We document engagement in citizen science through an examination of cognitive, affective, social, behavioral, and motivational dimensions. We assert that engagement in citizen science is enhanced by acknowledging these multiple dimensions and creating opportunities for volunteers to find personal relevance in their work with scientists. A Dimensions of Engagement framework is presented that can facilitate the innovation of new questions and methodologies for studying engagement in citizen science and other forms of informal science education.

260 citations en Psychology

S2 Open Access 2019

Data-driven modeling and learning in science and engineering

F. Montáns, F. Chinesta, Rafael Gómez-Bombarelli et al.

In the past, the data on which science and engineering are based were scarce and frequently obtained by experiments proposed to verify a given hypothesis. Each experiment was able to yield only very limited data. Today, data is abundant and abundantly collected in each single experiment at a very small cost. Data-driven modeling and scientific discovery represent a change of paradigm in how many problems, both in science and engineering, are addressed. Some scientific fields have been using artificial intelligence for some time due to the inherent difficulty in obtaining laws and equations to describe some phenomena. However, today data-driven approaches are also flooding fields like mechanics and materials science, where the traditional approach seemed to be highly satisfactory. In this paper we review the application of data-driven modeling and model learning procedures to different fields in science and engineering.

257 citations en Computer Science

CrossRef Open Access 2026

Detecting unique wind field features in hurricane Sandy from topological data maps

Justin Hoffmeier

This study investigates the use of topological data maps for extracting unique tropical cyclone (TC) wind field features. These maps are presented as graphs generated through a sequence of steps that filter, cluster, and identify data structure, and are used to characterize topological properties and shape in the data. The objective and scope of the method are explored through application to wind fields from the HURDAT2 data set, and its viability for detecting anomalous behavior in TCs is considered. We refer to the resulting graphs as wind field connectivity signatures (WFCS) or collective wind field connectivity map (CWFCM), depending on the data set. Our focus is Hurricane Sandy, where the method successfully identifies a complete 360-degree rotation of the high wind speed radii. This cyclical example of phase rotation of wind speed asymmetries corresponds to a distinct structural property of the graph. These methods have not been previously applied to wind field data and have only seen limited use in atmospheric sciences.

en

S2 Open Access 2020

Utilization of text mining as a big data analysis tool for food science and nutrition.

Dandan Tao, Pengkun Yang, H. Feng

Big data analysis has found applications in many industries due to its ability to turn huge amounts of data into insights for informed business and operational decisions. Advanced data mining techniques have been applied in many sectors of supply chains in the food industry. However, the previous work has mainly focused on the analysis of instrument-generated data such as those from hyperspectral imaging, spectroscopy, and biometric receptors. The importance of digital text data in the food and nutrition has only recently gained attention due to advancements in big data analytics. The purpose of this review is to provide an overview of the data sources, computational methods, and applications of text data in the food industry. Text mining techniques such as word-level analysis (e.g., frequency analysis), word association analysis (e.g., network analysis), and advanced techniques (e.g., text classification, text clustering, topic modeling, information retrieval, and sentiment analysis) will be discussed. Applications of text data analysis will be illustrated with respect to food safety and food fraud surveillance, dietary pattern characterization, consumer-opinion mining, new-product development, food knowledge discovery, food supply-chain management, and online food services. The goal is to provide insights for intelligent decision-making to improve food production, food safety, and human nutrition.

185 citations en Medicine, Computer Science

S2 Open Access 2019

Using Semistructured Surveys to Improve Citizen Science Data for Monitoring Biodiversity

S. Kelling, A. Johnston, A. Bonn et al.

Biodiversity is being lost at an unprecedented rate, and monitoring is crucial for understanding the causal drivers and assessing solutions. Most biodiversity monitoring data are collected by volunteers through citizen science projects, and often crucial information is lacking to account for the inevitable biases that observers introduce during data collection. We contend that citizen science projects intended to support biodiversity monitoring must gather information about the observation process as well as species occurrence. We illustrate this using eBird, a global citizen science project that collects information on bird occurrences as well as vital contextual information on the observation process while maintaining broad participation. Our fundamental argument is that regardless of what species are being monitored, when citizen science projects collect a small set of basic information about how participants make their observations, the scientific value of the data collected will be dramatically improved.

210 citations en Geography, Medicine

DOAJ Open Access 2025

Achieving inclusive healthcare through integrating education and research with AI and personalized curricula

Amir Bahmani, Kexin Cha, Arash Alavi et al.

Background: Precision medicine promises significant health benefits but faces challenges such as complex data management and analytics, interdisciplinary collaboration, and education of researchers, healthcare professionals, and participants. Addressing these needs requires the integration of computational experts, engineers, designers, and healthcare professionals to develop user-friendly systems and shared terminologies. The widespread adoption of large language models (LLMs) such as Generative Pretrained Transformer (GPT) and Claude highlights the importance of making complex data accessible to non-specialists. Methods: We evaluated the Stanford Data Ocean (SDO) precision medicine training program’s learning outcomes, AI Tutor performance, and learner satisfaction by assessing self-rated competency on key learning objectives through pre- and post-learning surveys, along with formative and summative assessment completion rates. We also analyzed AI Tutor accuracy and learners’ self-reported satisfaction, and post-program academic and career impacts. Additionally, we demonstrated the capabilities of the AI Data Visualization tool. Results: SDO demonstrates the ability to improve learning outcomes for learners from broad educational and socioeconomic backgrounds with the support of the AI Tutor. The AI Data Visualization tool enables learners to interpret multi-omics and wearable data and replicate research findings. Conclusions: SDO strives to mitigate challenges in precision medicine through a scalable, cloud-based platform that supports data management for various data types, advanced research, and personalized learning. SDO provides AI Tutors and AI-powered data visualization tools to enhance educational and research outcomes and make data analysis accessible to users from broad educational backgrounds. By extending engagement and cutting-edge research capabilities globally, SDO particularly benefits economically disadvantaged and historically marginalized communities, fostering interdisciplinary biomedical research and bridging the gap between education and practical application in the biomedical field.

Medicine

DOAJ Open Access 2025

Correction: Validity of smartwatch-derived estimates of lactate threshold heart rate and pace compared to graded exercise testing

Changda Lu, Wei Cui, Zheng Zhu et al.

Physiology

S2 Open Access 2021

SPSS: An Imperative Quantitative Data Analysis Tool for Social Science Research

Arif Rahman, M. Muktadir

The purpose of this paper is to elaborate on the importance of the Statistical Package for the Social Sciences, widely known as SPSS, in the field of social sciences as an effective tool for quantitative data analysis. This paper includes the broader aspects of SPSS from its foundation as a statistical analysis tool to how it has advanced as a major choice for researchers from different fields of study. It is necessary for researchers, especially new researchers, to know the ins and outs of SPSS and why they should use it. This study expressed the arguments from personal experiences of using SPSS and what other SPSS users have described in various pieces of literature. Different works of literature related to SPSS and social sciences have been discussed throughout the paper. The most relevant ones have been talked about. The features and major advantages are portrayed in such a way that anyone can relate the tool with their research works. In the end, it has been evident that SPSS is considered one of the most important and influential statistical tools for quantitative data analysis.

131 citations en

S2 Open Access 2021

Big data and machine learning for materials science

J. F. Rodrigues, L. Florea, Maria C. F. de Oliveira et al.

Herein, we review aspects of leading-edge research and innovation in materials science that exploit big data and machine learning (ML), two computer science concepts that combine to yield computational intelligence. ML can accelerate the solution of intricate chemical problems and even solve problems that otherwise would not be tractable. However, the potential benefits of ML come at the cost of big data production; that is, the algorithms demand large volumes of data of various natures and from different sources, from material properties to sensor data. In the survey, we propose a roadmap for future developments with emphasis on computer-aided discovery of new materials and analysis of chemical sensing compounds, both prominent research fields for ML in the context of materials science. In addition to providing an overview of recent advances, we elaborate upon the conceptual and practical limitations of big data and ML applied to materials science, outlining processes, discussing pitfalls, and reviewing cases of success and failure.

131 citations en Medicine

S2 Open Access 2021

Data Quality in Citizen Science

B. Balázs, P. Mooney, E. Nováková et al.

This chapter discusses the broad and complex topic of data quality in citizen science – a contested arena because different projects and stakeholders aspire to different levels of data accuracy. In this chapter, we consider how we ensure the validity and reliability of data generated by citizen scientists and citizen science projects. We show that this is an essential methodological question that has emerged within a highly contested field in recent years. Data quality means different things to different stakeholders. This is no surprise as quality is always a broad spectrum, and nearly 200 terms are in use to describe it, regardless of the approach. We seek to deliver a high-level overview of the main themes and issues in data quality in citizen science, mechanisms to ensure and improve quality, and some conclusions on best practice and ways forwards. We encourage citizen science projects to share insights on their data practice failures. Finally, we show how data quality assurance gives credibility, reputation, and sustainability to citizen science projects.

113 citations en Political Science

S2 Open Access 2019

The data source of this study is Web of Science Core Collection? Not enough

Weishu Liu

178 citations en Computer Science

S2 Open Access 2022

2022 Review of Data-Driven Plasma Science

R. Anirudh, Rick Archibald, M. Salman Asif et al.

Data-driven science and technology offer transformative tools and methods to science. This review article highlights the latest development and progress in the interdisciplinary field of data-driven plasma science (DDPS), i.e., plasma science whose progress is driven strongly by data and data analyses. Plasma is considered to be the most ubiquitous form of observable matter in the universe. Data associated with plasmas can, therefore, cover extremely large spatial and temporal scales, and often provide essential information for other scientific disciplines. Thanks to the latest technological developments, plasma experiments, observations, and computation now produce a large amount of data that can no longer be analyzed or interpreted manually. This trend now necessitates a highly sophisticated use of high-performance computers for data analyses, making artificial intelligence and machine learning vital components of DDPS. This article contains seven primary sections, in addition to the introduction and summary. Following an overview of fundamental data-driven science, five other sections cover widely studied topics of plasma science and technologies, i.e., basic plasma physics and laboratory experiments, magnetic confinement fusion, inertial confinement fusion and high-energy-density physics, space and astronomical plasmas, and plasma technologies for industrial and other applications. The final Section before the summary discusses plasma-related databases that could significantly contribute to DDPS. Each primary Section starts with a brief introduction to the topic, discusses the state-of-the-art developments in the use of data and/or data-scientific approaches, and presents the summary and outlook. Despite the recent impressive signs of progress, the DDPS is still in its infancy. This article attempts to offer a broad perspective on the development of this field and identify where further innovations are required.

70 citations en Physics
