O. Papaspiliopoulos
© 2018, Cambridge University Press. Let us summarize our findings. A random projection of a set T in R^n onto an m-dimensional subspace approximately preserves the geometry of T if m ⪆ d(T). For...
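The claim above (a random projection to m dimensions preserves the geometry of T once m exceeds an effective dimension d(T)) can be checked numerically. A minimal sketch, not taken from the book, using a Gaussian random projection of a finite point set; the sizes n = 1000, m = 200 and the 50 points are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(0)

n, m, num_points = 1000, 200, 50
T = rng.normal(size=(num_points, n))  # a finite set T in R^n

# Random projection matrix, scaled so squared norms are preserved in expectation.
G = rng.normal(size=(m, n)) / np.sqrt(m)
T_proj = T @ G.T  # images of T in R^m

def pairwise_dists(X):
    """All pairwise Euclidean distances between rows of X."""
    diffs = X[:, None, :] - X[None, :, :]
    return np.sqrt((diffs ** 2).sum(axis=-1))

d_orig = pairwise_dists(T)
d_proj = pairwise_dists(T_proj)

# Distortion: ratio of projected to original distances (off-diagonal only).
mask = ~np.eye(num_points, dtype=bool)
ratios = d_proj[mask] / d_orig[mask]
print(ratios.min(), ratios.max())  # ratios concentrate near 1
```

With m well above the effective dimension of this point set, the distance ratios cluster tightly around 1, which is the "approximately preserves the geometry" statement in empirical form.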
Showing 20 of ~44,666,105 results · from DOAJ, CrossRef, Semantic Scholar, arXiv
Evan Bolyen, J. Rideout, Matthew R. Dillon et al.
We present QIIME 2, an open-source microbiome data science platform accessible to users spanning the microbiome research ecosystem, from scientists and engineers to clinicians and policy makers. QIIME 2 provides new features that will drive the next generation of microbiome research. These include interactive spatial and temporal analysis and visualization tools, support for metabolomics and shotgun metagenomics analysis, and automated data provenance tracking to ensure reproducible, transparent microbiome data science.
E. Pebesma, R. Bivand
Yuhang Lai, Chengxi Li, Yiming Wang et al.
We introduce DS-1000, a code generation benchmark with a thousand data science problems spanning seven Python libraries, such as NumPy and Pandas. Compared to prior works, DS-1000 incorporates three core features. First, our problems reflect diverse, realistic, and practical use cases since we collected them from StackOverflow. Second, our automatic evaluation is highly specific (reliable) -- across all Codex-002-predicted solutions that our evaluation accepts, only 1.8% of them are incorrect; we achieve this with multi-criteria metrics, checking both functional correctness by running test cases and surface-form constraints by restricting API usage or keywords. Finally, we proactively defend against memorization by slightly modifying our problems to be different from the original StackOverflow source; consequently, models cannot answer them correctly by memorizing solutions from pre-training. The current best public system (Codex-002) achieves 43.3% accuracy, leaving ample room for improvement. We release our benchmark at https://ds1000-code-gen.github.io.
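The multi-criteria evaluation described above (functional test cases plus surface-form constraints) can be sketched in miniature. This is a hypothetical toy checker, not the actual DS-1000 harness, and it uses a pure-Python problem rather than NumPy/Pandas to stay self-contained; `evaluate`, `total`, and the forbidden-token list are all illustrative names:

```python
def evaluate(solution_code: str, test_code: str, forbidden: list) -> bool:
    """Accept a predicted solution only if it passes both criteria."""
    # Surface-form constraint: reject solutions using forbidden APIs/keywords.
    if any(token in solution_code for token in forbidden):
        return False
    # Functional correctness: execute the solution, then run the test cases.
    namespace = {}
    try:
        exec(solution_code, namespace)
        exec(test_code, namespace)
    except Exception:
        return False
    return True

# Toy problem: "sum a list without calling the built-in sum()".
good = """
def total(xs):
    t = 0
    for v in xs:
        t += v
    return t
"""
bad = "def total(xs): return sum(xs)"
tests = "assert total([1, 2, 3]) == 6"

print(evaluate(good, tests, forbidden=["sum("]))  # True
print(evaluate(bad, tests, forbidden=["sum("]))   # False
```

The second criterion is what the abstract calls a surface-form constraint: a functionally correct solution is still rejected if it reaches for a banned API.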
Iqbal H. Sarker
The digital world has a wealth of data, such as internet of things (IoT) data, business data, health data, mobile data, urban data, security data, and many more, in the current age of the Fourth Industrial Revolution (Industry 4.0 or 4IR). Extracting knowledge or useful insights from these data enables smart decision-making in various application domains. In the area of data science, advanced analytics methods, including machine learning modeling, can provide actionable insights or deeper knowledge about data, making the computing process automatic and smart. In this paper, we present a comprehensive view on “Data Science”, including various types of advanced analytics methods that can be applied to enhance the intelligence and capabilities of an application through smart decision-making in different scenarios. We also discuss and summarize ten potential real-world application domains, including business, healthcare, cybersecurity, and urban and rural data science, taking into account data-driven smart computing and decision-making. Based on this, we finally highlight the challenges and potential research directions within the scope of our study. Overall, this paper aims to serve as a reference point on data science and advanced analytics for researchers, decision-makers, and application developers, particularly from the point of view of data-driven solutions to real-world problems.
Hossein Hassani, E. Silva
ChatGPT, a conversational AI interface that utilizes natural language processing and machine learning algorithms, is taking the world by storm and is the buzzword across many sectors today. Given the likely impact of this model on data science, through this perspective article, we seek to provide an overview of the potential opportunities and challenges associated with using ChatGPT in data science, provide readers with a snapshot of its advantages, and stimulate interest in its use for data science projects. The paper discusses how ChatGPT can assist data scientists in automating various aspects of their workflow, including data cleaning and preprocessing, model training, and result interpretation. It also highlights how ChatGPT has the potential to provide new insights and improve decision-making processes by analyzing unstructured data. We then examine the advantages of ChatGPT’s architecture, including its ability to be fine-tuned for a wide range of language-related tasks and generate synthetic data. Limitations and issues are also addressed, particularly around concerns about bias and plagiarism when using ChatGPT. Overall, the paper concludes that the benefits outweigh the costs and ChatGPT has the potential to greatly enhance the productivity and accuracy of data science workflows and is likely to become an increasingly important tool for intelligence augmentation in the field of data science. ChatGPT can assist with a wide range of natural language processing tasks in data science, including language translation, sentiment analysis, and text classification. However, while ChatGPT can save time and resources compared to training a model from scratch, and can be fine-tuned for specific use cases, it may not perform well on certain tasks if it has not been specifically trained for them. Additionally, the output of ChatGPT may be difficult to interpret, which could pose challenges for decision-making in data science applications.
Fernando Martínez-Plumed, Lidia Contreras-Ochando, C. Ferri et al.
CRISP-DM (CRoss-Industry Standard Process for Data Mining) has its origins in the second half of the nineties and is thus about two decades old. According to many surveys and user polls, it is still the de facto standard for developing data mining and knowledge discovery projects. However, the field has undoubtedly moved on considerably in twenty years, with data science now favoured as the leading term over data mining. In this paper we investigate whether, and in what contexts, CRISP-DM is still fit for purpose for data science projects. We argue that if the project is goal-directed and process-driven, the process model view still largely holds. On the other hand, when data science projects become more exploratory, the paths the project can take become more varied, and a more flexible model is called for. We suggest what the outlines of such a trajectory-based model might look like and how it can be used to categorise data science projects (goal-directed, exploratory or data management). We examine seven real-life exemplars where exploratory activities play an important role and compare them against 51 use cases extracted from the NIST Big Data Public Working Group. We anticipate this categorisation can help project planning in terms of time and cost characteristics.
Reza Iranzad, Xiao Liu
Sirui Hong, Yizhang Lin, Bangbang Liu et al.
Large Language Model (LLM)-based agents have shown effectiveness across many applications. However, their use in data science scenarios requiring solving long-term interconnected tasks, dynamic data adjustments and domain expertise remains challenging. Previous approaches primarily focus on individual tasks, making it difficult to assess the complete data science workflow. Moreover, they struggle to handle real-time changes in intermediate data and fail to adapt dynamically to evolving task dependencies inherent to data science problems. In this paper, we present Data Interpreter, an LLM-based agent designed to automatically solve various data science problems end-to-end. Our Data Interpreter incorporates two key modules: 1) Hierarchical Graph Modeling, which breaks down complex problems into manageable subproblems, enabling dynamic node generation and graph optimization; and 2) Programmable Node Generation, a technique that refines and verifies each subproblem to iteratively improve code generation results and robustness. Extensive experiments consistently demonstrate the superiority of Data Interpreter. On InfiAgent-DABench, it achieves a 25% performance boost, raising accuracy from 75.9% to 94.9%. For machine learning and open-ended tasks, it improves performance from 88% to 95%, and from 60% to 97%, respectively. Moreover, on the MATH dataset, Data Interpreter achieves remarkable performance with a 26% improvement compared to state-of-the-art baselines. The code is available at https://github.com/geekan/MetaGPT.
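The hierarchical graph modeling idea described above (decompose a problem into subproblems, then generate and run nodes as their dependencies resolve) can be sketched as follows. This is an illustrative skeleton, not the Data Interpreter implementation; the task names and the `run_node` stand-in are invented for the example:

```python
from graphlib import TopologicalSorter

# Each node maps to the subproblems it depends on.
task_graph = {
    "load_data": set(),
    "clean_data": {"load_data"},
    "feature_engineering": {"clean_data"},
    "train_model": {"feature_engineering"},
    "evaluate": {"train_model"},
}

def run_node(name: str, results: dict) -> str:
    # Stand-in for programmable node generation: a real agent would generate,
    # verify, and refine code here. This toy just records that the node ran
    # after all of its dependencies.
    assert all(dep in results for dep in task_graph[name])
    return f"done:{name}"

results = {}
for node in TopologicalSorter(task_graph).static_order():
    results[node] = run_node(node, results)

print(list(results))  # nodes complete in dependency order
```

The point of the graph structure is that intermediate results can change at any node, and only that node's downstream subgraph needs regenerating, which is what makes the approach dynamic rather than a fixed script.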
F. Giorgi, Carmine Ceraolo, D. Mercatelli
The R programming language is approaching its 30th birthday, and in the last three decades it has achieved a prominent role in statistics, bioinformatics, and data science in general. It currently ranks among the top 10 most popular languages worldwide, and its community has produced tens of thousands of extensions and packages, with scopes ranging from machine learning to transcriptome data analysis. In this review, we provide an historical chronicle of how R became what it is today, describing all its current features and capabilities. We also illustrate the major tools of R, such as the current R editors and integrated development environments (IDEs), the R Shiny web server, the R methods for machine learning, and its relationship with other programming languages. We also discuss the role of R in science in general as a driver for reproducibility. Overall, we hope to provide both a complete snapshot of R today and a practical compendium of the major features and applications of this programming language.
Manu Suvarna, J. Pérez‐Ramírez
Liqiang Jing, Zhehui Huang, Xiaoyang Wang et al.
Large Language Models (LLMs) and Large Vision-Language Models (LVLMs) have demonstrated impressive language/vision reasoning abilities, igniting the recent trend of building agents for targeted applications such as shopping assistants or AI software engineers. Recently, many data science benchmarks have been proposed to investigate their performance in the data science domain. However, existing data science benchmarks still fall short when compared to real-world data science applications due to their simplified settings. To bridge this gap, we introduce DSBench, a comprehensive benchmark designed to evaluate data science agents with realistic tasks. This benchmark includes 466 data analysis tasks and 74 data modeling tasks, sourced from Eloquence and Kaggle competitions. DSBench offers a realistic setting by encompassing long contexts, multimodal task backgrounds, reasoning with large data files and multi-table structures, and performing end-to-end data modeling tasks. Our evaluation of state-of-the-art LLMs, LVLMs, and agents shows that they struggle with most tasks, with the best agent solving only 34.12% of data analysis tasks and achieving a 34.74% Relative Performance Gap (RPG). These findings underscore the need for further advancements in developing more practical, intelligent, and autonomous data science agents.
Amanda R. Ellis, E. Slade
Abstract ChatGPT is one of many generative artificial intelligence (AI) tools that has emerged recently, creating controversy in the education community with concerns about its potential to be used for plagiarism and to undermine students’ ability to think independently. Recent publications have criticized the use of ChatGPT and other generative AI tools in the classroom, with little focus on the potential benefits. This article focuses on the potential of ChatGPT as an educational tool for statistics and data science. It encourages readers to consider the history of trepidation surrounding introducing new technology in the classroom, such as the calculator. We explore the possibility of leveraging ChatGPT’s capabilities in statistics and data science education, providing examples of how ChatGPT can aid in developing course materials and suggestions for how educators can prompt students to interact with ChatGPT responsibly. As educators, we can guide the use of generative AI tools in statistics and data science classrooms so that students and educators can leverage the benefits of this technology.
Ziming Li, Qianbo Zang, David Ma et al.
Data science tasks involving tabular data present complex challenges that require sophisticated problem-solving approaches. We propose AutoKaggle, a powerful and user-centric framework that assists data scientists in completing daily data pipelines through a collaborative multi-agent system. AutoKaggle implements an iterative development process that combines code execution, debugging, and comprehensive unit testing to ensure code correctness and logic consistency. The framework offers highly customizable workflows, allowing users to intervene at each phase, thus integrating automated intelligence with human expertise. Our universal data science toolkit, comprising validated functions for data cleaning, feature engineering, and modeling, forms the foundation of this solution, enhancing productivity by streamlining common tasks. We selected 8 Kaggle competitions to simulate data processing workflows in real-world application scenarios. Evaluation results demonstrate that AutoKaggle achieves a validation submission rate of 0.85 and a comprehensive score of 0.82 in typical data science pipelines, fully proving its effectiveness and practicality in handling complex data science tasks.
Yuge Zhang, Qiyang Jiang, Xingyu Han et al.
In the era of data-driven decision-making, the complexity of data analysis necessitates advanced expertise and tools of data science, presenting significant challenges even for specialists. Large Language Models (LLMs) have emerged as promising aids as data science agents, assisting humans in data analysis and processing. Yet their practical efficacy remains constrained by the varied demands of real-world applications and complicated analytical processes. In this paper, we introduce DSEval -- a novel evaluation paradigm, as well as a series of innovative benchmarks tailored for assessing the performance of these agents throughout the entire data science lifecycle. Incorporating a novel bootstrapped annotation method, we streamline dataset preparation, improve evaluation coverage, and expand benchmarking comprehensiveness. Our findings uncover prevalent obstacles and provide critical insights to inform future advancements in the field.
Zaid Ameen Abduljabbar, Vincent Omollo Nyangaresi, Ahmed Ali Ahmed et al.
Abstract Vehicular Ad-Hoc Networks (VANETs) have facilitated the massive exchange of real-time traffic and weather conditions, which has helped prevent collisions, reduce accidents, and ease road congestion. This can effectively enhance driving safety and efficiency in technology-driven transportation systems. However, the transmission of massive and sensitive information across public wireless communication channels exposes the transmitted data to a myriad of privacy and security threats. Although past research has developed many VANET security preservation schemes, several of them are inefficient or susceptible to attacks. This work introduces an approach that leverages reverse fuzzy extraction, bilinear pairing, and Physically Unclonable Functions (PUFs) to design an efficient and anonymity-preserving authentication scheme. We conduct an elaborate formal security analysis to demonstrate that the derived session key is secure. The semantic security analyses also demonstrate its resilience against typical VANET attacks such as impersonation, denial of service, and de-synchronization, instilling confidence in its effectiveness. Moreover, our approach incurs the lowest computational overheads at relatively low communication costs. Specifically, our protocol attains a 66.696% reduction in computation costs and a 70% increase in supported security functionalities.
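The PUF-based challenge-response idea mentioned above can be sketched in software. This is a heavily simplified illustration, not the paper's scheme (which additionally uses reverse fuzzy extraction and bilinear pairing): a keyed HMAC stands in for the physical function, since a real PUF derives its response from device-unique hardware variation rather than a stored key:

```python
import hashlib
import hmac
import secrets

class SimulatedPUF:
    """Software stand-in for a hardware PUF (illustrative only)."""

    def __init__(self) -> None:
        # Models the device-unique silicon variation of a real PUF.
        self._device_secret = secrets.token_bytes(32)

    def response(self, challenge: bytes) -> bytes:
        return hmac.new(self._device_secret, challenge, hashlib.sha256).digest()

# Enrollment: the verifier stores challenge-response pairs for the device.
device = SimulatedPUF()
challenge = secrets.token_bytes(16)
enrolled_response = device.response(challenge)

# Authentication: only the genuine device can reproduce the response.
claimed = device.response(challenge)
print(hmac.compare_digest(claimed, enrolled_response))  # True
```

A different device produces a different response to the same challenge, which is the property the authentication scheme builds on; `compare_digest` is used so the comparison is constant-time.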
Roger E. Cohen, Kristen B. W. McQuinn, Alessandro Savino et al.
Radial stellar population gradients within dwarf galaxies provide a promising avenue for disentangling the drivers of galaxy evolution, including environment. Within the Local Volume, radial stellar age gradient slopes correlate with interaction history, contrary to model predictions, so dwarfs that are isolated provide a critical control sample. We measure radial stellar age gradients in the relatively isolated gas-rich dwarf irregular Wolf–Lundmark–Melotte galaxy (WLM), combining JWST NIRCam and NIRISS imaging with six archival Hubble Space Telescope fields over semimajor-axis equivalent distances of 0 ≲ R_SMA ≲ 4 kpc (≲ 3 R_hl). Fitting lifetime star formation histories to resolved color–magnitude diagrams, radial age gradients are quantified using τ_90 and τ_50, the lookback times to form 90% and 50% of the cumulative stellar mass. We find that globally, the outskirts of WLM are older on average, with (δτ_90, δτ_50)/δR_SMA = (0.82 ± 0.10, 1.60 +0.23/−0.22) Gyr kpc^−1 (stat.), in good agreement with simulations. However, we also detect an azimuthal dependence of radial stellar age gradients, finding that stars on the leading edge of WLM (relative to its proper motion) are both younger and have a flatter age gradient compared to the trailing edge. This difference persists over 0.6 ≲ R_SMA ≲ 3.2 kpc (~0.5–2.5 R_hl) and lookback times up to ~8 Gyr, and is robust to the assumed stellar evolutionary model. Our results are consistent with star formation triggered by ram pressure stripping from a circumgalactic and/or intergalactic medium, suggested by recent H I observations. If confirmed, processes typifying dense environments, such as ram pressure stripping, may be more relevant to the evolution of isolated galaxies than previously thought.
Longfei Cui, Xinyu Niu, Haizhong Qian et al.
The extraction of shape features from vector elements is essential in cartography and geographic information science, supporting a range of intelligent processing tasks. Traditional methods rely on different machine learning algorithms tailored to specific types of line and polygon elements, limiting their general applicability. This study introduces a novel approach called “Pre-Trained Shape Feature Representations from Transformers (PSRT)”, which utilizes transformer encoders designed with three self-supervised pre-training tasks: coordinate masking prediction, coordinate offset correction, and coordinate sequence rearrangement. This approach enables the extraction of general shape features applicable to both line and polygon elements, generating high-dimensional embedded feature vectors. These vectors facilitate downstream tasks like shape classification, pattern recognition, and cartographic generalization. Our experimental results show that PSRT can extract vector shape features effectively without needing labeled samples and is adaptable to various types of vector features. Compared to the methods without pre-training, PSRT enhances training efficiency by over five times and improves accuracy by 5–10% in tasks such as line element matching and polygon shape classification. This innovative approach offers a more unified, efficient solution for processing vector shape data across different applications.
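The coordinate masking prediction pre-training task mentioned above can be illustrated by its input/target construction. A minimal sketch, not the PSRT implementation; the masking ratio and the polyline are arbitrary invented values:

```python
import random

MASK = (None, None)  # placeholder standing in for a masked coordinate token

def mask_coordinates(polyline, mask_ratio=0.3, seed=0):
    """Hide a fraction of vertices; the model must predict the hidden ones."""
    rng = random.Random(seed)
    n_mask = max(1, int(len(polyline) * mask_ratio))
    masked_idx = set(rng.sample(range(len(polyline)), n_mask))
    inputs = [MASK if i in masked_idx else pt for i, pt in enumerate(polyline)]
    targets = {i: polyline[i] for i in masked_idx}  # supervision signal
    return inputs, targets

line = [(0.0, 0.0), (1.0, 0.5), (2.0, 1.5), (3.0, 1.0), (4.0, 2.0)]
inputs, targets = mask_coordinates(line)
print(sum(pt is MASK for pt in inputs), "vertices masked")
```

As with masked token prediction in language models, no labels are needed: the hidden coordinates themselves are the training targets, which is what makes the pre-training self-supervised.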
Biprateep Dey, David Zhao, Brett H Andrews et al.
Key science questions, such as galaxy distance estimation and weather forecasting, often require knowing the full predictive distribution of a target variable Y given complex inputs X. Despite recent advances in machine learning and physics-based models, it remains challenging to assess whether an initial model is calibrated for all x, and when needed, to reshape the densities of y toward ‘instance-wise’ calibration. This paper introduces the local amortized diagnostics and reshaping of conditional densities (LADaR) framework and proposes a new computationally efficient algorithm (Cal-PIT) that produces interpretable local diagnostics and provides a mechanism for adjusting conditional density estimates (CDEs). Cal-PIT learns a single interpretable local probability–probability map from calibration data that identifies where and how the initial model is miscalibrated across feature space, which can be used to morph CDEs such that they are well-calibrated. We illustrate the LADaR framework on synthetic examples, including probabilistic forecasting from image sequences, akin to predicting storm wind speed from satellite imagery. Our main science application involves estimating the probability density functions of galaxy distances given photometric data, where Cal-PIT achieves better instance-wise calibration than all 11 other literature methods in a benchmark data challenge, demonstrating its utility for next-generation cosmological analyses.
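The diagnostic underlying Cal-PIT rests on the probability integral transform (PIT): if a conditional density estimate is well calibrated, the values F(y | x) are uniform on (0, 1), and deviations from uniformity reveal miscalibration. A minimal sketch of that property, not the Cal-PIT algorithm itself; the Gaussian toy model is an invented example:

```python
import math
import random

def normal_cdf(y, mu, sigma):
    """CDF of Normal(mu, sigma) via the error function."""
    return 0.5 * (1.0 + math.erf((y - mu) / (sigma * math.sqrt(2.0))))

rng = random.Random(0)
# True model: Y | X = x ~ Normal(x, 1). A calibrated CDE uses the same law.
xs = [rng.uniform(-2, 2) for _ in range(5000)]
ys = [rng.gauss(x, 1.0) for x in xs]

pit_good = [normal_cdf(y, x, 1.0) for x, y in zip(xs, ys)]  # calibrated CDE
pit_bad = [normal_cdf(y, x, 2.0) for x, y in zip(xs, ys)]   # overdispersed CDE

# Under calibration, PIT values are ~Uniform(0, 1): mean 1/2, variance 1/12.
print(sum(pit_good) / len(pit_good))                           # near 0.5
print(sum((p - 0.5) ** 2 for p in pit_good) / len(pit_good))   # near 1/12
```

The overdispersed estimate produces PIT values bunched around 0.5 (variance well below 1/12), the signature of predictive densities that are too wide; Cal-PIT's probability-probability map learns such deviations locally in x rather than only in aggregate.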