{"results":[{"id":"ss_fa5853fdef7d2f6bb68203d187ddacbbddc63a8b","title":"High-Dimensional Probability: An Introduction with Applications in Data Science","authors":[{"name":"O. Papaspiliopoulos"}],"abstract":"© 2018, Cambridge University Press Let us summarize our findings. A random projection of a set T in R n onto an m-dimensional subspace approximately preserves the geometry of T if m ⪆ d ( T ) . For...","source":"Semantic Scholar","year":2020,"language":"en","subjects":["Physics"],"doi":"10.1080/14697688.2020.1813475","url":"https://www.semanticscholar.org/paper/fa5853fdef7d2f6bb68203d187ddacbbddc63a8b","is_open_access":true,"citations":3747,"published_at":"","score":94},{"id":"ss_133bcd7488a3c07cb0f493a87564c30e5433768c","title":"Reproducible, interactive, scalable and extensible microbiome data science using QIIME 2","authors":[{"name":"Evan Bolyen"},{"name":"J. Rideout"},{"name":"Matthew R. Dillon"},{"name":"Nicholas A. Bokulich"},{"name":"C. Abnet"},{"name":"Gabriel A. Al-Ghalith"},{"name":"Harriet Alexander"},{"name":"E. Alm"},{"name":"Manimozhiyan Arumugam"},{"name":"F. Asnicar"},{"name":"Yang Bai"},{"name":"J. Bisanz"},{"name":"K. Bittinger"},{"name":"A. Brejnrod"},{"name":"Colin J. Brislawn"},{"name":"C. Brown"},{"name":"B. Callahan"},{"name":"A. Caraballo-Rodríguez"},{"name":"John Chase"},{"name":"Emily K. Cope"},{"name":"R. D. da Silva"},{"name":"C. Diener"},{"name":"P. Dorrestein"},{"name":"G. Douglas"},{"name":"D. Durall"},{"name":"C. Duvallet"},{"name":"C. Edwardson"},{"name":"Madeleine Ernst"},{"name":"M. Estaki"},{"name":"Jennifer T. Fouquier"},{"name":"J. Gauglitz"},{"name":"S. Gibbons"},{"name":"D. Gibson"},{"name":"Antonio Gonzalez"},{"name":"Kestrel Gorlick"},{"name":"Jiarong Guo"},{"name":"Benjamin M Hillmann"},{"name":"S. Holmes"},{"name":"H. Holste"},{"name":"C. Huttenhower"},{"name":"G. Huttley"},{"name":"Stefan Janssen"},{"name":"A. Jarmusch"},{"name":"Lingjing Jiang"},{"name":"Benjamin D. Kaehler"},{"name":"K. 
Kang"},{"name":"Christopher R. Keefe"},{"name":"P. Keim"},{"name":"S. Kelley"},{"name":"D. Knights"},{"name":"I. Koester"},{"name":"T. Kosciólek"},{"name":"Jorden Kreps"},{"name":"M. Langille"},{"name":"Joslynn S. Lee"},{"name":"R. Ley"},{"name":"Yong-xin Liu"},{"name":"E. Loftfield"},{"name":"C. Lozupone"},{"name":"Massoud Maher"},{"name":"C. Marotz"},{"name":"Bryan D. Martin"},{"name":"Daniel McDonald"},{"name":"L. McIver"},{"name":"A. Melnik"},{"name":"J. Metcalf"},{"name":"S. C. Morgan"},{"name":"James T. Morton"},{"name":"Ahmad Turan Naimey"},{"name":"Jose A Navas-Molina"},{"name":"Louis-Félix Nothias"},{"name":"Stephanie B. Orchanian"},{"name":"Talima R. Pearson"},{"name":"Samuel L. Peoples"},{"name":"D. Petráš"},{"name":"M. Preuss"},{"name":"Elmar Pruesse"},{"name":"Lasse Buur Rasmussen"},{"name":"A. Rivers"},{"name":"M. Robeson"},{"name":"P. Rosenthal"},{"name":"N. Segata"},{"name":"Michael Shaffer"},{"name":"A. Shiffer"},{"name":"R. Sinha"},{"name":"Se Jin Song"},{"name":"J. Spear"},{"name":"Austin D. Swafford"},{"name":"Luke R. Thompson"},{"name":"P. Torres"},{"name":"Pauline Trinh"},{"name":"A. Tripathi"},{"name":"P. Turnbaugh"},{"name":"Sabah Ul-Hasan"},{"name":"J. V. D. van der Hooft"},{"name":"Fernando Vargas"},{"name":"Y. Vázquez-Baeza"},{"name":"E. Vogtmann"},{"name":"Max von Hippel"},{"name":"William Walters"},{"name":"Yunhu Wan"},{"name":"Mingxun Wang"},{"name":"Jonathan Warren"},{"name":"Kyle C. Weber"},{"name":"C. H. Williamson"},{"name":"A. Willis"},{"name":"Z. Xu"},{"name":"Jesse R. Zaneveld"},{"name":"Yilong Zhang"},{"name":"Qiyun Zhu"},{"name":"R. Knight"},{"name":"J. 
Caporaso"}],"abstract":"","source":"Semantic Scholar","year":2019,"language":"en","subjects":["Engineering","Medicine"],"doi":"10.1038/s41587-019-0209-9","url":"https://www.semanticscholar.org/paper/133bcd7488a3c07cb0f493a87564c30e5433768c","pdf_url":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7015180","is_open_access":true,"citations":17921,"published_at":"","score":93},{"id":"ss_1ba044d3d501dddd94b479aa9dbe55a93bfa9d5f","title":"QIIME 2: Reproducible, interactive, scalable, and extensible microbiome data science","authors":[{"name":"Evan Bolyen"},{"name":"J. Rideout"},{"name":"Matthew R. Dillon"},{"name":"Nicholas A. Bokulich"},{"name":"C. Abnet"},{"name":"Gabriel A. Al-Ghalith"},{"name":"Harriet Alexander"},{"name":"Harriet Alexander"},{"name":"E. Alm"},{"name":"Manimozhiyan Arumugam"},{"name":"F. Asnicar"},{"name":"Yang Bai"},{"name":"J. Bisanz"},{"name":"K. Bittinger"},{"name":"A. Brejnrod"},{"name":"Colin J. Brislawn"},{"name":"C. Brown"},{"name":"B. Callahan"},{"name":"A. Caraballo-Rodríguez"},{"name":"John Chase"},{"name":"Emily K. Cope"},{"name":"R. R. Silva"},{"name":"P. Dorrestein"},{"name":"G. Douglas"},{"name":"D. Durall"},{"name":"C. Duvallet"},{"name":"C. Edwardson"},{"name":"Madeleine Ernst"},{"name":"M. Estaki"},{"name":"Jennifer T. Fouquier"},{"name":"J. Gauglitz"},{"name":"D. Gibson"},{"name":"Antonio Gonzalez"},{"name":"Kestrel Gorlick"},{"name":"Jiarong Guo"},{"name":"Benjamin M Hillmann"},{"name":"S. Holmes"},{"name":"H. Holste"},{"name":"C. Huttenhower"},{"name":"C. Huttenhower"},{"name":"G. Huttley"},{"name":"Stefan Janssen"},{"name":"A. Jarmusch"},{"name":"Lingjing Jiang"},{"name":"Benjamin D. Kaehler"},{"name":"K. Kang"},{"name":"K. Kang"},{"name":"Christopher R. Keefe"},{"name":"P. Keim"},{"name":"S. Kelley"},{"name":"D. Knights"},{"name":"I. Koester"},{"name":"I. Koester"},{"name":"T. Kosciólek"},{"name":"Jorden Kreps"},{"name":"M. Langille"},{"name":"Joslynn S. Lee"},{"name":"R. Ley"},{"name":"R. 
Ley"},{"name":"Yong-xin Liu"},{"name":"E. Loftfield"},{"name":"C. Lozupone"},{"name":"Massoud Maher"},{"name":"C. Marotz"},{"name":"Bryan D. Martin"},{"name":"Daniel McDonald"},{"name":"L. McIver"},{"name":"L. McIver"},{"name":"A. Melnik"},{"name":"J. Metcalf"},{"name":"S. C. Morgan"},{"name":"James T. Morton"},{"name":"Ahmad Turan Naimey"},{"name":"Jose A Navas-Molina"},{"name":"Jose A Navas-Molina"},{"name":"Louis-Félix Nothias"},{"name":"Stephanie B. Orchanian"},{"name":"Talima R. Pearson"},{"name":"Samuel L. Peoples"},{"name":"Samuel L. Peoples"},{"name":"D. Petráš"},{"name":"M. Preuss"},{"name":"Elmar Pruesse"},{"name":"Lasse Buur Rasmussen"},{"name":"A. Rivers"},{"name":"Ii Michael S Robeson"},{"name":"P. Rosenthal"},{"name":"N. Segata"},{"name":"Michael Shaffer"},{"name":"A. Shiffer"},{"name":"R. Sinha"},{"name":"Se Jin Song"},{"name":"J. Spear"},{"name":"Austin D. Swafford"},{"name":"Luke R. Thompson"},{"name":"Luke R. Thompson"},{"name":"P. Torres"},{"name":"Pauline Trinh"},{"name":"A. Tripathi"},{"name":"A. Tripathi"},{"name":"P. Turnbaugh"},{"name":"Sabah Ul-Hasan"},{"name":"J. J. Hooft"},{"name":"Fernando Vargas"},{"name":"Y. Vázquez-Baeza"},{"name":"E. Vogtmann"},{"name":"Max von Hippel"},{"name":"William Walters"},{"name":"Yunhu Wan"},{"name":"Mingxun Wang"},{"name":"Jonathan Warren"},{"name":"Kyle C. Weber"},{"name":"Kyle C. Weber"},{"name":"Chase Williamson"},{"name":"A. Willis"},{"name":"Z. Xu"},{"name":"Jesse R. Zaneveld"},{"name":"Yilong Zhang"},{"name":"R. Knight"},{"name":"J. Caporaso"}],"abstract":"We present QIIME 2, an open-source microbiome data science platform accessible to users spanning the microbiome research ecosystem, from scientists and engineers to clinicians and policy makers. QIIME 2 provides new features that will drive the next generation of microbiome research. 
These include interactive spatial and temporal analysis and visualization tools, support for metabolomics and shotgun metagenomics analysis, and automated data provenance tracking to ensure reproducible, transparent microbiome data science.","source":"Semantic Scholar","year":2018,"language":"en","subjects":["Computer Science"],"doi":"10.7287/PEERJ.PREPRINTS.27295V1","url":"https://www.semanticscholar.org/paper/1ba044d3d501dddd94b479aa9dbe55a93bfa9d5f","pdf_url":"https://doi.org/10.7287/peerj.preprints.27295v2","is_open_access":true,"citations":1173,"published_at":"","score":92},{"id":"ss_29ff532501bbcc97404811ba375454e54690a454","title":"Spatial Data Science","authors":[{"name":"E. Pebesma"},{"name":"R. Bivand"}],"abstract":"","source":"Semantic Scholar","year":2023,"language":"en","subjects":null,"doi":"10.1201/9780429459016","url":"https://www.semanticscholar.org/paper/29ff532501bbcc97404811ba375454e54690a454","is_open_access":true,"citations":690,"published_at":"","score":87.7},{"id":"ss_8a4fc5f00cd4aca61e148e46a2125c3a406719f1","title":"DS-1000: A Natural and Reliable Benchmark for Data Science Code Generation","authors":[{"name":"Yuhang Lai"},{"name":"Chengxi Li"},{"name":"Yiming Wang"},{"name":"Tianyi Zhang"},{"name":"Ruiqi Zhong"},{"name":"Luke Zettlemoyer"},{"name":"S. Yih"},{"name":"Daniel Fried"},{"name":"Si-yi Wang"},{"name":"Tao Yu"}],"abstract":"We introduce DS-1000, a code generation benchmark with a thousand data science problems spanning seven Python libraries, such as NumPy and Pandas. Compared to prior works, DS-1000 incorporates three core features. First, our problems reflect diverse, realistic, and practical use cases since we collected them from StackOverflow. 
Second, our automatic evaluation is highly specific (reliable) -- across all Codex-002-predicted solutions that our evaluation accept, only 1.8% of them are incorrect; we achieve this with multi-criteria metrics, checking both functional correctness by running test cases and surface-form constraints by restricting API usages or keywords. Finally, we proactively defend against memorization by slightly modifying our problems to be different from the original StackOverflow source; consequently, models cannot answer them correctly by memorizing the solutions from pre-training. The current best public system (Codex-002) achieves 43.3% accuracy, leaving ample room for improvement. We release our benchmark at https://ds1000-code-gen.github.io.","source":"Semantic Scholar","year":2022,"language":"en","subjects":["Computer Science"],"doi":"10.48550/arXiv.2211.11501","url":"https://www.semanticscholar.org/paper/8a4fc5f00cd4aca61e148e46a2125c3a406719f1","pdf_url":"https://arxiv.org/pdf/2211.11501","is_open_access":true,"citations":490,"published_at":"","score":80.7},{"id":"ss_7f29044de1a0e5a6d3ec1d33fb6ad482f3d10dd4","title":"Data Science and Analytics: An Overview from Data-Driven Smart Computing, Decision-Making and Applications Perspective","authors":[{"name":"Iqbal H. Sarker"}],"abstract":"The digital world has a wealth of data, such as internet of things (IoT) data, business data, health data, mobile data, urban data, security data, and many more, in the current age of the Fourth Industrial Revolution (Industry 4.0 or 4IR). Extracting knowledge or useful insights from these data can be used for smart decision-making in various applications domains. In the area of data science, advanced analytics methods including machine learning modeling can provide actionable insights or deeper knowledge about data, which makes the computing process automatic and smart. 
In this paper, we present a comprehensive view on “Data Science” including various types of advanced analytics methods that can be applied to enhance the intelligence and capabilities of an application through smart decision-making in different scenarios. We also discuss and summarize ten potential real-world application domains including business, healthcare, cybersecurity, urban and rural data science, and so on by taking into account data-driven smart computing and decision making. Based on this, we finally highlight the challenges and potential research directions within the scope of our study. Overall, this paper aims to serve as a reference point on data science and advanced analytics to the researchers and decision-makers as well as application developers, particularly from the data-driven solution point of view for real-world problems.","source":"Semantic Scholar","year":2021,"language":"en","subjects":["Computer Science","Medicine"],"doi":"10.1007/s42979-021-00765-8","url":"https://www.semanticscholar.org/paper/7f29044de1a0e5a6d3ec1d33fb6ad482f3d10dd4","pdf_url":"https://link.springer.com/content/pdf/10.1007/s42979-021-00765-8.pdf","is_open_access":true,"citations":407,"published_at":"","score":77.21000000000001},{"id":"ss_c082ccfcfe1afc696e371374146ba9380b84061e","title":"The Role of ChatGPT in Data Science: How AI-Assisted Conversational Interfaces Are Revolutionizing the Field","authors":[{"name":"Hossein Hassani"},{"name":"E. Silva"}],"abstract":"ChatGPT, a conversational AI interface that utilizes natural language processing and machine learning algorithms, is taking the world by storm and is the buzzword across many sectors today. Given the likely impact of this model on data science, through this perspective article, we seek to provide an overview of the potential opportunities and challenges associated with using ChatGPT in data science, provide readers with a snapshot of its advantages, and stimulate interest in its use for data science projects. 
The paper discusses how ChatGPT can assist data scientists in automating various aspects of their workflow, including data cleaning and preprocessing, model training, and result interpretation. It also highlights how ChatGPT has the potential to provide new insights and improve decision-making processes by analyzing unstructured data. We then examine the advantages of ChatGPT’s architecture, including its ability to be fine-tuned for a wide range of language-related tasks and generate synthetic data. Limitations and issues are also addressed, particularly around concerns about bias and plagiarism when using ChatGPT. Overall, the paper concludes that the benefits outweigh the costs and ChatGPT has the potential to greatly enhance the productivity and accuracy of data science workflows and is likely to become an increasingly important tool for intelligence augmentation in the field of data science. ChatGPT can assist with a wide range of natural language processing tasks in data science, including language translation, sentiment analysis, and text classification. However, while ChatGPT can save time and resources compared to training a model from scratch, and can be fine-tuned for specific use cases, it may not perform well on certain tasks if it has not been specifically trained for them. 
Additionally, the output of ChatGPT may be difficult to interpret, which could pose challenges for decision-making in data science applications.","source":"Semantic Scholar","year":2023,"language":"en","subjects":["Computer Science"],"doi":"10.3390/bdcc7020062","url":"https://www.semanticscholar.org/paper/c082ccfcfe1afc696e371374146ba9380b84061e","pdf_url":"https://www.mdpi.com/2504-2289/7/2/62/pdf?version=1679985158","is_open_access":true,"citations":332,"published_at":"","score":76.96000000000001},{"id":"ss_4c6e31458b0b44c1e8bd6e58f7d7e0767f7fde44","title":"CRISP-DM Twenty Years Later: From Data Mining Processes to Data Science Trajectories","authors":[{"name":"Fernando Martínez-Plumed"},{"name":"Lidia Contreras-Ochando"},{"name":"C. Ferri"},{"name":"José Hernández Orallo"},{"name":"Meelis Kull"},{"name":"N. Lachiche"},{"name":"M. J. R. Quintana"},{"name":"P. Flach"}],"abstract":"CRISP-DM(CRoss-Industry Standard Process for Data Mining) has its origins in the second half of the nineties and is thus about two decades old. According to many surveys and user polls it is still the de facto standard for developing data mining and knowledge discovery projects. However, undoubtedly the field has moved on considerably in twenty years, with data science now the leading term being favoured over data mining. In this paper we investigate whether, and in what contexts, CRISP-DM is still fit for purpose for data science projects. We argue that if the project is goal-directed and process-driven the process model view still largely holds. On the other hand, when data science projects become more exploratory the paths that the project can take become more varied, and a more flexible model is called for. We suggest what the outlines of such a trajectory-based model might look like and how it can be used to categorise data science projects (goal-directed, exploratory or data management). 
We examine seven real-life exemplars where exploratory activities play an important role and compare them against 51 use cases extracted from the NIST Big Data Public Working Group. We anticipate this categorisation can help project planning in terms of time and cost characteristics.","source":"Semantic Scholar","year":2021,"language":"en","subjects":["Computer Science"],"doi":"10.1109/tkde.2019.2962680","url":"https://www.semanticscholar.org/paper/4c6e31458b0b44c1e8bd6e58f7d7e0767f7fde44","pdf_url":"https://riunet.upv.es/bitstream/10251/184751/3/Martinez-PlumedContreras-OchandoFerri%20-%20CRISP-DM%20Twenty%20Years%20Later%20From%20Data%20Mining%20Processes%20to....pdf","is_open_access":true,"citations":313,"published_at":"","score":74.39},{"id":"ss_4dc6bc8f5c78497470cc2e427a9bda1d49ba50a0","title":"A review of random forest-based feature selection methods for data science education and applications","authors":[{"name":"Reza Iranzad"},{"name":"Xiao Liu"}],"abstract":"","source":"Semantic Scholar","year":2024,"language":"en","subjects":["Computer Science"],"doi":"10.1007/s41060-024-00509-w","url":"https://www.semanticscholar.org/paper/4dc6bc8f5c78497470cc2e427a9bda1d49ba50a0","is_open_access":true,"citations":194,"published_at":"","score":73.82},{"id":"ss_01e863776846ebd1a9a7acc4a9ca24217f953aa2","title":"Data Interpreter: An LLM Agent For Data Science","authors":[{"name":"Sirui Hong"},{"name":"Yizhang Lin"},{"name":"Bangbang Liu"},{"name":"Binhao Wu"},{"name":"Danyang Li"},{"name":"Jiaqi Chen"},{"name":"Jiayi Zhang"},{"name":"Jinlin Wang"},{"name":"Lingyao Zhang"},{"name":"Mingchen Zhuge"},{"name":"Taicheng Guo"},{"name":"Tuo Zhou"},{"name":"Wei Tao"},{"name":"Wenyi Wang"},{"name":"Xiangru Tang"},{"name":"Xiang Lu"},{"name":"Xinbing Liang"},{"name":"Yaying Fei"},{"name":"Yuheng Cheng"},{"name":"Zhibin Gou"},{"name":"Zongze Xu"},{"name":"Chenglin Wu"},{"name":"Li Zhang"},{"name":"Min Yang"},{"name":"Xiawu Zheng"}],"abstract":"Large Language Model (LLM)-based agents 
have shown effectiveness across many applications. However, their use in data science scenarios requiring solving long-term interconnected tasks, dynamic data adjustments and domain expertise remains challenging. Previous approaches primarily focus on individual tasks, making it difficult to assess the complete data science workflow. Moreover, they struggle to handle real-time changes in intermediate data and fail to adapt dynamically to evolving task dependencies inherent to data science problems. In this paper, we present Data Interpreter, an LLM-based agent designed to automatically solve various data science problems end-to-end. Our Data Interpreter incorporates two key modules: 1) Hierarchical Graph Modeling, which breaks down complex problems into manageable subproblems, enabling dynamic node generation and graph optimization; and 2) Programmable Node Generation, a technique that refines and verifies each subproblem to iteratively improve code generation results and robustness. Extensive experiments consistently demonstrate the superiority of Data Interpreter. On InfiAgent-DABench, it achieves a 25% performance boost, raising accuracy from 75.9% to 94.9%. For machine learning and open-ended tasks, it improves performance from 88% to 95%, and from 60% to 97%, respectively. Moreover, on the MATH dataset, Data Interpreter achieves remarkable performance with a 26% improvement compared to state-of-the-art baselines. The code is available at https://github.com/geekan/MetaGPT.","source":"Semantic Scholar","year":2024,"language":"en","subjects":["Computer Science"],"doi":"10.48550/arXiv.2402.18679","url":"https://www.semanticscholar.org/paper/01e863776846ebd1a9a7acc4a9ca24217f953aa2","is_open_access":true,"citations":193,"published_at":"","score":73.78999999999999},{"id":"ss_fb29359d794265c0931d756858a70c9265b5693d","title":"The R Language: An Engine for Bioinformatics and Data Science","authors":[{"name":"F. Giorgi"},{"name":"Carmine Ceraolo"},{"name":"D. 
Mercatelli"}],"abstract":"The R programming language is approaching its 30th birthday, and in the last three decades it has achieved a prominent role in statistics, bioinformatics, and data science in general. It currently ranks among the top 10 most popular languages worldwide, and its community has produced tens of thousands of extensions and packages, with scopes ranging from machine learning to transcriptome data analysis. In this review, we provide an historical chronicle of how R became what it is today, describing all its current features and capabilities. We also illustrate the major tools of R, such as the current R editors and integrated development environments (IDEs), the R Shiny web server, the R methods for machine learning, and its relationship with other programming languages. We also discuss the role of R in science in general as a driver for reproducibility. Overall, we hope to provide both a complete snapshot of R today and a practical compendium of the major features and applications of this programming language.","source":"Semantic Scholar","year":2022,"language":"en","subjects":["Medicine"],"doi":"10.3390/life12050648","url":"https://www.semanticscholar.org/paper/fb29359d794265c0931d756858a70c9265b5693d","pdf_url":"https://www.mdpi.com/2075-1729/12/5/648/pdf?version=1652413277","is_open_access":true,"citations":163,"published_at":"","score":70.89},{"id":"ss_eda28b2d1e761eaaa2f5cd6528579bf65b7bf1ac","title":"Embracing data science in catalysis research","authors":[{"name":"Manu Suvarna"},{"name":"J. 
Pérez‐Ramírez"}],"abstract":"","source":"Semantic Scholar","year":2024,"language":"en","subjects":null,"doi":"10.1038/s41929-024-01150-3","url":"https://www.semanticscholar.org/paper/eda28b2d1e761eaaa2f5cd6528579bf65b7bf1ac","is_open_access":true,"citations":84,"published_at":"","score":70.52000000000001},{"id":"ss_395c978221a21ee47c84a40a2ef11fb4d012fca1","title":"DSBench: How Far Are Data Science Agents from Becoming Data Science Experts?","authors":[{"name":"Liqiang Jing"},{"name":"Zhehui Huang"},{"name":"Xiaoyang Wang"},{"name":"Wenlin Yao"},{"name":"Wenhao Yu"},{"name":"Kaixin Ma"},{"name":"Hongming Zhang"},{"name":"Xinya Du"},{"name":"Dong Yu"}],"abstract":"Large Language Models (LLMs) and Large Vision-Language Models (LVLMs) have demonstrated impressive language/vision reasoning abilities, igniting the recent trend of building agents for targeted applications such as shopping assistants or AI software engineers. Recently, many data science benchmarks have been proposed to investigate their performance in the data science domain. However, existing data science benchmarks still fall short when compared to real-world data science applications due to their simplified settings. To bridge this gap, we introduce DSBench, a comprehensive benchmark designed to evaluate data science agents with realistic tasks. This benchmark includes 466 data analysis tasks and 74 data modeling tasks, sourced from Eloquence and Kaggle competitions. DSBench offers a realistic setting by encompassing long contexts, multimodal task backgrounds, reasoning with large data files and multi-table structures, and performing end-to-end data modeling tasks. Our evaluation of state-of-the-art LLMs, LVLMs, and agents shows that they struggle with most tasks, with the best agent solving only 34.12% of data analysis tasks and achieving a 34.74% Relative Performance Gap (RPG). 
These findings underscore the need for further advancements in developing more practical, intelligent, and autonomous data science agents.","source":"Semantic Scholar","year":2024,"language":"en","subjects":["Computer Science"],"url":"https://www.semanticscholar.org/paper/395c978221a21ee47c84a40a2ef11fb4d012fca1","is_open_access":true,"citations":79,"published_at":"","score":70.37},{"id":"ss_7b5432e589b5f801d99bd27c00349ac651d308e4","title":"A New Era of Learning: Considerations for ChatGPT as a Tool to Enhance Statistics and Data Science Education","authors":[{"name":"Amanda R. Ellis"},{"name":"E. Slade"}],"abstract":"Abstract ChatGPT is one of many generative artificial intelligence (AI) tools that has emerged recently, creating controversy in the education community with concerns about its potential to be used for plagiarism and to undermine students’ ability to think independently. Recent publications have criticized the use of ChatGPT and other generative AI tools in the classroom, with little focus on the potential benefits. This article focuses on the potential of ChatGPT as an educational tool for statistics and data science. It encourages readers to consider the history of trepidation surrounding introducing new technology in the classroom, such as the calculator. We explore the possibility of leveraging ChatGPT’s capabilities in statistics and data science education, providing examples of how ChatGPT can aid in developing course materials and suggestions for how educators can prompt students to interact with ChatGPT responsibly. 
As educators, we can guide the use of generative AI tools in statistics and data science classrooms so that students and educators can leverage the benefits of this technology.","source":"Semantic Scholar","year":2023,"language":"en","subjects":null,"doi":"10.1080/26939169.2023.2223609","url":"https://www.semanticscholar.org/paper/7b5432e589b5f801d99bd27c00349ac651d308e4","pdf_url":"https://www.tandfonline.com/doi/pdf/10.1080/26939169.2023.2223609?needAccess=true\u0026role=button","is_open_access":true,"citations":105,"published_at":"","score":70.15},{"id":"ss_1d214355642847f8c5b9fb6806a4c3f0da0a84c8","title":"AutoKaggle: A Multi-Agent Framework for Autonomous Data Science Competitions","authors":[{"name":"Ziming Li"},{"name":"Qianbo Zang"},{"name":"David Ma"},{"name":"Jiawei Guo"},{"name":"T. Zheng"},{"name":"Minghao Liu"},{"name":"Xinyao Niu"},{"name":"Yue Wang"},{"name":"Jian Yang"},{"name":"Jiaheng Liu"},{"name":"Wanjun Zhong"},{"name":"Wangchunshu Zhou"},{"name":"Wenhao Huang"},{"name":"Ge Zhang"}],"abstract":"Data science tasks involving tabular data present complex challenges that require sophisticated problem-solving approaches. We propose AutoKaggle, a powerful and user-centric framework that assists data scientists in completing daily data pipelines through a collaborative multi-agent system. AutoKaggle implements an iterative development process that combines code execution, debugging, and comprehensive unit testing to ensure code correctness and logic consistency. The framework offers highly customizable workflows, allowing users to intervene at each phase, thus integrating automated intelligence with human expertise. Our universal data science toolkit, comprising validated functions for data cleaning, feature engineering, and modeling, forms the foundation of this solution, enhancing productivity by streamlining common tasks. We selected 8 Kaggle competitions to simulate data processing workflows in real-world application scenarios. 
Evaluation results demonstrate that AutoKaggle achieves a validation submission rate of 0.85 and a comprehensive score of 0.82 in typical data science pipelines, fully proving its effectiveness and practicality in handling complex data science tasks.","source":"Semantic Scholar","year":2024,"language":"en","subjects":["Computer Science"],"doi":"10.48550/arXiv.2410.20424","url":"https://www.semanticscholar.org/paper/1d214355642847f8c5b9fb6806a4c3f0da0a84c8","is_open_access":true,"citations":58,"published_at":"","score":69.74000000000001},{"id":"ss_612a8b0172d3eccc945762bfa5aed6e0ee7a6680","title":"Benchmarking Data Science Agents","authors":[{"name":"Yuge Zhang"},{"name":"Qiyang Jiang"},{"name":"Xingyu Han"},{"name":"Nan Chen"},{"name":"Yuqing Yang"},{"name":"Kan Ren"}],"abstract":"In the era of data-driven decision-making, the complexity of data analysis necessitates advanced expertise and tools of data science, presenting significant challenges even for specialists. Large Language Models (LLMs) have emerged as promising aids as data science agents, assisting humans in data analysis and processing. Yet their practical efficacy remains constrained by the varied demands of real-world applications and complicated analytical process. In this paper, we introduce DSEval -- a novel evaluation paradigm, as well as a series of innovative benchmarks tailored for assessing the performance of these agents throughout the entire data science lifecycle. Incorporating a novel bootstrapped annotation method, we streamline dataset preparation, improve the evaluation coverage, and expand benchmarking comprehensiveness. 
Our findings uncover prevalent obstacles and provide critical insights to inform future advancements in the field.","source":"Semantic Scholar","year":2024,"language":"en","subjects":["Computer Science"],"doi":"10.48550/arXiv.2402.17168","url":"https://www.semanticscholar.org/paper/612a8b0172d3eccc945762bfa5aed6e0ee7a6680","is_open_access":true,"citations":37,"published_at":"","score":69.11},{"id":"doaj_10.1038/s41598-025-15011-x","title":"Vehicular ad hoc networks verification scheme based on bilinear pairings and networks reverse fuzzy extraction","authors":[{"name":"Zaid Ameen Abduljabbar"},{"name":"Vincent Omollo Nyangaresi"},{"name":"Ahmed Ali Ahmed"},{"name":"Junchao Ma"},{"name":"Mustafa A. Al Sibahee"},{"name":"Mohammed Abdulridha Hussain"},{"name":"Zaid Alaa Hussien"},{"name":"Ali Hasan Ali"},{"name":"Abdulla J. Y. Aldarwish"},{"name":"Husam A. Neamah"}],"abstract":"Vehicular Ad-Hoc Networks (VANETs) have facilitated the massive exchange of real-time traffic and weather conditions, which has helped prevent collisions, reduce accidents, and ease road congestion. This can effectively enhance driving safety and efficiency in technology-driven transportation systems. However, the transmission of massive and sensitive information across public wireless communication channels exposes the transmitted data to a myriad of privacy and security threats. Although past research has developed many vehicular ad-hoc network security-preservation schemes, several of them are inefficient or susceptible to attacks. This work introduces an approach that leverages reverse fuzzy extraction, bilinear pairing, and Physically Unclonable Function (PUF) to design an efficient and anonymity-preserving authentication scheme. We conduct an elaborate formal security analysis to demonstrate that the derived session key is secure. 
The semantic security analyses also demonstrate its resilience against typical VANET attacks such as impersonations, denial of service, and de-synchronization, instilling confidence in its effectiveness. Moreover, our approach incurs the lowest computational overheads at relatively low communication costs. Specifically, our protocol attains a 66.696% reduction in computation costs, and a 70% increment in the supported security functionalities.","source":"DOAJ","year":2025,"language":"","subjects":["Medicine","Science"],"doi":"10.1038/s41598-025-15011-x","url":"https://doi.org/10.1038/s41598-025-15011-x","is_open_access":true,"published_at":"","score":69},{"id":"doaj_10.3847/1538-4357/adb61d","title":"The JWST Resolved Stellar Populations Early Release Science Program. VIII. The Spatially Resolved Star Formation History of WLM","authors":[{"name":"Roger E. Cohen"},{"name":"Kristen B. W. McQuinn"},{"name":"Alessandro Savino"},{"name":"Max J. B. Newman"},{"name":"Daniel R. Weisz"},{"name":"Andrew E. Dolphin"},{"name":"Martha L. Boyer"},{"name":"Matteo Correnti"},{"name":"Marla C. Geha"},{"name":"Mario Gennaro"},{"name":"Karoline M. Gilbert"},{"name":"Nitya Kallivayalil"},{"name":"Jack T. Warfield"},{"name":"Benjamin F. Williams"},{"name":"Alyson M. Brooks"},{"name":"Andrew A. Cole"},{"name":"Evan D. Skillman"},{"name":"Christopher T. Garling"},{"name":"Jason S. Kalirai"},{"name":"Jay Anderson"}],"abstract":"Radial stellar population gradients within dwarf galaxies provide a promising avenue for disentangling the drivers of galaxy evolution, including environment. Within the Local Volume, radial stellar age gradient slopes correlate with interaction history, contrary to model predictions, so dwarfs that are isolated provide a critical control sample. 
We measure radial stellar age gradients in the relatively isolated gas-rich dwarf irregular Wolf–Lundmark–Melotte Galaxy (WLM), combining JWST NIRCam and NIRISS imaging with six archival Hubble Space Telescope fields over semimajor axis equivalent distances of 0 ≲ R_SMA ≲ 4 kpc (≲3 R_hl). Fitting lifetime star formation histories to resolved color–magnitude diagrams, we quantify radial age gradients using τ_90 and τ_50, the lookback times to form 90% and 50% of the cumulative stellar mass. We find that globally, the outskirts of WLM are older on average, with (δτ_90, δτ_50)/δR_SMA = (0.82 ± 0.10, 1.60 +0.23/−0.22) Gyr kpc^−1 (stat.), in good agreement with simulations. However, we also detect an azimuthal dependence of radial stellar age gradients, finding that stars on the leading edge of WLM (relative to its proper motion) are both younger and have a flatter age gradient compared to the trailing edge. This difference persists over 0.6 ≲ R_SMA ≲ 3.2 kpc (∼0.5–2.5 R_hl) and lookback times up to ∼8 Gyr, and is robust to the assumed stellar evolutionary model. Our results are consistent with star formation triggered by ram pressure stripping from a circumgalactic and/or intergalactic medium, suggested by recent H I observations. 
If confirmed, processes typifying dense environments, such as ram pressure stripping, may be more relevant to the evolution of isolated galaxies than previously thought.","source":"DOAJ","year":2025,"language":"","subjects":["Astrophysics"],"doi":"10.3847/1538-4357/adb61d","url":"https://doi.org/10.3847/1538-4357/adb61d","is_open_access":true,"published_at":"","score":69},{"id":"doaj_10.3390/app15052383","title":"A Transformer-Based Approach for Efficient Geometric Feature Extraction from Vector Shape Data","authors":[{"name":"Longfei Cui"},{"name":"Xinyu Niu"},{"name":"Haizhong Qian"},{"name":"Xiao Wang"},{"name":"Junkui Xu"}],"abstract":"The extraction of shape features from vector elements is essential in cartography and geographic information science, supporting a range of intelligent processing tasks. Traditional methods rely on different machine learning algorithms tailored to specific types of line and polygon elements, limiting their general applicability. This study introduces a novel approach called “Pre-Trained Shape Feature Representations from Transformers (PSRT)”, which utilizes transformer encoders designed with three self-supervised pre-training tasks: coordinate masking prediction, coordinate offset correction, and coordinate sequence rearrangement. This approach enables the extraction of general shape features applicable to both line and polygon elements, generating high-dimensional embedded feature vectors. These vectors facilitate downstream tasks like shape classification, pattern recognition, and cartographic generalization. Our experimental results show that PSRT can extract vector shape features effectively without needing labeled samples and is adaptable to various types of vector features. Compared to the methods without pre-training, PSRT enhances training efficiency by over five times and improves accuracy by 5–10% in tasks such as line element matching and polygon shape classification. 
This innovative approach offers a more unified, efficient solution for processing vector shape data across different applications.","source":"DOAJ","year":2025,"language":"","subjects":["Technology","Engineering (General). Civil engineering (General)","Biology (General)","Physics","Chemistry"],"doi":"10.3390/app15052383","url":"https://www.mdpi.com/2076-3417/15/5/2383","is_open_access":true,"published_at":"","score":69},{"id":"doaj_10.1088/2632-2153/ae1f05","title":"Towards instance-wise calibration: local amortized diagnostics and reshaping of conditional densities (LADaR)","authors":[{"name":"Biprateep Dey"},{"name":"David Zhao"},{"name":"Brett H Andrews"},{"name":"Jeffrey A Newman"},{"name":"Rafael Izbicki"},{"name":"Ann B Lee"}],"abstract":"Key science questions, such as galaxy distance estimation and weather forecasting, often require knowing the full predictive distribution of a target variable Y given complex inputs X . Despite recent advances in machine learning and physics-based models, it remains challenging to assess whether an initial model is calibrated for all x , and when needed, to reshape the densities of y toward ‘instance-wise’ calibration. This paper introduces the local amortized diagnostics and reshaping of conditional densities (LADaR) framework and proposes a new computationally efficient algorithm ( Cal-PIT ) that produces interpretable local diagnostics and provides a mechanism for adjusting conditional density estimates (CDEs). Cal-PIT learns a single interpretable local probability–probability map from calibration data that identifies where and how the initial model is miscalibrated across feature space, which can be used to morph CDEs such that they are well-calibrated. We illustrate the LADaR framework on synthetic examples, including probabilistic forecasting from image sequences, akin to predicting storm wind speed from satellite imagery. 
Our main science application involves estimating the probability density functions of galaxy distances given photometric data, where Cal-PIT achieves better instance-wise calibration than all 11 other literature methods in a benchmark data challenge, demonstrating its utility for next-generation cosmological analyses.","source":"DOAJ","year":2025,"language":"","subjects":["Computer engineering. Computer hardware","Electronic computers. Computer science"],"doi":"10.1088/2632-2153/ae1f05","url":"https://doi.org/10.1088/2632-2153/ae1f05","is_open_access":true,"published_at":"","score":69}],"total":44665592,"page":1,"page_size":20,"sources":["DOAJ","CrossRef","arXiv","Semantic Scholar"],"query":"data science"}