Semantic Scholar Open Access 2020 88 citations

Auto-Suggest: Learning-to-Recommend Data Preparation Steps Using Data Science Notebooks

Cong Yan, Yeye He

Abstract

Data preparation is widely recognized as the most time-consuming process in modern business intelligence (BI) and machine learning (ML) projects. Automating complex data preparation steps (e.g., Pivot, Unpivot, Normalize-JSON, etc.) holds the potential to greatly improve user productivity, and has therefore become a central focus of research. We propose a novel approach to "auto-suggest" contextualized data preparation steps, by "learning" from how data scientists manipulate data, as documented in the data science notebooks widely available today. Specifically, we crawled over 4M Jupyter notebooks on GitHub and replayed them step by step, to observe not only the full input/output tables (data-frames) at each step, but also the exact data-preparation choices that data scientists believe are best suited to the input data (e.g., how input tables are Joined/Pivoted/Unpivoted, etc.). By essentially "logging" how data scientists interact with diverse tables, and using the resulting logs as a proxy for "ground truth", we can learn-to-recommend the data preparation steps best suited to given user data, much as search engines (Google or Bing) leverage their click-through logs to learn-to-rank documents. This data-driven and log-driven approach leverages the "collective wisdom" of data scientists embodied in the notebooks, and is shown to significantly outperform strong baselines, including commercial systems, in terms of accuracy.
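
For readers unfamiliar with the Pivot and Unpivot steps named in the abstract, the following is a minimal sketch of what the two operations do to a table. It is an illustration only, not the paper's recommendation method: the helper names `pivot` and `unpivot`, their parameters, and the sample sales rows are all hypothetical, and plain Python lists of dictionaries stand in for data-frames.

```python
from collections import defaultdict


def pivot(rows, index, columns, values):
    """Long to wide: one output row per `index` value, and one output
    column per distinct value found in the `columns` field."""
    wide = defaultdict(dict)
    for row in rows:
        wide[row[index]][row[columns]] = row[values]
    # Dicts keep insertion order, so output rows follow first appearance.
    return [{index: key, **cells} for key, cells in wide.items()]


def unpivot(rows, index, var_name, value_name):
    """Wide to long: every non-index column becomes a (variable, value) row."""
    return [
        {index: row[index], var_name: col, value_name: val}
        for row in rows
        for col, val in row.items()
        if col != index
    ]


# Hypothetical sample data in long format.
long_rows = [
    {"city": "Oslo", "year": 2019, "sales": 10},
    {"city": "Oslo", "year": 2020, "sales": 12},
    {"city": "Lima", "year": 2019, "sales": 7},
    {"city": "Lima", "year": 2020, "sales": 9},
]

wide_rows = pivot(long_rows, index="city", columns="year", values="sales")
round_trip = unpivot(wide_rows, index="city", var_name="year", value_name="sales")
```

Even in this toy version, the caller has to decide which field plays the role of `index`, `columns`, and `values`. Choices of that kind are what the abstract describes observing in replayed notebooks and then recommending for new input tables.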

Topics & Keywords

Computer Science

Citation

Yan, C., & He, Y. (2020). Auto-Suggest: Learning-to-Recommend Data Preparation Steps Using Data Science Notebooks. https://doi.org/10.1145/3318464.3389738

Journal Information

Publication Year: 2020
Language: en
Total Citations: 88
Source Database: Semantic Scholar
DOI: 10.1145/3318464.3389738
Access: Open Access