arXiv Open Access 2024

A Systematic Review of NeurIPS Dataset Management Practices

Yiwei Wu, Leah Ajmani, Shayne Longpre, Hanlin Li

Abstract

As new machine learning methods demand larger training datasets, researchers and developers face significant challenges in dataset management. Although ethics reviews, documentation, and checklists have been established, it remains uncertain whether consistent dataset management practices exist across the community. This lack of a comprehensive overview hinders our ability to diagnose and address fundamental tensions and ethical issues related to managing large datasets. We present a systematic review of datasets published at the NeurIPS Datasets and Benchmarks track, focusing on four key aspects: provenance, distribution, ethical disclosure, and licensing. Our findings reveal that dataset provenance is often unclear due to ambiguous filtering and curation processes. Additionally, a variety of sites are used for dataset hosting, but only a few offer structured metadata and version control. These inconsistencies underscore the urgent need for standardized data infrastructures for the publication and management of datasets.

Citation Format

Wu, Y., Ajmani, L., Longpre, S., Li, H. (2024). A Systematic Review of NeurIPS Dataset Management Practices. https://arxiv.org/abs/2411.00266

Journal Information
Publication year: 2024
Language: en
Source database: arXiv
Access: Open Access