Semantic Scholar · Open Access · 2025 · 2 citations

AstroMLab 5: Structured Summaries and Concept Extraction for 400,000 Astrophysics Papers

Y.-S. Ting, Alberto Accomazzi, Tirthankar Ghosal, T. Nguyen, Rui Pan, +2 others

Abstract

We present a dataset of 408,590 astrophysics papers from arXiv (astro-ph), spanning 1992 through July 2025. Each paper has been processed through a multi-stage pipeline to produce: (1) structured summaries organized into six semantic sections (Background, Motivation, Methodology, Results, Interpretation, Implication), and (2) concept extraction yielding 9,999 unique concepts with detailed descriptions. The dataset contains 3.8 million paper-concept associations and includes semantic embeddings for all concepts. Comparison with traditional ADS keywords reveals that the concepts provide denser coverage and more uniform distribution, while analysis of embedding space structure demonstrates that concepts are semantically dispersed within papers, enabling discovery through multiple diverse entry points. Concept vocabulary and embeddings are publicly released at https://github.com/tingyuansen/astro-ph_knowledge_graph.
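The abstract states that concepts are "semantically dispersed within papers" but this record does not say how that dispersion is measured. As a rough illustration only, the sketch below uses one common proxy: the mean pairwise cosine distance between the embeddings of the concepts attached to a paper. The concept names, the three-dimensional vectors, the paper IDs, and the function names are all hypothetical and are not taken from the released dataset; the paper's own metric may differ.

```python
import math
from itertools import combinations


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


def mean_pairwise_distance(concept_ids, embeddings):
    """Mean cosine distance (1 - similarity) over all concept pairs.

    Returns 0.0 when a paper has fewer than two concepts, since there
    are no pairs to compare.
    """
    pairs = list(combinations(concept_ids, 2))
    if not pairs:
        return 0.0
    total = sum(
        1.0 - cosine_similarity(embeddings[a], embeddings[b]) for a, b in pairs
    )
    return total / len(pairs)


# Hypothetical toy data: real concept embeddings would be much higher-dimensional.
toy_embeddings = {
    "dark matter halo": [1.0, 0.0, 0.0],
    "stellar spectroscopy": [0.0, 1.0, 0.0],
    "bayesian inference": [0.0, 0.0, 1.0],
    "halo mass function": [0.9, 0.1, 0.0],
}

# Hypothetical paper-concept associations.
paper_concepts = {
    "paper_a": ["dark matter halo", "stellar spectroscopy", "bayesian inference"],
    "paper_b": ["dark matter halo", "halo mass function"],
}

dispersion = {
    paper_id: mean_pairwise_distance(concepts, toy_embeddings)
    for paper_id, concepts in paper_concepts.items()
}

for paper_id, value in dispersion.items():
    print(f"{paper_id}: mean pairwise cosine distance = {value:.3f}")
```

In this toy setup, a paper whose concepts point in unrelated directions scores close to 1, while a paper whose concepts are near-synonyms scores close to 0. A higher score would correspond to the "multiple diverse entry points" the abstract describes.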

Authors (7)

Y.-S. Ting
Alberto Accomazzi
Tirthankar Ghosal
T. Nguyen
Rui Pan
Ze-Chang Sun
Tijmen de Haan

Citation Format

Ting, Y., Accomazzi, A., Ghosal, T., Nguyen, T., Pan, R., Sun, Z. et al. (2025). AstroMLab 5: Structured Summaries and Concept Extraction for 400,000 Astrophysics Papers. https://doi.org/10.18653/v1/2025.wasp-main.19

Journal Information

Publication Year: 2025
Language: en
Total Citations: 2
Source Database: Semantic Scholar
DOI: 10.18653/v1/2025.wasp-main.19
Access: Open Access