DOAJ Open Access 2025

An open dataset of Chinese duration expressions

Si-Qi Zhang, Jia-Wen Niu, Xiaoqian Liu, Xiao-Yang Sui, Li-Lin Rao

Abstract

Duration information is essential for understanding and analyzing our world. In textual contexts, duration information is typically conveyed in two formats: numeric (e.g., 1 hour) and verbal (e.g., shortly). To analyze duration information in text, it is crucial to understand how people map duration expressions to corresponding numerical duration. However, the literature has yet to provide lexicons supporting such conversion. Furthermore, existing databases of time-related expressions often lack information about word frequency – a robust predictor of information processing. This article reports an open dataset of 2,101 Chinese duration expressions, each annotated with its corresponding numerical duration. To obtain high-quality data for word frequency, we obtained the frequency of each duration expression from a large-scale corpus of 10 billion Chinese characters (BLCU Corpus Center (BCC) Corpus) and computed an adjusted frequency for each expression. This dataset provides a valuable resource for research on temporal information in Chinese, facilitating studies in natural language processing, psychology, and linguistics.

Authors (5)

Si-Qi Zhang

Jia-Wen Niu

Xiaoqian Liu

Xiao-Yang Sui

Li-Lin Rao

Citation Format

Zhang, S., Niu, J., Liu, X., Sui, X., Rao, L. (2025). An open dataset of Chinese duration expressions. https://doi.org/10.1038/s41597-025-06016-2

Journal Information
Year of Publication: 2025
Source Database: DOAJ
DOI: 10.1038/s41597-025-06016-2
Access: Open Access