arXiv Open Access 2025

Automatic Extraction of Clausal Embedding Based on Large-Scale English Text Data

Iona Carslaw, Sivan Milton, Nicolas Navarre, Ciyang Qing, Wataru Uegaki

Abstract

Embedded clauses have long been of special interest to linguists because of their intricate distribution of syntactic and semantic features. Yet current research relies on schematically constructed example sentences to investigate these constructions, and so misses the statistical information and naturally occurring examples that large language corpora can provide. We therefore present a methodological approach for detecting and annotating naturally occurring examples of English embedded clauses in large-scale text data, using constituency parsing and a set of parsing heuristics. We evaluate our tool on our dataset, the Golden Embedded Clause Set (GECS), which consists of hand-annotated, naturally occurring English sentences containing embedded clauses. Finally, we present a large-scale dataset of naturally occurring English embedded clauses, which we extracted from the open-source corpus Dolma using our extraction tool.
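To make the general idea of heuristic extraction over constituency parses concrete, here is a minimal sketch. It is not the authors' tool: the bracketed parse string, the helper names (`parse_bracketed`, `leaves`, `embedded_clauses`), and the single heuristic used (an `SBAR` node that is a sister of a verb inside a `VP`) are all assumptions chosen for illustration, and the parse is written by hand in Penn Treebank style rather than produced by a parser.

```python
# Illustrative sketch only: the heuristic and all names here are assumptions,
# not the extraction rules described in the paper.

def parse_bracketed(text):
    """Parse a Penn-Treebank-style bracketed string into (label, children) tuples."""
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def read():
        nonlocal pos
        pos += 1                      # skip "("
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(read())
            else:
                children.append(tokens[pos])  # a leaf word
                pos += 1
        pos += 1                      # skip ")"
        return (label, children)

    return read()


def leaves(node):
    """Return the words dominated by a node, in surface order."""
    if isinstance(node, str):
        return [node]
    words = []
    for child in node[1]:
        words.extend(leaves(child))
    return words


def embedded_clauses(tree):
    """Collect (verb, clause) pairs for every VP that has a verb and an SBAR child."""
    results = []

    def walk(node):
        if isinstance(node, str):
            return
        label, children = node
        if label == "VP":
            subtrees = [c for c in children if not isinstance(c, str)]
            verbs = [c for c in subtrees if c[0].startswith("VB")]
            clauses = [c for c in subtrees if c[0] == "SBAR"]
            if verbs:
                for clause in clauses:
                    results.append(
                        (" ".join(leaves(verbs[0])), " ".join(leaves(clause)))
                    )
        for child in children:
            walk(child)

    walk(tree)
    return results


# Hand-written example parse (assumed, not parser output).
PARSE = (
    "(S (NP (NNP Mary)) (VP (VBZ knows) "
    "(SBAR (IN that) (S (NP (NNP John)) (VP (VBD left))))))"
)

print(embedded_clauses(parse_bracketed(PARSE)))
```

A single rule like this would over- and under-generate on real text (adverbial clauses, relative clauses, and complementizer-less complements all need separate handling), which is presumably why an approach of this kind relies on a set of heuristics and a hand-annotated evaluation set.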

Topics & Keywords

cs.CL

Citation

Carslaw, I., Milton, S., Navarre, N., Qing, C., & Uegaki, W. (2025). Automatic Extraction of Clausal Embedding Based on Large-Scale English Text Data. https://arxiv.org/abs/2506.14064

Journal Information

Publication Year: 2025
Language: en
Source Database: arXiv
Access: Open Access ✓