
TF1-EN-3M: Three Million Synthetic Moral Fables for Training Small, Open Language Models

Mihai Nadas, Laura Diosan, Andrei Piscoran, Andreea Tomescu

Abstract

Moral stories are a time-tested vehicle for transmitting values, yet modern NLP lacks a large, structured corpus that couples coherent narratives with explicit ethical lessons. We close this gap with TF1-EN-3M, the first open dataset of three million English-language fables generated exclusively by instruction-tuned models no larger than 8B parameters. Each story follows a six-slot scaffold (character -> trait -> setting -> conflict -> resolution -> moral), produced through a combinatorial prompt engine that guarantees genre fidelity while covering a broad thematic space. A hybrid evaluation pipeline blends (i) a GPT-based critic that scores grammar, creativity, moral clarity, and template adherence with (ii) reference-free diversity and readability metrics. Among ten open-weight candidates, an 8B-parameter Llama-3 variant delivers the best quality-speed trade-off, producing high-scoring fables on a single consumer GPU (<24 GB VRAM) at approximately 13.5 cents per 1,000 fables. We release the dataset, generation code, evaluation scripts, and full metadata under a permissive license, enabling exact reproducibility and cost benchmarking. TF1-EN-3M opens avenues for research in instruction following, narrative intelligence, value alignment, and child-friendly educational AI, demonstrating that large-scale moral storytelling no longer requires proprietary giant models.
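To illustrate the combinatorial prompt engine the abstract describes, below is a minimal Python sketch that builds prompts by taking the Cartesian product over the six scaffold slots. The slot values, prompt wording, and function names are hypothetical placeholders for illustration only; they are not taken from the released generation code.

# Minimal sketch of a combinatorial prompt engine over the six-slot fable
# scaffold (character -> trait -> setting -> conflict -> resolution -> moral).
# All slot values and the prompt template are illustrative assumptions.
from itertools import product

# Hypothetical example values for each scaffold slot.
SLOTS = {
    "character": ["a curious fox", "a patient tortoise", "a boastful crow"],
    "trait": ["greedy", "humble", "impulsive"],
    "setting": ["a sunlit meadow", "a busy riverbank"],
    "conflict": ["a scarce harvest", "a broken promise"],
    "resolution": ["an act of sharing", "a quiet apology"],
    "moral": ["kindness is repaid in kind", "patience outlasts pride"],
}

PROMPT_TEMPLATE = (
    "Write a short fable about {character} who is {trait}, set in {setting}. "
    "The story must revolve around {conflict}, be resolved through {resolution}, "
    "and end with the explicit moral: \"{moral}\"."
)

def generate_prompts(slots, template):
    """Yield one prompt per combination of slot values (Cartesian product)."""
    keys = list(slots)
    for combo in product(*(slots[k] for k in keys)):
        yield template.format(**dict(zip(keys, combo)))

if __name__ == "__main__":
    prompts = list(generate_prompts(SLOTS, PROMPT_TEMPLATE))
    print(f"{len(prompts)} prompts generated from the toy slot values above")
    print(prompts[0])

With the toy values above the product yields 3 x 3 x 2 x 2 x 2 x 2 = 144 distinct prompts; the paper's engine presumably draws on much larger slot inventories to cover three million unique fables.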

Authors (4)

Mihai Nadas

Laura Diosan

Andrei Piscoran

Andreea Tomescu

Citation Format

Nadas, M., Diosan, L., Piscoran, A., & Tomescu, A. (2025). TF1-EN-3M: Three Million Synthetic Moral Fables for Training Small, Open Language Models. https://arxiv.org/abs/2504.20605

Journal Information
Publication Year
2025
Language
en
Database Source
arXiv
Access
Open Access ✓