The Impact of Copyrighted Material on Large Language Models: A Norwegian Perspective
Abstract
The use of copyrighted materials in training language models raises critical legal and ethical questions. This paper presents a framework for empirically assessing the impact of publisher-controlled copyrighted corpora on the performance of generative large language models (LLMs) for Norwegian, along with the results of that assessment. Evaluating on a diverse set of tasks, we found that adding both books and newspapers to the data mixture of LLMs tends to improve their performance, while the addition of fiction works seems to be detrimental. Our experiments could inform the creation of a compensation scheme for authors whose works contribute to AI development.
Authors (19)
Javier de la Rosa
Vladislav Mikhailov
Lemei Zhang
Freddy Wetjen
David Samuel
Peng Liu
Rolv-Arild Braaten
Petter Mæhlum
Magnus Breder Birkenes
Andrey Kutuzov
Tita Enstad
Hans Christian Farsethås
Svein Arne Brygfjeld
Jon Atle Gulla
Stephan Oepen
Erik Velldal
Wilfred Østgulen
Lilja Øvrelid
Aslak Sira Myhre
Quick Access
- Publication Year: 2024
- Language: en
- Source Database: arXiv
- Access: Open Access ✓