arXiv Open Access 2025

Learning the Topic, Not the Language: How LLMs Classify Online Immigration Discourse Across Languages

Andrea Nasuto, Stefano Maria Iacus, Francisco Rowe, Devika Jain

Abstract

Large language models (LLMs) offer new opportunities for scalable analysis of online discourse. Yet their use in multilingual social science research remains constrained by model size, cost, and linguistic bias. We develop a lightweight, open-source LLM framework using fine-tuned LLaMA 3.2-3B models to classify immigration-related tweets across 13 languages. Unlike prior work relying on BERT-style models or translation pipelines, we combine topic classification with stance detection and demonstrate that LLMs fine-tuned in just one or two languages can generalize topic understanding to unseen languages. Capturing ideological nuance, however, benefits from multilingual fine-tuning. Our approach corrects pretraining biases with minimal data from under-represented languages and avoids reliance on proprietary systems. With 26-168x faster inference and over 1000x cost savings compared to commercial LLMs, our method supports real-time analysis of billions of tweets. This scale-first framework enables inclusive, reproducible research on public attitudes across linguistic and cultural contexts.

Citation Format

Nasuto, A., Iacus, S.M., Rowe, F., Jain, D. (2025). Learning the Topic, Not the Language: How LLMs Classify Online Immigration Discourse Across Languages. https://arxiv.org/abs/2508.06435

Journal Information
Year of Publication: 2025
Language: en
Source Database: arXiv
Access: Open Access ✓