arXiv Open Access 2025

FLUKE: A Linguistically-Driven and Task-Agnostic Framework for Robustness Evaluation

Yulia Otmakhova, Hung Thinh Truong, Rahmad Mahendra, Zenan Zhai, Rongxin Zhu, +2 others

Abstract

We present FLUKE (Framework for LingUistically-driven and tasK-agnostic robustness Evaluation), a framework for assessing model robustness through systematic minimal variations of test data. FLUKE introduces controlled variations across linguistic levels, from orthography to dialect and style, and leverages large language models (LLMs) with human validation to generate the modifications. We demonstrate FLUKE's utility by evaluating both fine-tuned models and LLMs across six diverse NLP tasks (four classification and two generation tasks), and reveal that (1) the impact of linguistic variations is highly task-dependent, with some tests being critical for certain tasks but irrelevant for others; (2) LLMs still exhibit significant brittleness to certain linguistic variations, with reasoning LLMs surprisingly showing less robustness than base models on some tasks, and scaling improving robustness only for surface-level modifications; (3) models are overall more brittle to natural, fluent modifications such as syntax or style changes (and especially to negation) than to corruption-style tests such as letter flipping; (4) the ability of a model to use a linguistic feature in generation does not correlate with its robustness to that feature on downstream tasks. These findings highlight the importance of systematic robustness testing for understanding model behavior.
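To make the idea of a minimal-variation test concrete, the following is a small sketch of one corruption-style test (letter flipping) and a simple way to measure how often a model's prediction changes under it. It is not the authors' implementation: FLUKE generates its modifications with LLMs and human validation, whereas this sketch uses a rule-based flip purely for illustration. The names `flip_adjacent_letters`, `prediction_change_rate`, and `toy_sentiment` are hypothetical, and the toy classifier stands in for a real model.

```python
import random
from typing import Callable, List


def flip_adjacent_letters(text: str, rng: random.Random) -> str:
    """Swap one randomly chosen pair of adjacent letters (hypothetical letter-flip test).

    If the two chosen letters are identical, the text comes back unchanged.
    """
    # Candidate positions: index i where both text[i] and text[i + 1] are letters.
    positions = [
        i for i in range(len(text) - 1)
        if text[i].isalpha() and text[i + 1].isalpha()
    ]
    if not positions:
        return text
    i = rng.choice(positions)
    return text[:i] + text[i + 1] + text[i] + text[i + 2:]


def prediction_change_rate(
    model: Callable[[str], str],
    originals: List[str],
    modify: Callable[[str], str],
) -> float:
    """Fraction of inputs whose predicted label differs after modification."""
    if not originals:
        return 0.0
    changed = sum(model(text) != model(modify(text)) for text in originals)
    return changed / len(originals)


def toy_sentiment(text: str) -> str:
    """Stand-in classifier (an assumption for this sketch, not a real model)."""
    return "positive" if "good" in text.lower().split() else "negative"


if __name__ == "__main__":
    rng = random.Random(0)
    samples = ["a good film", "a dull film", "good acting throughout"]
    rate = prediction_change_rate(
        toy_sentiment, samples, lambda t: flip_adjacent_letters(t, rng)
    )
    print(f"prediction change rate under letter flipping: {rate:.2f}")
```

The same `prediction_change_rate` helper could be reused with any other modification function (for example, a negation or style rewrite), which is what would allow brittleness to be compared across test types under these assumptions.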

Authors (7)

Yulia Otmakhova
Hung Thinh Truong
Rahmad Mahendra
Zenan Zhai
Rongxin Zhu
Daniel Beck
Jey Han Lau

Citation Format

Otmakhova, Y., Truong, H.T., Mahendra, R., Zhai, Z., Zhu, R., Beck, D., & Lau, J.H. (2025). FLUKE: A Linguistically-Driven and Task-Agnostic Framework for Robustness Evaluation. https://arxiv.org/abs/2504.17311

Journal Information
Publication Year: 2025
Language: en
Source Database: arXiv
Access: Open Access