Efficient detection of AI-generated scientific abstracts with a lightweight transformer
Abstract
The rapid advance of large language models challenges the authenticity of scientific writing, creating a need for reliable methods of detecting AI-generated scientific text. This paper addresses the challenge by developing and evaluating an efficient text classifier. We first constructed a balanced dataset, focusing initially on the Computer Vision (cs.CV) domain and subsequently expanding to four additional diverse scientific domains (5,000 abstracts in total), using human-written samples from arXiv and corresponding AI-generated versions produced with Google's Gemini 2.0 Flash. We then fine-tuned a lightweight Transformer model, DistilBERT, for the classification task. On the primary in-domain (cs.CV) test set, our approach achieved an accuracy of 99.4% and an area under the ROC curve of 0.9999. Cross-domain evaluation demonstrated robust generalization (macro-F1 = 0.948). Further analysis showed that the model surpasses traditional machine-learning baselines not only in accuracy but also in robustness, as it learns deep semantic patterns rather than relying on superficial statistical cues. This work provides a practical, high-performance tool for safeguarding scientific authenticity and establishes a valuable benchmark for future research in AI text detection.
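The classification setup the abstract describes (a DistilBERT model with a two-way sequence-classification head, fine-tuned on labelled abstracts) can be sketched with the Hugging Face Transformers API. This is a minimal illustration only: it builds a tiny randomly initialised model from a config rather than loading pretrained weights, and the label mapping (0 = human-written, 1 = AI-generated) is an assumption, not taken from the paper.

```python
import torch
from transformers import DistilBertConfig, DistilBertForSequenceClassification

# Tiny DistilBERT-style classifier, randomly initialised (no download).
# The paper fine-tunes the full pretrained DistilBERT on ~5,000 abstracts;
# here we only demonstrate the two-label classification interface.
config = DistilBertConfig(
    vocab_size=30522,   # standard DistilBERT vocabulary size
    dim=64,             # shrunk dimensions for a self-contained demo
    n_layers=2,
    n_heads=2,
    hidden_dim=128,
    num_labels=2,       # assumed mapping: 0 = human-written, 1 = AI-generated
)
model = DistilBertForSequenceClassification(config)
model.eval()

# Dummy token ids standing in for a tokenised abstract.
input_ids = torch.randint(0, config.vocab_size, (1, 32))
with torch.no_grad():
    logits = model(input_ids=input_ids).logits  # shape: (batch, num_labels)
pred = logits.argmax(dim=-1).item()
```

In the real pipeline one would instead load `distilbert-base-uncased` with pretrained weights, tokenise each abstract, and fine-tune on the balanced human/AI dataset before reading off predictions.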
Authors (2)
Cuilian Zhang
Weijun Zhou
Quick Access
- Year of Publication: 2026
- Language: en
- Total Citations: 1×
- Source Database: Semantic Scholar
- DOI: 10.1038/s41598-026-35203-3
- Access: Open Access ✓