arXiv Open Access 2025

SCALEFeedback: A Large-Scale Dataset of Synthetic Computer Science Assignments for LLM-generated Educational Feedback Research

Keyang Qian Kaixun Yang Wei Dai Flora Jin Yixin Cheng +6 lainnya
Lihat Sumber

Abstrak

Using LLMs to give educational feedback to students for their assignments has attracted much attention in the AI in Education field. Yet, there is currently no large-scale open-source dataset of student assignments that includes detailed assignment descriptions, rubrics, and student submissions across various courses. As a result, research on generalisable methodology for automatic generation of effective and responsible educational feedback remains limited. In the current study, we constructed a large-scale dataset of Synthetic Computer science Assignments for LLM-generated Educational Feedback research (SCALEFeedback). We proposed a Sophisticated Assignment Mimicry (SAM) framework to generate the synthetic dataset by one-to-one LLM-based imitation from real assignment descriptions, student submissions to produce their synthetic versions. Our open-source dataset contains 10,000 synthetic student submissions spanning 155 assignments across 59 university-level computer science courses. Our synthetic submissions achieved BERTScore F1 0.84, PCC of 0.62 for assignment marks and 0.85 for length, compared to the corresponding real-world assignment dataset, while ensuring perfect protection of student private information. All these results of our SAM framework outperformed results of a naive mimicry method baseline. The LLM-generated feedback for our synthetic assignments demonstrated the same level of effectiveness compared to that of real-world assignment dataset. Our research showed that one-to-one LLM imitation is a promising method for generating open-source synthetic educational datasets that preserve the original dataset's semantic meaning and student data distribution, while protecting student privacy and institutional copyright. SCALEFeedback enhances our ability to develop LLM-based generalisable methods for offering high-quality, automated educational feedback in a scalable way.

Topik & Kata Kunci

Penulis (11)

K

Keyang Qian

K

Kaixun Yang

W

Wei Dai

F

Flora Jin

Y

Yixin Cheng

R

Rui Guan

S

Sadia Nawaz

Z

Zachari Swiecki

G

Guanliang Chen

L

Lixiang Yan

D

Dragan Gašević

Format Sitasi

Qian, K., Yang, K., Dai, W., Jin, F., Cheng, Y., Guan, R. et al. (2025). SCALEFeedback: A Large-Scale Dataset of Synthetic Computer Science Assignments for LLM-generated Educational Feedback Research. https://arxiv.org/abs/2508.05953

Akses Cepat

Lihat di Sumber
Informasi Jurnal
Tahun Terbit
2025
Bahasa
en
Sumber Database
arXiv
Akses
Open Access ✓