A Multi-Objective Statistical Framework for Evaluating LLM-Based Code Modernization: Transformation Pattern Analysis and Effect Size Validation
Abstract
Automated legacy code modernization using Large Language Models (LLMs) lacks rigorous evaluation frameworks and multi-objective quality assessment methodologies. Existing research suffers from three critical deficiencies: single-metric evaluation paradigms that create pathological optimization incentives, statistical validation limited to p-values without effect size analysis, and the absence of systematic transformation pattern taxonomies explaining what works and why. We present a novel multi-objective statistical framework that jointly assesses Cyclomatic Complexity (CC) and Maintainability Index (MI) while providing comprehensive effect size analysis, addressing gaps in software engineering research. Applied to 47 legacy Java samples from Apache Ant (version 1.10.x, commit rel/1.10.14), our framework achieves a 97.9% metric-level improvement rate with very large practical effects for maintainability (Cohen's d = 1.86, 95% CI [1.36, 2.35], p < 0.0001), substantially exceeding prior work and conventional significance thresholds. We note that this success rate reflects quality metric improvement; functional equivalence was verified through syntactic validation and manual inspection of a 20% random sample, while comprehensive automated test-based verification remains a limitation to be addressed in future work.
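The effect size statistic reported above can be illustrated with a minimal sketch. The abstract does not state which variant of Cohen's d or which interval method was used, so this assumes the paired-samples form (mean of the before/after differences divided by their standard deviation) and a common large-sample normal approximation for the confidence interval. The function names and the sample Maintainability Index values are hypothetical and are not taken from the paper.

```python
import math
from statistics import NormalDist, mean, stdev


def paired_cohens_d(before, after):
    """Cohen's d for paired samples: mean difference / SD of differences.

    Assumption: the paired form is used here for illustration only; the
    abstract does not say which variant the authors computed.
    """
    if len(before) != len(after) or len(before) < 2:
        raise ValueError("need two equal-length samples with at least 2 pairs")
    diffs = [a - b for b, a in zip(before, after)]
    sd = stdev(diffs)  # sample standard deviation (n - 1 denominator)
    if sd == 0:
        raise ValueError("differences have zero variance; d is undefined")
    return mean(diffs) / sd


def approx_ci(d, n, level=0.95):
    """Approximate CI for paired d using SE = sqrt(1/n + d^2 / (2n)).

    This is a large-sample normal approximation, not an exact interval.
    """
    z = NormalDist().inv_cdf(0.5 + level / 2)
    se = math.sqrt(1 / n + d * d / (2 * n))
    return d - z * se, d + z * se


if __name__ == "__main__":
    # Made-up MI values for five code samples, before and after modernization.
    mi_before = [52.0, 48.5, 60.1, 55.3, 47.9]
    mi_after = [61.2, 59.0, 66.4, 63.8, 58.1]
    d = paired_cohens_d(mi_before, mi_after)
    low, high = approx_ci(d, len(mi_before))
    print(f"d = {d:.2f}, approx. 95% CI [{low:.2f}, {high:.2f}]")
```

Reporting the interval alongside d shows how precisely the effect is estimated, which a p-value alone does not convey.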
We contribute: (1) the first multi-objective quality assessment framework for code modernization, with weighted composite scoring and sensitivity analysis; (2) a rigorous statistical methodology with effect size analysis beyond p-values; (3) a systematic transformation pattern taxonomy identifying four successful patterns and three failure modes with predictive value (inter-rater agreement κ = 0.82); and (4) a negative result showing that iterative refinement provides no benefit (d = 0.08, p = 0.179), saving community resources. Our transformation taxonomy enables practitioners to predict the likelihood of success from code characteristics, while our statistical framework provides a replicable methodology for evaluating LLM-based software engineering tools. The very large effect size indicates that the metric-level improvements are materially meaningful for real-world software maintenance, not merely statistically detectable.
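To make the idea of weighted composite scoring with sensitivity analysis concrete, here is one possible sketch. The abstract does not give the scoring formula, the weights, or the normalization, so everything below is an assumption: the composite is taken as a weighted sum of the relative CC reduction and the relative MI gain, and the sensitivity check simply sweeps the CC weight. `MetricPair`, `composite_score`, and `weight_sensitivity` are hypothetical names, and the sample values are invented.

```python
from dataclasses import dataclass


@dataclass
class MetricPair:
    """CC and MI for one code sample, before and after modernization."""
    cc_before: float
    cc_after: float
    mi_before: float
    mi_after: float


def composite_score(m, w_cc=0.5):
    """Hypothetical composite: w_cc * CC reduction + (1 - w_cc) * MI gain.

    Both terms are relative changes, so a positive score means net improvement.
    """
    if not 0.0 <= w_cc <= 1.0:
        raise ValueError("w_cc must lie in [0, 1]")
    if m.cc_before <= 0 or m.mi_before <= 0:
        raise ValueError("baseline metrics must be positive")
    cc_gain = (m.cc_before - m.cc_after) / m.cc_before  # lower CC is better
    mi_gain = (m.mi_after - m.mi_before) / m.mi_before  # higher MI is better
    return w_cc * cc_gain + (1.0 - w_cc) * mi_gain


def weight_sensitivity(samples, weights=(0.0, 0.25, 0.5, 0.75, 1.0)):
    """Share of samples with a positive composite score at each CC weight."""
    return {
        w: sum(composite_score(s, w) > 0 for s in samples) / len(samples)
        for w in weights
    }


if __name__ == "__main__":
    # Invented values: the second sample trades higher CC for higher MI.
    samples = [
        MetricPair(cc_before=10, cc_after=5, mi_before=50, mi_after=60),
        MetricPair(cc_before=10, cc_after=12, mi_before=50, mi_after=60),
    ]
    for w, share in weight_sensitivity(samples).items():
        print(f"w_cc = {w:.2f}: {share:.0%} of samples improved")
```

If the share of improved samples stays stable across the sweep, the conclusion does not hinge on one particular weighting, which is the purpose of a sensitivity analysis.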
Authors (1)
Bashair Althani
Quick Access
- Publication year: 2026
- Source database: DOAJ
- DOI: 10.3390/computers15030148
- Access: Open Access