Hybrid ViT-RetinaNet with Explainable Ensemble Learning for Fine-Grained Vehicle Damage Classification
Abstract
Efficient and explainable vehicle damage inspection is essential due to the increasing complexity and volume of vehicular incidents. Traditional manual inspection is time-consuming and prone to human error, which leads to inefficiencies in insurance claims and repair workflows. Existing deep learning methods, such as CNNs, often struggle with generalization, require large annotated datasets, and lack interpretability. This study presents a robust and interpretable deep learning framework for vehicle damage classification that integrates Vision Transformers (ViTs) and ensemble detection strategies. The proposed architecture employs a RetinaNet backbone with a ViT-enhanced detection head, implemented in PyTorch using the Detectron2 object detection library. It is initialized with weights pretrained on COCO and fine-tuned with focal loss and aggressive augmentation to improve generalization under real-world damage variability. The proposed system applies the Weighted Box Fusion (WBF) ensemble strategy to refine detection outputs from multiple models, offering improved spatial precision. To ensure interpretability and transparency, we adopt several explainability techniques (Grad-CAM, Grad-CAM++, and SHAP) that offer semantic and visual insights into model decisions. A custom vehicle damage dataset of 4500 images has been built: approximately 60% are curated images collected through targeted web scraping and crawling, covering various damage types (such as bumper dents, panel scratches, and frontal impacts), and the remaining 40% are COCO dataset images included to support model generalization. Comparative evaluations show that Hybrid ViT-RetinaNet achieves superior performance, with an F1-score of 84.6%, mAP of 87.2%, and an inference speed of 22 FPS. In an ablation analysis, WBF, augmentation, transfer learning, and focal loss each significantly improve performance, with focal loss increasing F1 by 6.3% for underrepresented classes and COCO pretraining boosting mAP by 8.7%.
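To make the Weighted Box Fusion step more concrete, the following is a minimal single-class sketch of the general idea: overlapping boxes from several detectors are grouped by IoU, and each group is replaced by a confidence-weighted average box. It is not the authors' implementation. The function names, the `(x1, y1, x2, y2, score)` box layout, and the `iou_thr=0.55` default are assumptions made for illustration, and the sketch omits the per-class handling and model-count confidence rescaling that full WBF implementations typically include.

```python
from typing import List, Tuple

# Hypothetical box layout for this sketch: (x1, y1, x2, y2, score)
Box = Tuple[float, float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def fuse(cluster: List[Box]) -> Box:
    """Confidence-weighted average of coordinates; mean of the scores."""
    total = sum(b[4] for b in cluster)
    coords = [sum(b[i] * b[4] for b in cluster) / total for i in range(4)]
    return (coords[0], coords[1], coords[2], coords[3], total / len(cluster))


def weighted_box_fusion(boxes: List[Box], iou_thr: float = 0.55) -> List[Box]:
    """Fuse boxes pooled from several models (assumed to share one class)."""
    clusters: List[List[Box]] = []
    fused: List[Box] = []
    # Visit boxes from most to least confident.
    for box in sorted(boxes, key=lambda b: b[4], reverse=True):
        for idx, current in enumerate(fused):
            if iou(current, box) > iou_thr:
                # Box overlaps an existing fused box: join that cluster.
                clusters[idx].append(box)
                fused[idx] = fuse(clusters[idx])
                break
        else:
            # No sufficiently overlapping cluster: start a new one.
            clusters.append([box])
            fused.append(fuse([box]))
    return fused


if __name__ == "__main__":
    pooled = [
        (0.0, 0.0, 10.0, 10.0, 0.9),    # model A
        (1.0, 1.0, 11.0, 11.0, 0.6),    # model B, same object
        (50.0, 50.0, 60.0, 60.0, 0.8),  # model A, different object
    ]
    for fused_box in weighted_box_fusion(pooled):
        print(fused_box)
```

Unlike non-maximum suppression, which discards all but the top-scoring box in each group, this averaging uses every overlapping prediction, which is the usual motivation for the spatial-precision gain attributed to WBF.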
Additional architectural comparisons demonstrate that our full hybrid configuration not only maintains competitive accuracy but also achieves up to 150 FPS, making it well suited for real-time use cases. Robustness tests under challenging conditions, including real-world visual disturbances (smoke, fire, motion blur, varying lighting, and occlusions) and artificial noise (Gaussian and salt-and-pepper), confirm the model’s generalization ability. This work contributes a scalable, explainable, and high-performance solution for real-world vehicle damage diagnostics.
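The artificial-noise part of the robustness tests can be illustrated with a short sketch of how Gaussian and salt-and-pepper noise are commonly injected into an image before evaluation. This is an assumption-laden example, not the paper's protocol: the function names, the grayscale list-of-rows image representation, and the `sigma` and `amount` values are all hypothetical choices made for illustration.

```python
import random
from typing import List, Optional

Image = List[List[int]]  # grayscale image as rows of 0-255 intensities (assumed layout)


def add_gaussian_noise(image: Image, sigma: float = 10.0,
                       rng: Optional[random.Random] = None) -> Image:
    """Add zero-mean Gaussian noise to every pixel, then clip to [0, 255]."""
    if rng is None:
        rng = random.Random()
    return [
        [min(255, max(0, round(p + rng.gauss(0.0, sigma)))) for p in row]
        for row in image
    ]


def add_salt_and_pepper(image: Image, amount: float = 0.05,
                        rng: Optional[random.Random] = None) -> Image:
    """Set roughly `amount` of the pixels to pure black or pure white."""
    if rng is None:
        rng = random.Random()
    noisy = []
    for row in image:
        new_row = []
        for p in row:
            r = rng.random()
            if r < amount / 2:
                new_row.append(0)      # "pepper" pixel
            elif r < amount:
                new_row.append(255)    # "salt" pixel
            else:
                new_row.append(p)      # pixel left unchanged
        noisy.append(new_row)
    return noisy


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the corruption is reproducible
    clean = [[128] * 8 for _ in range(8)]
    print(add_gaussian_noise(clean, sigma=10.0, rng=rng)[0])
    print(add_salt_and_pepper(clean, amount=0.2, rng=rng)[0])
```

In a robustness evaluation, the same detector would be run on the clean and corrupted copies of each test image, and the drop in F1 or mAP would be reported per noise type and strength.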
Authors (5)
Ananya Saha
Mahir Afser Pavel
Md Fahim Shahoriar Titu
Afifa Zain Apurba
Riasat Khan
Quick Access
- Publication Year
- 2025
- Source Database
- DOAJ
- DOI
- 10.3390/vehicles7030089
- Access
- Open Access ✓