arXiv Open Access 2026

LinguDistill: Recovering Linguistic Ability in Vision-Language Models via Selective Cross-Modal Distillation

Patrick Amadeus Irawan, Erland Hilman Fuadi, Shanu Kumar, Alham Fikri Aji, Yova Kementchedjhieva

Abstract

Adapting pretrained language models (LMs) into vision-language models (VLMs) can degrade their native linguistic capability due to representation shift and cross-modal interference introduced during multimodal adaptation. Such loss is difficult to recover, even with targeted task-specific fine-tuning using standard objectives. Prior recovery approaches typically introduce additional modules that act as intermediate alignment layers to maintain or isolate modality-specific subspaces, which increases architectural complexity, adds parameters at inference time, and limits flexibility across models and settings. We propose LinguDistill, an adapter-free distillation method that restores linguistic capability by utilizing the original frozen LM as a teacher. We overcome the key challenge of enabling vision-conditioned teacher supervision by introducing layer-wise KV-cache sharing, which exposes the teacher to the student's multimodal representations without modifying the architecture of either model. We then selectively distill the teacher's strong linguistic signal on language-intensive data to recover language capability, while preserving the student's visual grounding on multimodal tasks. As a result, LinguDistill recovers $\sim$10% of the performance lost on language and knowledge benchmarks, while maintaining comparable performance on vision-heavy tasks. Our findings demonstrate that linguistic capability can be recovered without additional modules, providing an efficient and practical solution to modality-specific degradation in multimodal models.
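The abstract's core mechanism — a frozen teacher LM attending over the student's keys and values, with a distillation loss applied only on language-intensive tokens — can be illustrated with a toy sketch. Everything below is a hypothetical stand-in, not the paper's implementation: the single-layer "models", weight matrices, and the `lang_mask` selection are invented for illustration; the real method operates layer-wise inside full transformers with data-level selection.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    # Scaled dot-product attention over a (possibly shared) KV cache.
    d = q.shape[-1]
    return softmax(q @ k.T / np.sqrt(d)) @ v

# Toy single-layer "student" and frozen "teacher" (hypothetical weights).
d_model, seq_len, vocab = 8, 5, 11
Wq_t, Wk_s, Wv_s, Wo_t = (rng.standard_normal((d_model, d_model)) for _ in range(4))
W_lm_t = rng.standard_normal((d_model, vocab))  # teacher LM head
W_lm_s = rng.standard_normal((d_model, vocab))  # student LM head

x = rng.standard_normal((seq_len, d_model))     # student multimodal hidden states

# KV-cache sharing: the teacher's queries attend over the STUDENT's keys
# and values, exposing the frozen teacher to the student's multimodal
# representations without modifying either architecture (toy analogue of
# the paper's layer-wise sharing).
k_shared, v_shared = x @ Wk_s, x @ Wv_s
teacher_hidden = attention(x @ Wq_t, k_shared, v_shared) @ Wo_t

teacher_logits = teacher_hidden @ W_lm_t
student_logits = x @ W_lm_s

# Selective distillation: KL(teacher || student) only on positions flagged
# as language-intensive (the mask here is arbitrary; the paper selects at
# the data level), leaving vision-grounded positions untouched.
lang_mask = np.array([1, 1, 0, 1, 0], dtype=bool)
p_t = softmax(teacher_logits[lang_mask])
log_p_s = np.log(softmax(student_logits[lang_mask]))
kd_loss = float(np.mean(np.sum(p_t * (np.log(p_t) - log_p_s), axis=-1)))
print(f"selective distillation loss: {kd_loss:.4f}")
```

Because the teacher is frozen and only reads the shared cache, no adapter parameters are introduced; the loss would be backpropagated through the student alone in actual training.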



Citation Format

Irawan, P. A., Fuadi, E. H., Kumar, S., Aji, A. F., & Kementchedjhieva, Y. (2026). LinguDistill: Recovering Linguistic Ability in Vision-Language Models via Selective Cross-Modal Distillation. https://arxiv.org/abs/2604.00829

Journal Information
Publication Year
2026
Language
en
Source Database
arXiv
Access
Open Access ✓