arXiv Open Access 2026

Transformer Architectures for Respiratory Sound Analysis and Multimodal Diagnosis

Theodore Aptekarev Vladimir Sokolovsky Gregory Furman

Abstract

Respiratory sound analysis is a crucial tool for screening asthma and other pulmonary pathologies, yet traditional auscultation remains subjective and experience-dependent. Our prior research established a CNN baseline using DenseNet201, which demonstrated high sensitivity in classifying respiratory sounds. In this work, we (i) adapt the Audio Spectrogram Transformer (AST) for respiratory sound analysis and (ii) evaluate a multimodal Vision-Language Model (VLM) that integrates spectrograms with structured patient metadata. AST is initialized from publicly available weights and fine-tuned on a medical dataset containing hundreds of recordings per diagnosis. The VLM experiment uses a compact Moondream-type model that processes spectrogram images alongside a structured text prompt (sex, age, recording site) to output a JSON-formatted diagnosis. Results indicate that AST achieves approximately 97% accuracy with an F1-score around 97% and ROC AUC of 0.98 for asthma detection, significantly outperforming both the internal CNN baseline and typical external benchmarks. The VLM reaches 86-87% accuracy, performing comparably to the CNN baseline while demonstrating the capability to integrate clinical context into the inference process. These results confirm the effectiveness of self-attention for acoustic screening and highlight the potential of multimodal architectures for holistic diagnostic tools.
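The multimodal pipeline described above can be sketched in outline: a structured text prompt is assembled from the patient metadata (sex, age, recording site) that accompanies the spectrogram image, and the model's JSON-formatted reply is parsed into a diagnosis label. This is a minimal illustration only; the prompt wording, field names, and `confidence` key are assumptions, not details taken from the paper.

```python
import json

def build_prompt(sex: str, age: int, site: str) -> str:
    """Assemble the structured clinical-context prompt that is passed to
    the VLM alongside the spectrogram image. Field names are illustrative
    assumptions, not the paper's actual prompt."""
    return (
        "Patient metadata:\n"
        f"- sex: {sex}\n"
        f"- age: {age}\n"
        f"- recording site: {site}\n"
        "Classify the attached respiratory-sound spectrogram. "
        'Reply with JSON: {"diagnosis": "<label>", "confidence": <0..1>}'
    )

def parse_reply(reply: str) -> dict:
    """Parse and minimally validate the model's JSON-formatted diagnosis."""
    out = json.loads(reply)
    if "diagnosis" not in out:
        raise ValueError("model reply is missing the 'diagnosis' field")
    return out

# Round trip with a mocked model reply (no model call is made here):
prompt = build_prompt("F", 9, "trachea")
reply = '{"diagnosis": "asthma", "confidence": 0.93}'
result = parse_reply(reply)
```

Emitting a fixed JSON schema rather than free text keeps the diagnosis machine-readable, so downstream screening logic can validate the reply instead of scraping prose.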



Citation

Aptekarev, T., Sokolovsky, V., Furman, G. (2026). Transformer Architectures for Respiratory Sound Analysis and Multimodal Diagnosis. https://arxiv.org/abs/2601.14227

Journal Information
Year Published: 2026
Language: en
Source Database: arXiv
Access: Open Access ✓