Integrating Self-supervised Speech Model with Pseudo Word-level Targets from Visually-grounded Speech Model
Abstract
Recent advances in self-supervised speech models have shown significant improvement on many downstream tasks. However, these models predominantly center on frame-level training objectives, which can fall short in spoken language understanding tasks that require semantic comprehension. Existing works often rely on additional speech-text data to build intermediate targets, which is costly in real-world settings. To address this challenge, we propose Pseudo-Word HuBERT (PW-HuBERT), a framework that integrates pseudo word-level targets into the training process, where the targets are derived from a visually-grounded speech model, notably eliminating the need for paired speech-text data. Our experimental results on four spoken language understanding (SLU) benchmarks suggest the superiority of our model in capturing semantic information.
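The core idea in the abstract — pooling frame-level encoder features over pseudo word segments and training them against word-level targets — can be sketched roughly as follows. This is an illustrative assumption of how such a loss might look, not the paper's exact method: the shapes, the mean-pooling choice, the linear head `W`, and the boundary/target values are all hypothetical stand-ins for what a visually-grounded speech model would provide.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: 50 frames of 16-dim features from a speech encoder.
frames = rng.normal(size=(50, 16))

# Pseudo word boundaries (start, end frame indices), as might be produced
# by a visually-grounded speech model; values here are illustrative.
boundaries = [(0, 12), (12, 30), (30, 50)]

# Pseudo word-level target IDs over a hypothetical vocabulary of 8 clusters.
targets = np.array([3, 1, 5])

# Toy linear classifier head mapping pooled features to cluster logits.
W = rng.normal(size=(16, 8)) * 0.1


def word_level_loss(frames, boundaries, targets, W):
    """Mean-pool the frames inside each pseudo word segment, then take
    cross-entropy between the segment's logits and its pseudo target."""
    losses = []
    for (s, e), t in zip(boundaries, targets):
        pooled = frames[s:e].mean(axis=0)  # (16,) segment representation
        logits = pooled @ W                # (8,) cluster scores
        # Numerically stable log-softmax.
        m = logits.max()
        logp = logits - (m + np.log(np.exp(logits - m).sum()))
        losses.append(-logp[t])
    return float(np.mean(losses))


loss = word_level_loss(frames, boundaries, targets, W)
print(loss)
```

In a real training loop this segment-level loss would be backpropagated through the speech encoder, which is the part the frame-level objectives alone do not provide.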
Authors (8)
Hung-Chieh Fang
Nai-Xuan Ye
Yi-Jen Shih
Puyuan Peng
Hsuan-Fu Wang
Layne Berry
Hung-yi Lee
David Harwath
Quick Access
- Publication Year
- 2024
- Language
- en
- Database Source
- arXiv
- Access
- Open Access ✓