arXiv Open Access 2024

Integrating Self-supervised Speech Model with Pseudo Word-level Targets from Visually-grounded Speech Model

Hung-Chieh Fang · Nai-Xuan Ye · Yi-Jen Shih · Puyuan Peng · Hsuan-Fu Wang +3 others

Abstract

Recent advances in self-supervised speech models have shown significant improvements in many downstream tasks. However, these models have predominantly centered on frame-level training objectives, which can fall short in spoken language understanding tasks that require semantic comprehension. Existing works often rely on additional speech-text data as intermediate targets, which is costly in real-world settings. To address this challenge, we propose Pseudo-Word HuBERT (PW-HuBERT), a framework that integrates pseudo word-level targets into the training process, where the targets are derived from a visually-grounded speech model, notably eliminating the need for speech-text paired data. Our experimental results on four spoken language understanding (SLU) benchmarks suggest the superiority of our model in capturing semantic information.
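To make the idea concrete, the following is a minimal sketch (an illustration, not the authors' exact procedure) of how pseudo word-level targets could be derived without speech-text pairs: word boundaries produced by a visually-grounded speech model segment the frame sequence, each segment's mean feature is mapped to its nearest codebook centroid, and that cluster ID is broadcast to every frame in the segment as the training target. The function name, the codebook, and the boundary format are all hypothetical.

```python
# Hypothetical sketch of pseudo word-level target assignment.
# Assumptions (not from the paper): boundaries come from a visually-grounded
# speech model, and the codebook centroids come from clustering segment features.
from typing import List, Tuple


def pseudo_word_targets(
    frames: List[List[float]],          # frame-level features, shape (T, D)
    boundaries: List[Tuple[int, int]],  # word segments [(start, end), ...], end exclusive
    centroids: List[List[float]],       # K codebook centroids, shape (K, D)
) -> List[int]:
    """Return one pseudo word-level target ID per frame."""

    def mean_pool(segment: List[List[float]]) -> List[float]:
        # Average the frame features inside one word segment.
        dim = len(segment[0])
        return [sum(f[i] for f in segment) / len(segment) for i in range(dim)]

    def nearest(vec: List[float]) -> int:
        # Index of the centroid with the smallest squared distance.
        dists = [sum((a - b) ** 2 for a, b in zip(vec, c)) for c in centroids]
        return dists.index(min(dists))

    targets = [0] * len(frames)
    for start, end in boundaries:
        cluster_id = nearest(mean_pool(frames[start:end]))
        for t in range(start, end):
            targets[t] = cluster_id  # all frames in a word share one target
    return targets
```

For example, with two one-dimensional segments near 0.0 and 1.0 and centroids [0.0] and [1.0], the first segment's frames all receive target 0 and the second segment's frames all receive target 1, giving the word-level (rather than frame-level) supervision the abstract describes.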


Authors (8)

Hung-Chieh Fang
Nai-Xuan Ye
Yi-Jen Shih
Puyuan Peng
Hsuan-Fu Wang
Layne Berry
Hung-yi Lee
David Harwath

Citation Format

Fang, H., Ye, N., Shih, Y., Peng, P., Wang, H., Berry, L. et al. (2024). Integrating Self-supervised Speech Model with Pseudo Word-level Targets from Visually-grounded Speech Model. https://arxiv.org/abs/2402.05819

Journal Information
Publication Year: 2024
Language: en
Database Source: arXiv
Access: Open Access ✓