arXiv Open Access 2023

Speech Emotion Recognition with Distilled Prosodic and Linguistic Affect Representations

Debaditya Shome, Ali Etemad

Abstract

We propose EmoDistill, a novel speech emotion recognition (SER) framework that uses cross-modal knowledge distillation during training to learn strong linguistic and prosodic representations of emotion from speech. During training, our method distills information at both the embedding and logit levels from a pair of pre-trained prosodic and linguistic teachers that are fine-tuned for SER. During inference, our method uses only a stream of speech signals to perform unimodal SER, which reduces computational overhead and avoids run-time transcription and prosodic feature extraction errors. Experiments on the IEMOCAP benchmark demonstrate that our method outperforms other unimodal and multimodal techniques by a considerable margin, achieving state-of-the-art performance of 77.49% unweighted accuracy and 78.91% weighted accuracy. Detailed ablation studies demonstrate the impact of each component of our method.
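The two-level distillation described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify the loss functions, so the temperature-scaled KL divergence for the logit level, the cosine distance for the embedding level, the weights `alpha` and `beta`, the use of one student embedding per teacher, and all function names are assumptions chosen for illustration.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax with an optional temperature."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(student_logits, label):
    """Supervised loss of the student against the ground-truth emotion label."""
    return -math.log(softmax(student_logits)[label])


def logit_kd(student_logits, teacher_logits, temperature):
    """Logit-level distillation: KL(teacher || student) on softened outputs.

    ASSUMPTION: temperature-scaled KL is a common choice, not confirmed by the abstract.
    """
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    kl = sum(pt * math.log(pt / ps) for pt, ps in zip(p_teacher, p_student) if pt > 0)
    return temperature ** 2 * kl


def embedding_kd(student_emb, teacher_emb):
    """Embedding-level distillation as cosine distance (1 - cosine similarity).

    ASSUMPTION: the distance function is illustrative only.
    """
    dot = sum(s * t for s, t in zip(student_emb, teacher_emb))
    norm_s = math.sqrt(sum(s * s for s in student_emb))
    norm_t = math.sqrt(sum(t * t for t in teacher_emb))
    return 1.0 - dot / (norm_s * norm_t)


def distillation_loss(student_logits, label, student_emb_prosodic, student_emb_linguistic,
                      prosodic_teacher, linguistic_teacher,
                      temperature=2.0, alpha=1.0, beta=1.0):
    """Combine the supervised loss with both distillation levels from both teachers.

    Each teacher is a dict with "logits" and "embedding" keys (hypothetical layout).
    """
    supervised = cross_entropy(student_logits, label)
    logit_term = (logit_kd(student_logits, prosodic_teacher["logits"], temperature)
                  + logit_kd(student_logits, linguistic_teacher["logits"], temperature))
    embedding_term = (embedding_kd(student_emb_prosodic, prosodic_teacher["embedding"])
                      + embedding_kd(student_emb_linguistic, linguistic_teacher["embedding"]))
    return supervised + alpha * logit_term + beta * embedding_term
```

In this sketch the teachers appear only inside the training loss. At inference time only the student's logits are needed, which matches the abstract's point that no transcription or prosodic feature extraction is required at run time.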

Authors (2)

Debaditya Shome

Ali Etemad

Citation Format

Shome, D., Etemad, A. (2023). Speech Emotion Recognition with Distilled Prosodic and Linguistic Affect Representations. https://arxiv.org/abs/2309.04849

Journal Information
Publication Year: 2023
Language: en
Source Database: arXiv
Access: Open Access ✓