Monash University
Advancing Speech Emotion Recognition with Interpretable Neural Networks and Self-Supervised Paralinguistic Representations

thesis
posted on 2025-05-21, 06:54 authored by Linh Ngoc Vu
This research focuses on novel approaches for speech-based emotion recognition (SER). SER technologies can be applied in various contexts, such as assessing customer satisfaction in call centers, tracking personal moods, and monitoring emotions in healthcare settings. Numerous machine learning methods have been proposed, ranging from traditional feature-based models to end-to-end interpretable neural networks and self-supervised learning techniques. These methods have produced explainable representations and identifiable features related to vocal cues, which we refer to as paralinguistic representations (i.e., representations of information beyond the linguistic content of speech). By incorporating a pre-trained paralinguistic representation, our method achieved accuracy comparable to state-of-the-art techniques while maintaining high efficiency. A detailed analysis of errors and metadata indicated that our proposed method reduces gender bias and generalizes well to unseen speakers and to spontaneous emotions, extending beyond recordings of scripted utterances.

Campus location

Malaysia

Principal supervisor

Lim Wern Han

Additional supervisor 1

Prof. Raphael Phan

Additional supervisor 2

Prof. Dinh Phung

Year of Award

2025

Department, School or Centre

School of Information Technology (Monash University Malaysia)

Course

Doctor of Philosophy

Degree Type

DOCTORATE

Faculty

Faculty of Information Technology