Abstract:
Studies have shown that emotional variability in speech degrades the performance of speaker recognition tasks. Of particular interest is the error produced by the mismatch between training speaker recognition models with neutral speech and testing them with expressive speech. While previous studies have considered categorical emotions, expressive speech during human interaction conveys subtle behaviors that are better characterized with continuous descriptors (e.g., attributes such as arousal, valence, and dominance). As the emotion becomes more intense, we expect the performance of speaker recognition tasks to drop. Can we define emotional regions for which speaker recognition performance is expected to be reliable? This study focuses on automatically predicting reliable regions for speaker recognition by analyzing and predicting the emotional content. We collected a unique emotional database from 80 speakers. We estimate speaker recognition performance as a function of arousal and valence, creating regions in this space where the identity of a speaker can be reliably recognized. Then, we train speech emotion recognizers designed to predict whether the emotional content of a sentence falls within the reliable region. The experimental evaluation demonstrates that sentences classified as reliable for speaker recognition have a lower equal error rate (EER) than sentences considered unreliable.
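The abstract hinges on two quantities: the equal error rate (EER) of a speaker verification system, and a reliability check in the arousal-valence plane. The sketch below makes both concrete, but it is not the authors' implementation: the EER routine is the standard threshold-sweep definition, while in_reliable_region, its neutral point, and its radius are hypothetical placeholders (the paper derives its regions empirically from EER measurements rather than fixing them by rule).

```python
import numpy as np

def equal_error_rate(scores, labels):
    """Standard EER: the operating point where the false-accept rate (FAR)
    and false-reject rate (FRR) of a verification system are equal."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)
    target = scores[labels == 1]      # same-speaker trials
    impostor = scores[labels == 0]    # impostor trials
    best_far, best_frr = 1.0, 0.0
    for t in np.sort(np.unique(scores)):
        far = np.mean(impostor >= t)  # impostors wrongly accepted
        frr = np.mean(target < t)     # targets wrongly rejected
        if abs(far - frr) < abs(best_far - best_frr):
            best_far, best_frr = far, frr
    return (best_far + best_frr) / 2.0

def in_reliable_region(arousal, valence, neutral=(3.0, 3.0), radius=1.0):
    """Hypothetical rule of thumb: speech close to the neutral point of the
    arousal-valence plane (1-5 attribute scales assumed) is treated as
    reliable for speaker recognition; the bounds here are illustrative."""
    return np.hypot(arousal - neutral[0], valence - neutral[1]) <= radius

# Toy usage on synthetic verification scores.
rng = np.random.default_rng(0)
scores = np.concatenate([rng.normal(1.0, 1.0, 500), rng.normal(-1.0, 1.0, 500)])
labels = np.concatenate([np.ones(500, int), np.zeros(500, int)])
print(f"overall EER: {equal_error_rate(scores, labels):.3f}")
print(f"near-neutral sentence reliable? {in_reliable_region(3.2, 2.8)}")
```

In the paper's setup, the study's finding amounts to the claim that equal_error_rate computed only over sentences the region classifier flags as reliable is lower than over the sentences it flags as unreliable.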
Published in: 2017 Seventh International Conference on Affective Computing and Intelligent Interaction (ACII)
Date of Conference: 23-26 October 2017
Date Added to IEEE Xplore: 01 February 2018
Electronic ISSN: 2156-8111