Abstract:
Vocal gestures play an important role in emotion expression and can be used by speech-based emotion recognition systems. This paper proposes the use of BLSTM neural networks to model salient variable-length phoneme sequences, which in turn can represent relevant vocal gestures. Unlike existing techniques, the proposed approach is not restricted to modelling phoneme sequences of a fixed length; both the salience and the optimal modelling length of phoneme sequences are learnt from the training data. Three possible phoneme representations that can be modelled by BLSTMs are compared, and experimental results suggest that sequences of Phone Log-Likelihood Ratios (PLLRs) are more representative of emotions than sequences of phoneme labels encoded as one-hot vectors. On the IEMOCAP database, the proposed approach achieves an Unweighted Average Recall (UAR) of 56.4% on a 4-class classification problem, an absolute improvement of 6.5% over the previous approach of modelling fixed-length phoneme sequences. The proposed linguistic system is complementary to acoustic features, with a fused system yielding a further absolute improvement of 5% in UAR.
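To make the core idea concrete, below is a minimal sketch (in PyTorch, not the authors' implementation) of how a bidirectional LSTM can consume variable-length sequences of phoneme-level feature vectors such as PLLRs and produce a 4-class emotion posterior. The feature dimension, hidden size, and final-state pooling are illustrative assumptions; the paper's actual architecture and hyperparameters are not specified in this abstract.

import torch
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence, pad_sequence

class BLSTMEmotionClassifier(nn.Module):
    """Illustrative BLSTM over variable-length phoneme-level feature
    sequences (e.g. PLLR vectors), pooled into a 4-class emotion output.
    Hyperparameters are assumptions, not taken from the paper."""
    def __init__(self, feat_dim=40, hidden_dim=128, num_classes=4):
        super().__init__()
        self.blstm = nn.LSTM(feat_dim, hidden_dim,
                             batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * hidden_dim, num_classes)

    def forward(self, padded_seqs, lengths):
        # Packing lets the BLSTM process each utterance at its true
        # length, so no fixed sequence length is imposed.
        packed = pack_padded_sequence(padded_seqs, lengths,
                                      batch_first=True,
                                      enforce_sorted=False)
        _, (h_n, _) = self.blstm(packed)
        # Concatenate the final forward and backward hidden states.
        h = torch.cat([h_n[0], h_n[1]], dim=1)
        return self.out(h)

# Two utterances of different lengths, each a sequence of
# 40-dimensional phoneme-level feature vectors (random stand-ins).
seqs = [torch.randn(12, 40), torch.randn(7, 40)]
batch = pad_sequence(seqs, batch_first=True)
logits = BLSTMEmotionClassifier()(batch, torch.tensor([12, 7]))
print(logits.shape)  # torch.Size([2, 4])

Packing and padding are what free the model from a fixed phoneme-sequence length: the recurrence simply stops at each utterance's true length, while one-hot phoneme labels or PLLR vectors can be swapped in as the per-step features without changing the architecture.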
Published in: 2017 Seventh International Conference on Affective Computing and Intelligent Interaction (ACII)
Date of Conference: 23-26 October 2017
Date Added to IEEE Xplore: 01 February 2018
Electronic ISSN: 2156-8111