Abstract:
Phonetic variability has long been considered a confounding factor in emotional speech processing, so phonetic features have rarely been explored. Surprisingly, however, some features carrying purely phonetic information have shown state-of-the-art performance for continuous prediction of emotion dimensions (e.g., arousal and valence), and the underlying causes remain unknown to date. In this article, we present in-depth investigations of phonetic features on three widely used corpora (RECOLA, SEMAINE and USC CreativeIT), exploring this question from two perspectives: acoustic space partitioning information and phonetic content. First, comparisons of multiple partitioning methods confirm the significance of partitioning information in speech and reveal a new finding: varying the number of partitions has a greater effect on valence prediction than on arousal prediction. A detailed representation of the acoustic space is needed for valence, whilst a general one is adequate for arousal. Second, phoneme-specific examination of phonetic features suggests that phonetic content is less emotionally informative than partitioning information, and that it is more important for arousal than for valence. Furthermore, we propose a novel set of phonetically-aware acoustic features. Compared with conventional reference acoustic features, these attain significant improvements in valence prediction (in particular) and in arousal prediction on RECOLA, SEMAINE and CreativeIT.
Published in: IEEE Transactions on Affective Computing (Volume: 11, Issue: 4, 01 Oct.-Dec. 2020)