Traditional adversarial learning (AL) algorithms learn a speaker-independent embedding from low-level audio features. This paper introduces discriminative adversarial learning (DAL), which learns a discriminative speaker-independent embedding from low-level audio features, such as mel-frequency cepstral coefficients (MFCC), and high-level audio features, such as the Interspeech 2010 Paralinguistics Challenge feature set. To this end, DAL jointly minimizes triplet and cross-entropy losses with a gradient reversal strategy for speaker-independent emotion recognition (SIER). The triplet loss reduces intra-class and increases inter-class embedding distances to improve the discriminativeness of the embedding, the cross-entropy loss determines the emotion or speaker class of the embedding, and gradient reversal learns a speaker-independent embedding for SIER. Experiments on the Emo-DB and RAVDESS datasets show that DAL outperforms other traditional adversarial learning (AL) algorithms.
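The joint objective described above can be sketched as follows. This is a minimal, framework-free illustration of the three ingredients (triplet loss, cross-entropy, and gradient reversal), not the paper's implementation: the Euclidean distance, margin, and example values are assumptions for illustration only.

```python
import numpy as np

def euclidean(a, b):
    return np.linalg.norm(a - b)

def triplet_loss(anchor, positive, negative, margin=1.0):
    # Pulls same-class (e.g. same-emotion) embeddings together and pushes
    # different-class embeddings apart by at least `margin`.
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)

def cross_entropy(logits, label):
    # Softmax cross-entropy for a classifier head (emotion or speaker).
    z = logits - logits.max()  # numerical stability
    log_probs = z - np.log(np.exp(z).sum())
    return -log_probs[label]

def grad_reversal_backward(grad, lam=1.0):
    # Gradient reversal: the forward pass is the identity; in the backward
    # pass the speaker classifier's gradient is multiplied by -lambda, so
    # the encoder is pushed to *remove* speaker information.
    return -lam * grad

# Illustrative joint loss on toy embeddings (values are made up):
anchor = np.array([0.0, 1.0])
positive = np.array([0.1, 0.9])
negative = np.array([1.0, 0.0])
emotion_logits = np.array([2.0, 0.5, 0.1])
total_loss = triplet_loss(anchor, positive, negative) + cross_entropy(emotion_logits, 0)
```

In a full training loop the gradient reversal would sit between the shared encoder and the speaker classifier, so minimizing the speaker classifier's cross-entropy simultaneously maximizes the encoder's speaker confusion.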
Cite as: Kasun, C., Ahn, C.S., Rajapakse, J., Lin, Z., Huang, G.-B. (2022) Discriminative Adversarial Learning for Speaker Independent Emotion Recognition. Proc. Interspeech 2022, 4975-4979, doi: 10.21437/Interspeech.2022-285
@inproceedings{kasun22_interspeech,
  author={Chamara Kasun and Chung Soo Ahn and Jagath Rajapakse and Zhiping Lin and Guang-Bin Huang},
  title={{Discriminative Adversarial Learning for Speaker Independent Emotion Recognition}},
  year=2022,
  booktitle={Proc. Interspeech 2022},
  pages={4975--4979},
  doi={10.21437/Interspeech.2022-285}
}