ABSTRACT
Voice-enabled communication is increasingly used in real-world applications, such as those involving conversational agents, or "chatbots". Chatbots can spark and sustain user engagement by effectively recognizing users' emotions and acting upon them. However, the majority of emotion recognition systems rely on rich spectrotemporal acoustic features. Beyond emotion-related information, such features tend to preserve information relevant to the identity of the speaker, raising major privacy concerns for users. This paper introduces two hybrid architectures for privacy-preserving emotion recognition from speech. These architectures rely on a Siamese neural network whose input and intermediate layers are transformed through various privacy-preserving operations, so as to retain emotion-dependent content while suppressing information related to the speaker's identity. The proposed approach is evaluated through emotion classification and speaker identification performance metrics. Results indicate that the proposed framework achieves up to 67.4% accuracy in classifying the happy, sad, frustrated, angry, neutral, and other emotion classes of the publicly available Interactive Emotional Dyadic Motion Capture (IEMOCAP) dataset. At the same time, the proposed approach reduces speaker identification accuracy to 50%, compared to the 81% achieved by a feedforward neural network trained solely on the speaker identification task using the same input features.
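The architectural details appear in the full text, but the core mechanism described above can be illustrated with a minimal sketch: a shared-weight (Siamese) encoder over acoustic feature vectors whose input is first perturbed by a fixed random projection, one of the multiplicative data-perturbation operations discussed in the cited work of Liu et al. (2005). All names, layer sizes, and the 88-dimensional feature input below are illustrative assumptions, not the authors' actual configuration.

```python
# Minimal PyTorch sketch of a Siamese encoder with a random-projection
# privacy transform on the input features. Shapes, sizes, and names are
# illustrative assumptions, not the paper's actual configuration.
import torch
import torch.nn as nn

FEAT_DIM = 88   # assumed acoustic feature dimensionality (e.g., openSMILE-style)
PROJ_DIM = 64   # assumed random-projection target dimensionality

class PrivacySiamese(nn.Module):
    def __init__(self, feat_dim=FEAT_DIM, proj_dim=PROJ_DIM, emb_dim=32):
        super().__init__()
        # Fixed (non-trainable) random projection: a multiplicative data
        # perturbation in the spirit of Liu et al. (2005).
        R = torch.randn(feat_dim, proj_dim) / proj_dim ** 0.5
        self.register_buffer("projection", R)
        # Shared encoder applied to both branches of the Siamese pair.
        self.encoder = nn.Sequential(
            nn.Linear(proj_dim, 64), nn.ReLU(),
            nn.Linear(64, emb_dim),
        )

    def embed(self, x):
        # Project features before any trainable layer sees them.
        return self.encoder(x @ self.projection)

    def forward(self, x1, x2):
        e1, e2 = self.embed(x1), self.embed(x2)
        # Distance between the pair's embeddings.
        return torch.pairwise_distance(e1, e2)

# Usage: pairwise distances for a batch of feature pairs.
x1, x2 = torch.randn(8, FEAT_DIM), torch.randn(8, FEAT_DIM)
model = PrivacySiamese()
print(model(x1, x2).shape)  # torch.Size([8])
```

In such a setup, a contrastive loss over the returned distances would pull same-emotion pairs together and push different-emotion pairs apart, while the fixed projection discards part of the speaker-specific information before training begins.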
REFERENCES
- Russell Beale and Christian Peter. 2008. Affect and Emotion in Human-Computer Interaction. Springer.
- Carlos Busso, Murtaza Bulut, Chi-Chun Lee, Abe Kazemzadeh, Emily Mower, Samuel Kim, Jeannette N. Chang, Sungbok Lee, and Shrikanth S. Narayanan. 2008. IEMOCAP: Interactive emotional dyadic motion capture database. Language Resources and Evaluation 42, 4 (2008), 335--359.
- Houwei Cao, David G. Cooper, Michael K. Keutmann, Ruben C. Gur, Ani Nenkova, and Ragini Verma. 2014. CREMA-D: Crowd-sourced emotional multimodal actors dataset. IEEE Transactions on Affective Computing 5, 4 (2014), 377--390.
- Farah Chenchah and Zied Lachiri. 2014. Speech emotion recognition in acted and spontaneous context. Procedia Computer Science 39 (2014), 139--145.
- Scot Cunningham and Traian Marius Truta. 2008. Protecting privacy in recorded conversations. (Mar 2008).
- Florian Eyben, Martin Wöllmer, and Björn Schuller. 2010. openSMILE: The Munich versatile and fast open-source audio feature extractor. In Proceedings of the 18th ACM International Conference on Multimedia. ACM, 1459--1462.
- Asbjørn Følstad and Petter Bae Brandtzæg. 2017. Chatbots and the new world of HCI. Interactions 24, 4 (2017), 38--42.
- Cornelius Glackin, Gerard Chollet, Nazim Dugan, Nigel Cannings, Julie Wall, Shahzaib Tahir, Indranil Ghosh Ray, and Muttukrishnan Rajarajan. 2017. Privacy preserving encrypted phonetic search of speech data. (Mar 2017).
- Mohammad Hadian, Thamer Altuwaiyan, Xiaohui Liang, and Wei Li. 2017. Privacy-preserving voice-based search over mHealth data. (Aug 2017).
- Jihun Hamm. 2017. Enhancing utility and privacy with noisy minimax filter. (Jun 2017).
- Daniel Jurafsky and James H. Martin. 2017. Dialog systems and chatbots. In Speech and Language Processing. Chapter 29.
- Nir Kshetri and Jeffrey Voas. 2018. Cyberthreats under the bed. Computer 51, 5 (2018), 92--95.
- Chi-Chun Lee, Emily Mower, Carlos Busso, Sungbok Lee, and Shrikanth Narayanan. 2011. Emotion recognition using a hierarchical binary decision tree approach. Speech Communication 53, 9--10 (2011), 1162--1171.
- Kun Liu, Hillol Kargupta, and Jessica Ryan. 2005. Random projection-based multiplicative data perturbation for privacy preserving distributed data mining. (Dec 2005).
- Steven R. Livingstone and Frank A. Russo. 2018. The Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS): A dynamic, multimodal set of facial and vocal expressions in North American English. PLoS ONE 13, 5 (2018), e0196391.
- Lingjuan Lyu, Xuanli He, Yee Wei Law, and Marimuthu Palaniswami. 2017. Privacy-preserving collaborative deep learning with application to human activity recognition. In Proceedings of the 2017 ACM on Conference on Information and Knowledge Management. ACM, 1219--1228.
- Lingjuan Lyu, Yee Wei Law, Sarah M. Erfani, Christopher Leckie, and Marimuthu Palaniswami. 2016. An improved scheme for privacy-preserving collaborative anomaly detection. (Apr 2016).
- Manas A. Pathak, Bhiksha Raj, Shantanu D. Rane, and Paris Smaragdis. 2013. Privacy preserving speech processing. (Feb 2013).
- Yogachandran Rahulamathavan and Muttukrishnan Rajarajan. 2015. Efficient privacy-preserving facial expression classification. (May 2015).
- Srinivasan Ramakrishnan and Ibrahiem M. M. El Emary. 2013. Speech emotion recognition approaches in human computer interaction. Telecommunication Systems 52, 3 (2013), 1467--1478.
- Latanya Sweeney. 2002. k-Anonymity: A model for protecting privacy. International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems 10, 5 (2002), 557--570.
- Dimitrios Ververidis and Constantine Kotropoulos. 2006. Emotional speech recognition: Resources, features, and methods. Speech Communication 48, 9 (2006), 1162--1181.
- Wikipedia. n.d. Secure Hash Algorithms. https://en.wikipedia.org/wiki/Secure_Hash_Algorithms.