Speech Emotion Classification using Raw Audio Input and Transcriptions

Published: 28 November 2018

Abstract

As voice-controlled devices become more widely accessible, not only the content of speech but also the way it is spoken grows in importance. Although many techniques have been developed to detect emotion in speech, none of them fully captures the speaker's true emotional state. This paper presents a neural network model that predicts emotions in conversations by analyzing both transcriptions and raw audio waveforms, focusing on feature extraction with convolutional layers and on feature combination. The model achieves an accuracy of over 71% across four classes: Anger, Happiness, Neutrality, and Sadness. We also analyze the effect of audio and textual features on the classification task by interpreting attention scores and parts of speech. The paper explores the use of raw audio waveforms, which, to the best of our knowledge, have not previously been studied in depth for emotion classification, and achieves results close to the state of the art.
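The page does not reproduce the paper's architecture, but the abstract outlines its shape: one convolutional branch over the raw waveform, another over the transcription, and a combination of the two feature sets for four-way classification. Below is a minimal PyTorch sketch of that idea; all layer sizes, the vocabulary size, and the simple concatenation fusion are illustrative assumptions, not the authors' exact design, and the paper's attention mechanism is omitted.

# Minimal sketch (not the authors' exact architecture) of the idea the
# abstract describes: convolutional feature extraction over the raw audio
# waveform and over the transcription, followed by feature combination
# and a four-way emotion classifier. All layer sizes, the vocabulary size,
# and the concatenation fusion are illustrative assumptions.
import torch
import torch.nn as nn

EMOTIONS = ["Anger", "Happiness", "Neutrality", "Sadness"]

class DualBranchEmotionClassifier(nn.Module):
    def __init__(self, vocab_size=10000, embed_dim=100, hidden=128):
        super().__init__()
        # Audio branch: sample-level 1-D convolutions applied directly to
        # the raw waveform, with no hand-crafted spectral features.
        self.audio = nn.Sequential(
            nn.Conv1d(1, 64, kernel_size=80, stride=4), nn.ReLU(),
            nn.MaxPool1d(4),
            nn.Conv1d(64, hidden, kernel_size=3), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),  # global pooling -> fixed-size vector
        )
        # Text branch: word embeddings plus a 1-D convolution over the
        # transcription, pooled to a fixed-size vector.
        self.embed = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.text = nn.Sequential(
            nn.Conv1d(embed_dim, hidden, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveMaxPool1d(1),
        )
        # Feature combination: concatenate both vectors, then classify.
        self.classifier = nn.Linear(2 * hidden, len(EMOTIONS))

    def forward(self, waveform, tokens):
        # waveform: (batch, samples) float; tokens: (batch, words) int
        a = self.audio(waveform.unsqueeze(1)).squeeze(-1)              # (batch, hidden)
        t = self.text(self.embed(tokens).transpose(1, 2)).squeeze(-1)  # (batch, hidden)
        return self.classifier(torch.cat([a, t], dim=-1))              # (batch, 4)

# One batch of 1-second 16 kHz waveforms and 20-token transcriptions
# produces a (batch, 4) tensor of class logits.
model = DualBranchEmotionClassifier()
logits = model(torch.randn(2, 16000), torch.randint(1, 10000, (2, 20)))
print(logits.shape)  # torch.Size([2, 4])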

Cited By

  • (2024) "Speech Emotion Classification Based on Dynamic Graph Attention Network," 2024 5th International Conference on Electronic Communication and Artificial Intelligence (ICECAI), pp. 328-331. DOI: 10.1109/ICECAI62591.2024.10675234. Online publication date: 31-May-2024.
  • (2023) "Risevi: A Disease Risk Prediction Model Based on Vision Transformer Applied to Nursing Homes," Electronics, vol. 12, no. 15, p. 3206. DOI: 10.3390/electronics12153206. Online publication date: 25-Jul-2023.

    Published In

    SPML '18: Proceedings of the 2018 International Conference on Signal Processing and Machine Learning
    November 2018, 177 pages
    ISBN: 9781450366052
    DOI: 10.1145/3297067

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. Convolutional Layers
    2. Emotion Classification
    3. Feature Extraction
    4. Neural Networks
    5. Signal Processing

    Qualifiers

    • Research-article
    • Research
    • Refereed limited

    Funding Sources

    • Korean MSIT
    • Institute for Information and communications Technology Promotion

    Conference

    SPML '18
