Speech Emotion Classification using Raw Audio Input and Transcriptions

Published: 28 November 2018

Abstract

As voice-controlled devices become more widely accessible, not only the content of speech but also the way it is spoken grows in importance. Although many techniques have been developed to detect emotion in speech, none of them fully captures the speaker's true emotional state. This paper presents a neural network model that predicts emotions in conversations by analyzing both transcriptions and raw audio waveforms, focusing on feature extraction with convolutional layers and on feature combination. The model achieves an accuracy of over 71% across four classes: Anger, Happiness, Neutrality, and Sadness. We also analyze the effect of audio and textual features on the classification task by interpreting attention scores and parts of speech. The paper explores the use of raw audio waveforms, which, to the best of our knowledge, have not previously been studied in depth for emotion classification, and achieves results close to the state of the art.
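The page does not reproduce the paper's architecture, but the abstract outlines its shape: one convolutional branch over the raw waveform, another over the transcription, and a combination of the two feature sets for four-way classification. Below is a minimal PyTorch sketch of that idea; all layer sizes, the vocabulary size, and the simple concatenation fusion are illustrative assumptions, not the authors' exact design, and the paper's attention mechanism is omitted.

# Minimal sketch (not the authors' exact architecture) of the idea the
# abstract describes: convolutional feature extraction over the raw audio
# waveform and over the transcription, followed by feature combination
# and a four-way emotion classifier. All layer sizes, the vocabulary size,
# and the concatenation fusion are illustrative assumptions.
import torch
import torch.nn as nn

EMOTIONS = ["Anger", "Happiness", "Neutrality", "Sadness"]

class DualBranchEmotionClassifier(nn.Module):
    def __init__(self, vocab_size=10000, embed_dim=100, hidden=128):
        super().__init__()
        # Audio branch: sample-level 1-D convolutions applied directly to
        # the raw waveform, with no hand-crafted spectral features.
        self.audio = nn.Sequential(
            nn.Conv1d(1, 64, kernel_size=80, stride=4), nn.ReLU(),
            nn.MaxPool1d(4),
            nn.Conv1d(64, hidden, kernel_size=3), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),  # global pooling -> fixed-size vector
        )
        # Text branch: word embeddings plus a 1-D convolution over the
        # transcription, pooled to a fixed-size vector.
        self.embed = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.text = nn.Sequential(
            nn.Conv1d(embed_dim, hidden, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveMaxPool1d(1),
        )
        # Feature combination: concatenate both vectors, then classify.
        self.classifier = nn.Linear(2 * hidden, len(EMOTIONS))

    def forward(self, waveform, tokens):
        # waveform: (batch, samples) float; tokens: (batch, words) int
        a = self.audio(waveform.unsqueeze(1)).squeeze(-1)              # (batch, hidden)
        t = self.text(self.embed(tokens).transpose(1, 2)).squeeze(-1)  # (batch, hidden)
        return self.classifier(torch.cat([a, t], dim=-1))              # (batch, 4)

# One batch of 1-second 16 kHz waveforms and 20-token transcriptions
# produces a (batch, 4) tensor of class logits.
model = DualBranchEmotionClassifier()
logits = model(torch.randn(2, 16000), torch.randint(1, 10000, (2, 20)))
print(logits.shape)  # torch.Size([2, 4])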

Cited By

  • (2024) "Speech Emotion Classification Based on Dynamic Graph Attention Network," 2024 5th International Conference on Electronic Communication and Artificial Intelligence (ICECAI), pp. 328-331. DOI: 10.1109/ICECAI62591.2024.10675234. Online publication date: 31-May-2024.
  • (2023) "Risevi: A Disease Risk Prediction Model Based on Vision Transformer Applied to Nursing Homes," Electronics, vol. 12, no. 15, p. 3206. DOI: 10.3390/electronics12153206. Online publication date: 25-Jul-2023.

    Published In

    SPML '18: Proceedings of the 2018 International Conference on Signal Processing and Machine Learning
    November 2018, 177 pages
    ISBN: 9781450366052
    DOI: 10.1145/3297067

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. Convolutional Layers
    2. Emotion Classification
    3. Feature Extraction
    4. Neural Networks
    5. Signal Processing

    Qualifiers

    • Research-article
    • Research
    • Refereed limited

    Funding Sources

    • Korean MSIT
    • Institute for Information and communications Technology Promotion

    Conference

    SPML '18
