DOI: 10.1145/3319921.3319963
Research article

End-to-End Speech Emotion Recognition Based on One-Dimensional Convolutional Neural Network

Published: 15 March 2019

Abstract

Real-time speech emotion recognition has long been a challenge. To address it, we propose an end-to-end speech emotion recognition model based on a one-dimensional convolutional neural network, containing only three convolutional layers, two pooling layers, and one fully connected layer. Trained with the Adam optimization algorithm via back-propagation, the network continuously extracts increasingly discriminative features. The model is structurally simple and completes the emotion classification task quickly. Unlike traditional methods, it requires no complex manual feature extraction; it learns emotional features automatically from raw speech signals. In emotion recognition experiments on four speech databases (EMODB, CASIA, IEMOCAP, and CHEAVD), relatively high recognition rates were obtained. The experiments show that the proposed algorithm is well suited to implementing real-time speech emotion recognition.
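The abstract describes the network only at a high level: three convolutional layers, two pooling layers, and one fully connected layer applied directly to raw speech. The NumPy sketch below illustrates how a forward pass of such a 1D CNN could be shaped; every kernel width, channel count, pooling size, and the four-class output are illustrative assumptions, not the paper's actual hyperparameters.

```python
import numpy as np

def conv1d(x, w, stride=1):
    """Valid 1D convolution with ReLU. x: (length, in_ch), w: (kernel, in_ch, out_ch)."""
    k, _, out_ch = w.shape
    out_len = (x.shape[0] - k) // stride + 1
    y = np.zeros((out_len, out_ch))
    for i in range(out_len):
        seg = x[i * stride : i * stride + k]                     # (k, in_ch) window
        y[i] = np.maximum(np.tensordot(seg, w, axes=([0, 1], [0, 1])), 0)
    return y

def maxpool1d(x, size):
    """Non-overlapping max pooling along the time axis."""
    n = x.shape[0] // size
    return x[: n * size].reshape(n, size, -1).max(axis=1)

rng = np.random.default_rng(0)
x = rng.standard_normal((16000, 1))                      # 1 s of raw speech at 16 kHz, mono

x = conv1d(x, rng.standard_normal((64, 1, 8)) * 0.01)    # conv layer 1
x = maxpool1d(x, 4)                                      # pool layer 1
x = conv1d(x, rng.standard_normal((32, 8, 16)) * 0.01)   # conv layer 2
x = maxpool1d(x, 4)                                      # pool layer 2
x = conv1d(x, rng.standard_normal((16, 16, 32)) * 0.01)  # conv layer 3

flat = x.reshape(-1)                                     # flatten for the dense layer
logits = flat @ (rng.standard_normal((flat.size, 4)) * 0.001)  # fully connected layer
probs = np.exp(logits - logits.max())
probs /= probs.sum()                                     # softmax over 4 emotion classes
print(probs.shape)                                       # (4,)
```

In a trained version of this sketch, the convolution weights and the dense layer would be updated by back-propagation with Adam, which is the training setup the abstract names.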




    Published In

    ICIAI '19: Proceedings of the 2019 3rd International Conference on Innovation in Artificial Intelligence
    March 2019
    279 pages
    ISBN:9781450361286
    DOI:10.1145/3319921
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

    In-Cooperation

    • Xi'an Jiaotong-Liverpool University
    • University of Texas-Dallas

    Publisher

    Association for Computing Machinery

    New York, NY, United States



    Author Tags

    1. Convolutional Neural Network
    2. End-to-End
    3. Speech Emotion Recognition

    Qualifiers

    • Research-article
    • Research
    • Refereed limited

    Funding Sources

    • the National Natural Science Foundation of China
    • Program for Liaoning Distinguished Professor, the 13th Five-Year Plan of Education Science in Liaoning Province
    • Program for Changjiang Scholars and Innovative Research Team in University
    • Program for Dalian High-level Talent's Innovation
    • The Liaoning Province Doctor Startup Fund
    • Innovation Fund Plan for Dalian Science and Technology

    Conference

    ICIAI 2019


    Cited By

    • (2024) TLBT-Net: A Multi-scale Cross-fusion Model for Speech Emotion Recognition. Proceedings of the International Conference on Modeling, Natural Language Processing and Machine Learning, 245-250. DOI: 10.1145/3677779.3677819. Online publication date: 17 May 2024.
    • (2024) Emotion and Sentiment Analysis in Dialogue: A Multimodal Strategy Employing the BERT Model. 2024 Parul International Conference on Engineering and Technology (PICET), 1-7. DOI: 10.1109/PICET60765.2024.10716061. Online publication date: 3 May 2024.
    • (2024) Speech Emotion Classification Based on Dynamic Graph Attention Network. 2024 5th International Conference on Electronic Communication and Artificial Intelligence (ICECAI), 328-331. DOI: 10.1109/ICECAI62591.2024.10675234. Online publication date: 31 May 2024.
    • (2024) Self-supervised Learning for Speech Emotion Recognition Task Using Audio-visual Features and Distil Hubert Model on BAVED and RAVDESS Databases. Journal of Systems Science and Systems Engineering, 33(5), 576-606. DOI: 10.1007/s11518-024-5607-y. Online publication date: 29 May 2024.
    • (2023) GCF2-Net: global-aware cross-modal feature fusion network for speech emotion recognition. Frontiers in Neuroscience, 17. DOI: 10.3389/fnins.2023.1183132. Online publication date: 4 May 2023.
    • (2023) Speech Emotion Recognition Using Global-Aware Cross-Modal Feature Fusion Network. Advanced Intelligent Computing Technology and Applications, 211-221. DOI: 10.1007/978-981-99-4742-3_17. Online publication date: 30 Jul 2023.
    • (2023) Improvement of Speech Emotion Recognition by Deep Convolutional Neural Network and Speech Features. Third Congress on Intelligent Systems, 117-129. DOI: 10.1007/978-981-19-9225-4_10. Online publication date: 12 Mar 2023.
    • (2022) Feature-enhanced embedding learning for heterogeneous collaborative filtering. Neural Computing and Applications, 34(21), 18741-18756. DOI: 10.1007/s00521-022-07490-0. Online publication date: 1 Nov 2022.
    • (2021) Improved Speech Emotion Recognition using Transfer Learning and Spectrogram Augmentation. Proceedings of the 2021 International Conference on Multimodal Interaction, 645-652. DOI: 10.1145/3462244.3481003. Online publication date: 18 Oct 2021.
    • (2021) Multimodal Emotion Recognition Fusion Analysis Adapting BERT With Heterogeneous Feature Unification. IEEE Access, 9, 94557-94572. DOI: 10.1109/ACCESS.2021.3092735. Online publication date: 2021.
