Augmented Audio Data in Improving Speech Emotion Classification Tasks

  • Conference paper
  • In: Advances and Trends in Artificial Intelligence. From Theory to Practice (IEA/AIE 2021)

Abstract

Classifying emotions from audio or speech signals requires large quantities of data to achieve high performance and classification accuracy. Large datasets, however, are not always readily available. A practical solution is to augment the available data to construct a larger dataset for training the classifier. This paper proposes a unimodal approach built on two main ideas: (1) augmenting speech signals to generate additional data samples, and (2) constructing classification models to identify the emotion expressed in speech. Three classifiers, a Convolutional Neural Network (CNN), Naïve Bayes (NB), and K-Nearest Neighbor (kNN), were compared to determine which performed best. In the proposed method, training used augmented audio data derived from the SAVEE dataset (50% of the data), while testing (the remaining 50%) used only the original recordings. The best performance, approximately 83%, was obtained with a mixture of augmentation strategies and the CNN classifier. The proposed augmentation approach, combined with an appropriate classification model, improves the effectiveness of speech emotion recognition.
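
The paper's implementation is not published here, so the following is a minimal sketch of common waveform-level augmentation strategies (noise injection, time stretching, pitch shifting) of the kind the abstract describes, written in Python with librosa and NumPy. The function names, parameter values, and file path are illustrative assumptions, not the authors' code.

```python
import numpy as np
import librosa

def add_noise(y: np.ndarray, noise_factor: float = 0.005) -> np.ndarray:
    """Inject Gaussian white noise into the waveform."""
    return y + noise_factor * np.random.randn(len(y))

def time_stretch(y: np.ndarray, rate: float = 1.1) -> np.ndarray:
    """Speed the signal up (rate > 1) or slow it down (rate < 1)."""
    return librosa.effects.time_stretch(y, rate=rate)

def pitch_shift(y: np.ndarray, sr: int, n_steps: float = 2.0) -> np.ndarray:
    """Shift pitch by n_steps semitones without changing duration."""
    return librosa.effects.pitch_shift(y, sr=sr, n_steps=n_steps)

if __name__ == "__main__":
    # Hypothetical SAVEE-style clip; the path is a placeholder.
    y, sr = librosa.load("savee_clip.wav", sr=None)
    # Each augmented waveform becomes an extra training sample;
    # per the paper's protocol, testing uses only original recordings.
    augmented = [add_noise(y), time_stretch(y), pitch_shift(y, sr)]
```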
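As a companion sketch for the classifier comparison, and again only as an assumption about how such a pipeline might look, the snippet below summarizes each clip as a mean MFCC vector with librosa and scores the two classical baselines (NB and kNN) with scikit-learn. The feature configuration, the value of k, and the evaluation code are illustrative; the paper's actual features, hyperparameters, and CNN architecture are not reproduced here.

```python
import numpy as np
import librosa
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

def mfcc_features(y: np.ndarray, sr: int, n_mfcc: int = 13) -> np.ndarray:
    """Summarize a clip as its mean MFCC vector over time."""
    return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc).mean(axis=1)

def compare_classifiers(X_train, y_train, X_test, y_test) -> None:
    """Fit NB and kNN on (augmented) training features and report accuracy
    on held-out original recordings, mirroring the paper's protocol of
    training on augmented data and testing on original data."""
    for name, clf in [("NB", GaussianNB()),
                      ("kNN", KNeighborsClassifier(n_neighbors=5))]:
        clf.fit(X_train, y_train)
        acc = accuracy_score(y_test, clf.predict(X_test))
        print(f"{name} accuracy: {acc:.3f}")
```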



Author information

Corresponding author

Correspondence to Nusrat J. Shoumy.



Copyright information

© 2021 Springer Nature Switzerland AG

About this paper

Cite this paper

Shoumy, N.J., Ang, L.-M., Rahaman, D.M.M., Zia, T., Seng, K.P., Khatun, S. (2021). Augmented Audio Data in Improving Speech Emotion Classification Tasks. In: Fujita, H., Selamat, A., Lin, J.C.-W., Ali, M. (eds) Advances and Trends in Artificial Intelligence. From Theory to Practice. IEA/AIE 2021. Lecture Notes in Computer Science, vol 12799. Springer, Cham. https://doi.org/10.1007/978-3-030-79463-7_30

  • DOI: https://doi.org/10.1007/978-3-030-79463-7_30

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-79462-0

  • Online ISBN: 978-3-030-79463-7

  • eBook Packages: Computer Science, Computer Science (R0)
