An audio-based anger detection algorithm using a hybrid artificial neural network and fuzzy logic model

Published in Multimedia Tools and Applications

Abstract

Audio Emotion Recognition (AER) is an important component of human emotion analysis, with or without accompanying visual cues. Speech audio carries modular parameters such as rhythm, tone, and pitch. Emotions, however, are highly complex, and the ability of human listeners to instantly understand the emotions conveyed to their ears is a skill refined over thousands of years of human evolution. Artificial intelligence (AI)-enabled AER has attracted worldwide attention and gained increasing importance among AI researchers in various fields over the last couple of years, especially after the start of the Covid-19 pandemic, when large-scale lockdowns and movement control orders around the world shifted work, schooling, and learning online on a mass scale. The audio quality on online platforms differs from device to device and depends on the quality and bandwidth of the Internet connection used. As the world recovers from the Covid-19 pandemic, an anger detection algorithm is therefore valuable for maintaining public security and general safety, and it can also help in the early detection of mental health or anger management issues. An angry person in public can pose a threat to the people around them and a risk of damage to public property, so detecting anger in voices in public places serves as a first line of defense against outbreaks of public nuisance or even violent crime. Moreover, the more prominent a person's anger, the more attention public security forces should devote to that person. This study uses a collection of audio files from the CREMA-D dataset as input: 364 audio files from 91 actors, each spoken with three degrees of anger and a neutral emotion, all using the script "It's eleven o'clock". A hybrid algorithm of an artificial neural network (ANN) and fuzzy logic is introduced, together with a dedicated preprocessing technique designed specifically for handling audio files. A comprehensive discussion and analysis of the results is presented, in which the proposed algorithm is compared with the other audio classification algorithms in the literature, many of which merely deploy a ready-made, general-purpose neural network-based algorithm. This brute-force reliance on an overly complicated computational structure proves inefficient, as the number of nodes involved in the computation far surpasses the number of preprocessed inputs. In addition, descriptions of the preprocessing procedures for audio classification in recent works are often unclear. Finally, the limitations of the experimental setup, suggestions for improvement, and potential applications of the findings are discussed and analyzed in the conclusion of this study.
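The abstract describes a hybrid of an ANN and fuzzy logic but, as an abstract, gives no implementation detail. Purely as an illustration of how such a hybrid could be wired together, the sketch below has a small feed-forward network produce a continuous anger score from averaged MFCC features, and triangular fuzzy membership functions translate that score into linguistic intensity grades. The feature choice (librosa MFCCs), network shape, weight values, membership breakpoints, and the example filename are all assumptions of this sketch, not the design used in the paper.

```python
# Illustrative sketch only: a tiny ANN scores an utterance for anger, and
# triangular fuzzy membership functions map the score to intensity grades
# (neutral / low / medium / high anger). All parameters are placeholders.
import numpy as np
import librosa

def extract_features(wav_path, n_mfcc=13):
    """Average MFCCs over time to get one fixed-length vector per clip."""
    y, sr = librosa.load(wav_path, sr=16000)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
    return mfcc.mean(axis=1)                      # shape: (n_mfcc,)

def ann_anger_score(x, w1, b1, w2, b2):
    """One hidden tanh layer, sigmoid output in [0, 1] (higher = angrier)."""
    h = np.tanh(x @ w1 + b1)
    return 1.0 / (1.0 + np.exp(-(h @ w2 + b2)))

def triangular(x, a, b, c):
    """Standard triangular fuzzy membership function on [a, c], peaking at b."""
    return np.clip(np.minimum((x - a) / (b - a + 1e-9),
                              (c - x) / (c - b + 1e-9)), 0.0, 1.0)

def fuzzify(score):
    """Map the ANN score to membership degrees for each anger grade."""
    return {
        "neutral":      triangular(score, -0.20, 0.00, 0.35),
        "low anger":    triangular(score,  0.20, 0.45, 0.65),
        "medium anger": triangular(score,  0.50, 0.70, 0.85),
        "high anger":   triangular(score,  0.75, 1.00, 1.20),
    }

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    w1, b1 = rng.normal(size=(13, 8)), np.zeros(8)   # untrained demo weights
    w2, b2 = rng.normal(size=8), 0.0
    feats = extract_features("1001_IEO_ANG_HI.wav")  # hypothetical file name
    score = ann_anger_score(feats, w1, b1, w2, b2)
    grades = fuzzify(score)
    print(f"anger score = {score:.2f}", max(grades, key=grades.get))
```

In such a setup the fuzzy layer is what turns a single network output into graded linguistic labels, matching the abstract's point that more prominent anger should attract proportionally more attention.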


Data Availability

This study uses the publicly available CREMA-D dataset, which can be accessed on GitHub at https://github.com/CheyneyComputerScience/CREMA-D.
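As a rough illustration of how the subset described in the abstract (91 actors, three anger intensities plus neutral, all on the sentence "It's eleven o'clock") might be pulled from a local copy of the dataset, the sketch below filters filenames, assuming the ActorID_Sentence_Emotion_Level naming scheme documented in the CREMA-D repository (e.g. 1001_IEO_ANG_HI.wav); the directory path is a placeholder.

```python
# Minimal sketch of selecting the 364-clip anger/neutral "It's eleven o'clock"
# subset from a local CREMA-D checkout. Assumes the repository's documented
# ActorID_Sentence_Emotion_Level.wav naming scheme; AUDIO_DIR is a placeholder.
from pathlib import Path

AUDIO_DIR = Path("CREMA-D/AudioWAV")          # placeholder path to the .wav files

def is_ieo_anger_or_neutral(path: Path) -> bool:
    actor, sentence, emotion, level = path.stem.split("_")
    if sentence != "IEO":                     # keep only "It's eleven o'clock"
        return False
    # Angry clips come in three intensities (LO/MD/HI); neutral is unspecified (XX).
    return (emotion == "ANG" and level in {"LO", "MD", "HI"}) or emotion == "NEU"

selected = sorted(p for p in AUDIO_DIR.glob("*.wav") if is_ieo_anger_or_neutral(p))
print(len(selected), "clips selected")        # expected: 364 (91 actors x 4 clips)
```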


Acknowledgements

The authors would like to thank the Editor-in-Chief, Editor(s), and the anonymous reviewers for their valuable comments and suggestions, which have helped to improve the quality and clarity of the paper.

Funding

This research did not receive any specific grant from funding agencies in the public, commercial, or not-for-profit sectors.

Author information


Contributions

All authors contributed to the conception and design of the study. Material preparation, data collection, data visualization and data analysis were performed by Arihant Surana, Manish Rathod, Shilpa Gite and Shruti Patil. Advanced data analysis and validation were done by Ketan Kotecha, Ganeshsree Selvachandran, Shio Gai Quek and Ajith Abraham. The first draft of the manuscript was written by Arihant Surana and Manish Rathod, while the second draft of the manuscript was written by Shilpa Gite, Shruti Patil and Ketan Kotecha. This project was supervised and administered by Shilpa Gite, Shruti Patil and Ketan Kotecha. The final draft and the revised manuscript were prepared and edited by Shio Gai Quek, Ganeshsree Selvachandran and Ajith Abraham. All authors commented on previous versions of the manuscript. All authors have read and approved the final manuscript.

Corresponding authors

Correspondence to Shilpa Gite, Ketan Kotecha or Ganeshsree Selvachandran.

Ethics declarations

Ethical Compliance

(i). Authors’ declaration: This manuscript is the authors' original work and has not been published elsewhere. All authors have checked the manuscript and have agreed to this submission.

(ii). Ethical approval: This article does not contain any studies with human participants or animals performed by any of the authors.

Competing Interests

The authors have no relevant financial or non-financial interests to disclose.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Surana, A., Rathod, M., Gite, S. et al. An audio-based anger detection algorithm using a hybrid artificial neural network and fuzzy logic model. Multimed Tools Appl 83, 38909–38929 (2024). https://doi.org/10.1007/s11042-023-16815-7

