Abstract
An intelligent human–computer interface should not only infer emotion automatically but also provide information about emotion change. The former has produced promising results, while the latter is still an open research problem. Emotion change, a rapid shift in one's emotional state, is a normal part of life that can be triggered by mental stress, ongoing situations, or the people we interact with. Recognizing the triggers and early warning signs of an imminent emotional swing is vital for managing these emotions. The single frequency filtering (SFF) spectrogram is a visual representation of speech that captures high temporal and frequency resolution simultaneously. In this study, the convolutional neural network (CNN) EfficientNetB0 is used to learn patterns in SFF spectrograms for localizing the instants of emotion change. The proposed detection process has two stages: the first constructs SFF spectrograms from the speech samples of each pitch cycle, and the second predicts the time instants at which emotion changes occur. The performance of the proposed method is evaluated using binary accuracy (BAC), binary cross-entropy loss (BECL), binary error (BError), and F1-score. The proposed method obtains accuracies of 0.95 and 0.952 on the datasets used. Experimental results on the Interactive Emotional Dyadic Motion Capture (IEMOCAP) and Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS) datasets establish the superiority of the proposed method over existing methods and other CNN architectures.
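The SFF representation behind the spectrograms described above can be sketched in code. This is a minimal illustration of single frequency filtering in the spirit of Aneeja and Yegnanarayana (2015), not the authors' implementation: the pole radius `r = 0.995`, the frequency grid, and the plain-Python filter loop are assumptions for illustration, and the pitch-cycle segmentation and EfficientNetB0 stages of the proposed method are omitted.

```python
import numpy as np

def sff_spectrogram(signal, fs, freqs, r=0.995):
    """Single frequency filtering (SFF) envelopes, one row per analysis frequency.

    Each frequency f is heterodyned to pi rad/sample, and the shifted signal
    is passed through a single-pole filter with its pole at z = -r; the filter
    then acts as a very narrow resonator around f in the original signal, and
    the magnitude of its output is the SFF envelope at f.
    """
    n = np.arange(len(signal))
    env = np.empty((len(freqs), len(signal)))
    for k, f in enumerate(freqs):
        w_shift = np.pi - 2.0 * np.pi * f / fs        # shifts f up to pi rad/sample
        shifted = signal * np.exp(1j * w_shift * n)   # complex frequency shift
        y = np.empty(len(signal), dtype=complex)
        acc = 0.0 + 0.0j
        for m, v in enumerate(shifted):               # y[m] = -r * y[m-1] + x[m]
            acc = -r * acc + v
            y[m] = acc
        env[k] = np.abs(y)                            # amplitude envelope at f
    return env

# Toy check: a 500 Hz tone should yield the strongest envelope
# in the 500 Hz row of the SFF "spectrogram".
fs = 8000
t = np.arange(2000) / fs
tone = np.sin(2 * np.pi * 500 * t)
env = sff_spectrogram(tone, fs, freqs=[250, 500, 1000])
```

Because the pole lies very close to the unit circle, each row of the output has an extremely narrow bandwidth around its analysis frequency while still being computed sample by sample, which is why SFF offers good frequency and temporal resolution at the same time.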

Data availability
The data, materials, and software applications used in this study were developed by the present authors.
Code availability
The code was developed by Shalini Kapoor.
References
Abbaschian BJ, Sierra-Sosa D, Elmaghraby A (2021) Deep learning techniques for speech emotion recognition, from databases to models. Sensors 21(4):1249
Alisamir S, Ringeval F (2021) On the evolution of speech representations for affective computing: a brief history and critical overview. IEEE Signal Process Mag 38(6):12–21
Altun H, Polat G (2009) Boosting selection of speech related features to improve performance of multi-class SVMs in emotion detection. Expert Syst Appl 36(4):8197–8203
Aneeja G, Yegnanarayana B (2015) Single frequency filtering approach for discriminating speech and nonspeech. IEEE Trans Audio Speech Lang Process 23(4):705–717. https://doi.org/10.1109/TASLP.2015.2404035
Aneeja G, Yegnanarayana B (2017) Extraction of fundamental frequency from degraded speech using temporal envelopes at high SNR frequencies. IEEE/ACM Trans Audio Speech Language Process 25(4):829–838. https://doi.org/10.1109/TASLP.2017.2666425
Badshah AM, … Baik SW (2019) Deep features-based speech emotion recognition for smart affective services. Multimed Tools Appl 78(5):5571–5589. https://doi.org/10.1007/s11042-017-5292-7
Bakhshi A, Harimi A, Chalup S (2022) CyTex: transforming speech to textured images for speech emotion recognition. Speech Commun 139:62–75
Ben-Ze’ev A (2003) Privacy, emotional closeness, and openness in cyberspace. Comput Hum Behav 19(4):451–467
Busso C, … Narayanan SS (2008) IEMOCAP: interactive emotional dyadic motion capture database. Lang Resour Eval 42(4):335–359. https://doi.org/10.1007/s10579-008-9076-6
Busso C, Lee S, Narayanan S (2009) Analysis of emotionally salient aspects of fundamental frequency for emotion detection. IEEE Trans Audio Speech Lang Process 17(4):582–596
Cowie R, Cornelius RR (2003) Describing the emotional states that are expressed in speech. Speech Comm 40(1–2):5–32
Davidson RJ (1998) Affective style and affective disorders: perspectives from affective neuroscience. Cognit Emot 12:307–330. https://doi.org/10.1080/026999398379628
Fredrickson BL, … Tugade MM (2000) The undoing effect of positive emotions. Motiv Emot 24(4):237–258
Gupta S, Fahad M, Deepak A (2020) Pitch-synchronous single frequency filtering spectrogram for speech emotion recognition. Multimed Tools Appl 79(31):23347–23365.
Huang Z (2015) An investigation of emotion changes from speech. In: 2015 International Conference on Affective Computing and Intelligent Interaction (ACII), pp 733–736. https://doi.org/10.1109/ACII.2015.7344650
Huang Z et al (2014) Speech emotion recognition using CNN. In: MM 2014 – Proceedings of the 2014 ACM Conference on Multimedia, pp 801–804. https://doi.org/10.1145/2647868.2654984
Issa D, Demirci MF, Yazici A (2020) Speech emotion recognition with deep convolutional neural networks. Biomed Signal Process Control 59:101894
Jiang W, … Li C (2019) Speech emotion recognition with heterogeneous feature unification of deep neural network. Sensors 19(12):2730. https://doi.org/10.3390/s19122730
Kadiri SR, Yegnanarayana B (2017) Epoch extraction from emotional speech using single frequency filtering approach. Speech Comm 86:52–63. https://doi.org/10.1016/j.specom.2016.11.005
Kadiri SR, Yegnanarayana B (2019) Analysis of aperiodicity in artistic Noh singing voice using an impulse sequence representation of excitation source. J Acoust Soc Am 146(6):4446–4457. https://doi.org/10.1121/1.5139225
Kim Y, Provost EM (2016) Emotion spotting: discovering regions of evidence in audio-visual emotion expressions. In: ICMI 2016 – Proceedings of the 18th ACM International Conference on Multimodal Interaction, pp 92–99. https://doi.org/10.1145/2993148.2993151
Kwon S et al (2020) A CNN-assisted enhanced audio signal processing for speech emotion recognition. Sensors 20(1):183
Livingstone SR, Russo FA (2018) The Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS): a dynamic, multimodal set of facial and vocal expressions in North American English. PLoS One 13(5):e0196391. https://doi.org/10.1371/journal.pone.0196391
Mao Q, … Zhan Y (2014) Learning salient features for speech emotion recognition using convolutional neural networks. IEEE Trans Multimedia 16(8):2203–2213. https://doi.org/10.1109/TMM.2014.2360798
Meng H, … Wei H (2019) Speech emotion recognition from 3D log-Mel spectrograms with deep learning network. IEEE Access 7:125868–125881. https://doi.org/10.1109/ACCESS.2019.2938007
Nam Y, Lee C (2021) Cascaded convolutional neural network architecture for speech emotion recognition in noisy conditions. Sensors 21(13):4399
Sezgin MC, Gunsel B, Kurt GK (2012) Perceptual audio features for emotion detection. EURASIP J Audio, Speech, Music Process 2012(1):1–21
Suveg C, … Kendall PC (2009) Changes in emotion regulation following cognitive-behavioral therapy for anxious youth. J Clin Child Adolesc Psychol 38(3):390–401
Tan M, Le QV (2019) EfficientNet: rethinking model scaling for convolutional neural networks. In: 36th International Conference on Machine Learning (ICML 2019), pp 10691–10700
Thanaraj KP, Noel JRA, Vijayarajan R (2021) Emotion classification from speech signal based on empirical mode decomposition and non-linear features. Complex Intell Syst 7(4):1919–1934
Wani TM, … Ambikairajah E (2021) A comprehensive review of speech emotion recognition systems. IEEE Access 9:47795–47814
Zhang J, … Cui D (2018) Analysis on speech signal features of manic patients. J Psychiatr Res 98:59–63
Zhao J, Mao X, Chen L (2019) Speech emotion recognition using deep 1D & 2D CNN LSTM networks. Biomed Signal Process Control 47:312–323
Zheng WQ, Yu JS, Zou YX (2015) An experimental study of speech emotion recognition based on deep convolutional neural networks. In: 2015 international conference on affective computing and intelligent interaction (ACII), pp 827–831
Zhu J, Thagard P (2002) Emotion and action. Philos Psychol 15(1):19–36
Author information
Authors and Affiliations
Contributions
The methodology was developed by Shalini Kapoor. The literature survey was carried out jointly by all authors.
Corresponding author
Ethics declarations
Ethical approval
All authors confirm that ethical approval was obtained.
Consent to participate
All authors consent to participate.
Consent to publish
All authors consent to publish the paper in this journal.
Conflict of interest
The authors have no competing interests to declare that are relevant to the content of this article.
Additional information
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Kapoor, S., Kumar, T. A novel approach to detect instant emotion change through spectral variation in single frequency filtering spectrogram of each pitch cycle. Multimed Tools Appl 82, 9413–9429 (2023). https://doi.org/10.1007/s11042-022-13731-0
Received:
Revised:
Accepted:
Published:
Issue Date:
DOI: https://doi.org/10.1007/s11042-022-13731-0