
Determining speaker attributes from stress-affected speech in emergency situations with hybrid SVM-DNN architecture

Published in Multimedia Tools and Applications

Abstract

Of the millions of emergency calls made each year, about a quarter report non-emergencies. To avoid dispatching responders to such situations, forensic examination of the reported situation, with speech as evidence, has become an indispensable requirement for emergency response centers. Caller profile information determined from emergency calls, such as gender, age, emotional state, transcript, and contextual sounds, can be highly beneficial for sophisticated forensic analysis. However, callers reporting emergencies often experience emotional stress, which causes variations in speech production. Furthermore, low voice quality and background noise make it very difficult to recognize caller attributes efficiently in such unconstrained environments. To overcome the limitations of traditional classification systems in these situations, this paper proposes a hybrid two-stage classification scheme. The framework consists of an ensemble of support vector machines (e-SVM) and a deep neural network (DNN) arranged in a cascade. The first-stage e-SVM comprises two models discriminatively trained on normal and stressful speech from emergency calls. The DNN, forming the second stage of the classification pipeline, is invoked only when the first stage produces ambiguous predictions. The adaptive nature of this two-stage scheme helps achieve both efficiency and high classification performance. Experiments conducted on a large dataset affirm the suitability of the proposed architecture for efficient real-time speaker attribute recognition. The framework is evaluated for gender recognition from emergency calls in the presence of emotions and background noise, and it yields significant performance improvements over similar state-of-the-art gender recognition approaches.
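To make the cascade concrete, the following is a minimal Python sketch of the adaptive two-stage logic the abstract describes: two first-stage SVMs (one per speech condition) are fused, and only examples whose top class posteriors are close are forwarded to the second-stage network. The scikit-learn models, the 0/1 label encoding, the posterior-averaging fusion rule, and the ambiguity margin are all illustrative assumptions; the paper's exact e-SVM fusion and DNN architecture are not specified in this abstract.

```python
# Hypothetical sketch of the two-stage e-SVM -> DNN cascade. Feature
# extraction (e.g., MFCC-style vectors) is assumed to happen upstream.
import numpy as np
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier


class CascadeGenderClassifier:
    def __init__(self, ambiguity_margin=0.2):
        # Stage 1: ensemble of SVMs, one trained on normal speech and
        # one on stress-affected speech (the paper's e-SVM idea).
        self.svm_normal = SVC(kernel="rbf", probability=True)
        self.svm_stress = SVC(kernel="rbf", probability=True)
        # Stage 2: a DNN fallback (a plain MLP stands in here),
        # consulted only for ambiguous first-stage predictions.
        self.dnn = MLPClassifier(hidden_layer_sizes=(256, 256))
        self.margin = ambiguity_margin  # illustrative threshold

    def fit(self, X_normal, y_normal, X_stress, y_stress):
        # Labels are assumed to be integer-encoded as 0/1 so that
        # predict_proba columns align across both SVMs.
        self.svm_normal.fit(X_normal, y_normal)
        self.svm_stress.fit(X_stress, y_stress)
        X_all = np.vstack([X_normal, X_stress])
        y_all = np.concatenate([y_normal, y_stress])
        self.dnn.fit(X_all, y_all)
        return self

    def predict(self, X):
        X = np.asarray(X)
        # Fuse the two discriminatively trained SVMs by averaging
        # their class posteriors (an assumed fusion rule).
        p = (self.svm_normal.predict_proba(X)
             + self.svm_stress.predict_proba(X)) / 2.0
        preds = np.argmax(p, axis=1)
        # An example is "ambiguous" when its top two class posteriors
        # are close; only those rows reach the costlier DNN stage.
        top2 = np.sort(p, axis=1)[:, -2:]
        ambiguous = (top2[:, 1] - top2[:, 0]) < self.margin
        if ambiguous.any():
            preds[ambiguous] = self.dnn.predict(X[ambiguous])
        return preds
```

Because the DNN runs only on the ambiguous subset, the average per-call cost stays close to that of the SVM stage alone, which is what makes the adaptive scheme attractive for real-time emergency-call processing.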



Acknowledgments

This work was supported by the ICT R&D program of MSIP/IITP (No. R0126-15-1119, Development of a solution for situation-awareness based on the analysis of speech and environmental sounds).

Author information


Corresponding author

Correspondence to Sung Wook Baik.



Cite this article

Ahmad, J., Sajjad, M., Rho, S. et al. Determining speaker attributes from stress-affected speech in emergency situations with hybrid SVM-DNN architecture. Multimed Tools Appl 77, 4883–4907 (2018). https://doi.org/10.1007/s11042-016-4041-7
