
Determining speaker attributes from stress-affected speech in emergency situations with hybrid SVM-DNN architecture

Published in Multimedia Tools and Applications

Abstract

Of the millions of emergency calls made each year, about a quarter report non-emergencies. To avoid dispatching responders to such situations, forensic examination of the reported situation, with speech as evidence, has become an indispensable requirement for emergency response centers. Caller profile information determined from emergency calls, such as gender, age, emotional state, transcript, and contextual sounds, can be highly beneficial for sophisticated forensic analysis. However, callers reporting emergencies often experience emotional stress, which causes variations in speech production. Furthermore, low voice quality and background noise make it very difficult to recognize caller attributes efficiently in such unconstrained environments. To overcome the limitations of traditional classification systems in these situations, this paper proposes a hybrid two-stage classification scheme. The framework consists of an ensemble of support vector machines (e-SVM) and a deep neural network (DNN) arranged in a cascade. The first-stage e-SVM comprises two models discriminatively trained on normal and stressful speech from emergency calls. The DNN, forming the second stage of the classification pipeline, is invoked only when the first stage produces ambiguous predictions. The adaptive nature of this two-stage scheme helps achieve both efficiency and high classification performance. Experiments conducted on a large dataset affirm the suitability of the proposed architecture for efficient real-time speaker attribute recognition. The framework is evaluated for gender recognition from emergency calls in the presence of emotions and background noise, and it yields significant performance improvements over similar state-of-the-art gender recognition approaches.
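To make the cascade concrete, the following is a minimal Python sketch of the adaptive two-stage logic the abstract describes: two first-stage SVMs (one per speech condition) are fused, and only examples whose top class posteriors are close are forwarded to the second-stage network. The scikit-learn models, the 0/1 label encoding, the posterior-averaging fusion rule, and the ambiguity margin are all illustrative assumptions; the paper's exact e-SVM fusion and DNN architecture are not specified in this abstract.

```python
# Hypothetical sketch of the two-stage e-SVM -> DNN cascade. Feature
# extraction (e.g., MFCC-style vectors) is assumed to happen upstream.
import numpy as np
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier


class CascadeGenderClassifier:
    def __init__(self, ambiguity_margin=0.2):
        # Stage 1: ensemble of SVMs, one trained on normal speech and
        # one on stress-affected speech (the paper's e-SVM idea).
        self.svm_normal = SVC(kernel="rbf", probability=True)
        self.svm_stress = SVC(kernel="rbf", probability=True)
        # Stage 2: a DNN fallback (a plain MLP stands in here),
        # consulted only for ambiguous first-stage predictions.
        self.dnn = MLPClassifier(hidden_layer_sizes=(256, 256))
        self.margin = ambiguity_margin  # illustrative threshold

    def fit(self, X_normal, y_normal, X_stress, y_stress):
        # Labels are assumed to be integer-encoded as 0/1 so that
        # predict_proba columns align across both SVMs.
        self.svm_normal.fit(X_normal, y_normal)
        self.svm_stress.fit(X_stress, y_stress)
        X_all = np.vstack([X_normal, X_stress])
        y_all = np.concatenate([y_normal, y_stress])
        self.dnn.fit(X_all, y_all)
        return self

    def predict(self, X):
        X = np.asarray(X)
        # Fuse the two discriminatively trained SVMs by averaging
        # their class posteriors (an assumed fusion rule).
        p = (self.svm_normal.predict_proba(X)
             + self.svm_stress.predict_proba(X)) / 2.0
        preds = np.argmax(p, axis=1)
        # An example is "ambiguous" when its top two class posteriors
        # are close; only those rows reach the costlier DNN stage.
        top2 = np.sort(p, axis=1)[:, -2:]
        ambiguous = (top2[:, 1] - top2[:, 0]) < self.margin
        if ambiguous.any():
            preds[ambiguous] = self.dnn.predict(X[ambiguous])
        return preds
```

Because the DNN runs only on the ambiguous subset, the average per-call cost stays close to that of the SVM stage alone, which is what makes the adaptive scheme attractive for real-time emergency-call processing.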



Acknowledgments

This work was supported by the ICT R&D program of MSIP/IITP (No. R0126-15-1119, Development of a solution for situation-awareness based on the analysis of speech and environmental sounds).

Author information


Corresponding author

Correspondence to Sung Wook Baik.



Cite this article

Ahmad, J., Sajjad, M., Rho, S. et al. Determining speaker attributes from stress-affected speech in emergency situations with hybrid SVM-DNN architecture. Multimed Tools Appl 77, 4883–4907 (2018). https://doi.org/10.1007/s11042-016-4041-7
