Detection of interactive voice response (IVR) in phone call records


Abstract

Real-time separation of pre-recorded messages (interactive voice response, IVR) from live speech fragments plays a significant role in speech emotion recognition (SER) systems, unwanted-call filtering, automatic detection of answering machine responses, reduction of stored record sizes, voicemail spam filtering, and related tasks. The difficulty of the problem is that, unlike the silence, music, and noise fragments handled by conventional voice activity detection (VAD), IVR fragments usually contain speech. Three classifiers for detecting live speech fragments in phone call records are considered, based on the support vector machine (SVM), gradient boosting (XGBoost), and a convolutional neural network (CNN). Audio fragments are represented by the Geneva Minimalistic Acoustic Parameter Set (GeMAPS) for XGBoost and SVM, and by log-spectrograms and gammatonegrams for the CNN. Experiments on a dataset of phone calls demonstrate comparable quality of the considered algorithms (around 0.96 by the averaged F1 measure), with the CNN having an advantage (0.98).
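
The abstract outlines a fragment-level binary classification pipeline: represent each phone-call fragment by acoustic features (GeMAPS functionals for SVM and XGBoost, log-spectrograms and gammatonegrams for the CNN) and evaluate with averaged F1. The snippet below is a minimal illustrative sketch of such a pipeline, not the authors' implementation: the file names are hypothetical placeholders, simple log-spectrogram statistics stand in for the GeMAPS feature set, and macro-averaging of F1 is assumed.

```python
# Minimal sketch of fragment-level IVR vs. live-speech classification as described
# in the abstract. Assumptions (not from the paper): file names are placeholders,
# log-spectrogram statistics stand in for GeMAPS functionals, F1 is macro-averaged.
import numpy as np
import librosa                                   # audio loading and spectrograms
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score
from xgboost import XGBClassifier

def fragment_features(path, sr=8000, n_fft=512, hop=160):
    """Summarise a phone-call fragment by per-band mean and std of its
    log-power spectrogram (a crude stand-in for GeMAPS functionals)."""
    y, _ = librosa.load(path, sr=sr, mono=True)
    power = np.abs(librosa.stft(y, n_fft=n_fft, hop_length=hop)) ** 2
    log_spec = librosa.power_to_db(power)        # shape: (freq_bins, frames)
    return np.concatenate([log_spec.mean(axis=1), log_spec.std(axis=1)])

# Placeholder fragment lists; replace with the full labelled call corpus
# (a handful of files is not enough for the stratified split below).
live_paths = ["live_001.wav", "live_002.wav"]    # fragments with a live speaker
ivr_paths = ["ivr_001.wav", "ivr_002.wav"]       # pre-recorded IVR prompts

X = np.stack([fragment_features(p) for p in live_paths + ivr_paths])
y = np.array([1] * len(live_paths) + [0] * len(ivr_paths))   # 1 = live, 0 = IVR

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2,
                                          stratify=y, random_state=0)

for name, clf in [("SVM", SVC(kernel="rbf", C=10.0)),
                  ("XGBoost", XGBClassifier(n_estimators=300, max_depth=4))]:
    clf.fit(X_tr, y_tr)
    print(name, "macro-averaged F1:",
          f1_score(y_te, clf.predict(X_te), average="macro"))
```

A CNN variant of this sketch would instead consume the two-dimensional log-spectrogram (or gammatonegram) of each fragment directly, rather than the per-band summary statistics used here.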



Acknowledgements

The results of this research project are published with the financial support of Tula State University within the framework of scientific project NIR_2018_20.

Author information


Corresponding author

Correspondence to Andrei Filin.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


About this article


Cite this article

Kopylov, A., Seredin, O., Filin, A. et al. Detection of interactive voice response (IVR) in phone call records. Int J Speech Technol 23, 907–915 (2020). https://doi.org/10.1007/s10772-020-09754-3

