Detection of interactive voice response (IVR) in phone call records


Abstract

Real-time separation of pre-recorded messages (interactive voice response, IVR) from live speech fragments plays a significant role in speech emotion recognition (SER) systems, unwanted-call filtering, automatic detection of answering machine responses, reduction of stored record sizes, voicemail spam filtering, and related tasks. The difficulty of the problem is that, unlike the silence, music, and noise fragments handled by conventional voice activity detection (VAD), IVR fragments usually contain speech. Three classifiers for detecting live speech fragments in phone call records are considered, based on the support vector machine (SVM), gradient boosting (XGBoost), and a convolutional neural network (CNN). Audio fragments are represented by the Geneva Minimalistic Acoustic Parameter Set (GeMAPS) for XGBoost and SVM, and by log-spectrograms and gammatonegrams for the CNN. Experiments on a dataset of phone calls demonstrate comparable quality of the considered algorithms (around 0.96 by the averaged F1 measure), with the CNN having an advantage (0.98).
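
The abstract outlines a fragment-level binary classification pipeline: represent each phone-call fragment by acoustic features (GeMAPS functionals for SVM and XGBoost, log-spectrograms and gammatonegrams for the CNN) and evaluate with averaged F1. The snippet below is a minimal illustrative sketch of such a pipeline, not the authors' implementation: the file names are hypothetical placeholders, simple log-spectrogram statistics stand in for the GeMAPS feature set, and macro-averaging of F1 is assumed.

```python
# Minimal sketch of fragment-level IVR vs. live-speech classification as described
# in the abstract. Assumptions (not from the paper): file names are placeholders,
# log-spectrogram statistics stand in for GeMAPS functionals, F1 is macro-averaged.
import numpy as np
import librosa                                   # audio loading and spectrograms
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score
from xgboost import XGBClassifier

def fragment_features(path, sr=8000, n_fft=512, hop=160):
    """Summarise a phone-call fragment by per-band mean and std of its
    log-power spectrogram (a crude stand-in for GeMAPS functionals)."""
    y, _ = librosa.load(path, sr=sr, mono=True)
    power = np.abs(librosa.stft(y, n_fft=n_fft, hop_length=hop)) ** 2
    log_spec = librosa.power_to_db(power)        # shape: (freq_bins, frames)
    return np.concatenate([log_spec.mean(axis=1), log_spec.std(axis=1)])

# Placeholder fragment lists; replace with the full labelled call corpus
# (a handful of files is not enough for the stratified split below).
live_paths = ["live_001.wav", "live_002.wav"]    # fragments with a live speaker
ivr_paths = ["ivr_001.wav", "ivr_002.wav"]       # pre-recorded IVR prompts

X = np.stack([fragment_features(p) for p in live_paths + ivr_paths])
y = np.array([1] * len(live_paths) + [0] * len(ivr_paths))   # 1 = live, 0 = IVR

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2,
                                          stratify=y, random_state=0)

for name, clf in [("SVM", SVC(kernel="rbf", C=10.0)),
                  ("XGBoost", XGBClassifier(n_estimators=300, max_depth=4))]:
    clf.fit(X_tr, y_tr)
    print(name, "macro-averaged F1:",
          f1_score(y_te, clf.predict(X_te), average="macro"))
```

A CNN variant of this sketch would instead consume the two-dimensional log-spectrogram (or gammatonegram) of each fragment directly, rather than the per-band summary statistics used here.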



Acknowledgements

The results of this research project are published with the financial support of Tula State University within the framework of scientific project NIR_2018_20.

Author information


Corresponding author

Correspondence to Andrei Filin.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


About this article


Cite this article

Kopylov, A., Seredin, O., Filin, A. et al. Detection of interactive voice response (IVR) in phone call records. Int J Speech Technol 23, 907–915 (2020). https://doi.org/10.1007/s10772-020-09754-3

