Abstract
Separating pre-recorded messages (interactive voice response, IVR) from live speech fragments in real time plays a significant role in speech emotion recognition (SER) systems, filtering of unwanted calls, automatic detection of answering-machine responses, reduction of stored record sizes, voicemail spam filtering, etc. The difficulty is that, unlike the silence, music, and noise fragments handled by conventional voice activity detection (VAD), IVR usually contains speech. Three classifiers for detecting live speech fragments in phone call records are considered: one based on the support vector machine (SVM), one on gradient boosting (XGBoost), and one on a convolutional neural network (CNN). Audio fragments are represented by the Geneva Minimalistic Acoustic Parameter Set (GeMAPS) features for XGBoost and SVM, and by log-spectrograms and gammatonegrams for the CNN. Experiments on a dataset of phone calls demonstrate comparable quality of the considered algorithms (around 0.96 by averaged F1 measure), with the CNN having an advantage (0.98).
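The log-spectrogram representation mentioned above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the window length, hop size, and sampling rate below are illustrative assumptions (8 kHz is typical for telephone audio), and the input here is a synthetic tone rather than a real call fragment.

```python
import numpy as np

def log_spectrogram(signal, win=256, hop=128, eps=1e-10):
    """Log-magnitude spectrogram via a Hann-windowed short-time FFT.

    Returns an array of shape (n_frames, win // 2 + 1), i.e. one row of
    frequency bins per analysis frame.
    """
    window = np.hanning(win)
    n_frames = 1 + (len(signal) - win) // hop
    frames = np.stack(
        [signal[i * hop : i * hop + win] * window for i in range(n_frames)]
    )
    spec = np.abs(np.fft.rfft(frames, axis=1))  # magnitude spectrum per frame
    return np.log(spec + eps)                   # eps avoids log(0)

# One second of a synthetic 440 Hz tone with light noise, standing in
# for an audio fragment from a phone call record.
sr = 8000
t = np.arange(sr) / sr
fragment = np.sin(2 * np.pi * 440 * t) + 0.05 * np.random.randn(sr)

S = log_spectrogram(fragment)
print(S.shape)  # (61, 129): 61 frames x 129 frequency bins
```

A matrix like `S` (a time-frequency image) is the kind of input a CNN consumes directly; for the SVM and XGBoost classifiers, fixed-length GeMAPS functionals would be extracted from the fragment instead, e.g. with the openSMILE toolkit cited below.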
References
Acero, A., Fisher, C. M., Yu, D., Wang, Y.-Y., & Ju, Y.-C. (2011). Detecting an answering machine using speech recognition. USA: Google Patents.
Agbinya, J. I. (1996). Discrete wavelet transform techniques in speech processing. In Proceedings of Digital Processing Applications (TENCON’96) (Vol. 2, pp. 514–519).
Chen, T., & Guestrin, C. (2016). XGBoost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '16) (pp. 785–794). https://doi.org/10.1145/2939672.2939785
Cowie, R., Douglas-Cowie, E., Tsapatsoulis, N., Votsis, G., Kollias, S., Fellenz, W., et al. (2001). Emotion recognition in human-computer interaction. Signal Processing Magazine IEEE, 18(1), 32–80. https://doi.org/10.1109/79.911197.
Dahake, P. P., Shaw, K., & Malathi, P. (2016). Speaker dependent speech emotion recognition using MFCC and Support Vector Machine. In 2016 International Conference on Automatic Control and Dynamic Optimization Techniques (ICACDOT) (pp. 1080–1084). IEEE. https://doi.org/10.1109/ICACDOT.2016.7877753
Deng, J., Xu, X., Zhang, Z., Fruhholz, S., & Schuller, B. (2016). Exploitation of phase-based features for whispered speech emotion recognition. IEEE Access, 4, 4299–4309. https://doi.org/10.1109/ACCESS.2016.2591442.
El Ayadi, M., Kamel, M. S., & Karray, F. (2011). Survey on speech emotion recognition: Features, classification schemes, and databases. Pattern Recognition, 44(3), 572–587. https://doi.org/10.1016/j.patcog.2010.09.020.
Eyben, F., Scherer, K. R., Schuller, B. W., Sundberg, J., Andre, E., Busso, C., et al. (2016). The Geneva Minimalistic Acoustic Parameter Set (GeMAPS) for voice research and affective computing. IEEE Transactions on Affective Computing, 7(2), 190–202. https://doi.org/10.1109/TAFFC.2015.2457417.
Eyben, F., Weninger, F., Gross, F., & Schuller, B. (2013). Recent developments in openSMILE, the munich open-source multimedia feature extractor. In Proceedings of the 21st ACM international conference on Multimedia - MM ’13 (pp. 835–838). New York, New York: ACM Press. https://doi.org/10.1145/2502081.2502224
Eyben, F., Wöllmer, M., & Schuller, B. (2009). OpenEAR: Introducing the Munich open-source emotion and affect recognition toolkit. In Proceedings: 2009 3rd International Conference on Affective Computing and Intelligent Interaction and Workshops, ACII 2009 (pp. 1–6). IEEE. https://doi.org/10.1109/ACII.2009.5349350
Fayek, H. M., Lech, M., & Cavedon, L. (2017). Evaluating deep learning architectures for speech emotion recognition. Neural Networks. https://doi.org/10.1016/j.neunet.2017.02.013.
Friedman, J. H. (2001). Greedy function approximation: A gradient boosting machine. Annals of Statistics, 29(5), 1189–1232. https://doi.org/10.2307/2699986.
Ju, Y. C., Wang, Y. Y., & Acero, A. (2006). Call analysis with classification using speech and non-speech features. In INTERSPEECH 2006 and 9th International Conference on Spoken Language Processing, INTERSPEECH 2006 - ICSLP (Vol. 4, pp. 1902–1905). International Speech Communication Association.
Kim, J. M., & Saurous, R. A. (2018). Emotion recognition from human speech using temporal information and deep learning. In Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH 2018 (pp. 937–940). https://doi.org/10.21437/Interspeech.2018-1132
Koolagudi, S. G., & Rao, K. S. (2012). Emotion recognition from speech: A review. International Journal of Speech Technology, 15(2), 99–117. https://doi.org/10.1007/s10772-011-9125-1.
Kopylov, A., Seredin, O., Naidyonov, A., & Zenin, D. (2017). The creation of a corpus of emotional data for the system of emotion-related states assessment of a dialogue with the call center operator. In 18-th All-Russian Conference with International Participation MMPR-18 (pp. 132–133). Moscow: TORUS PRESS. https://www.researchgate.net/publication/342571475_The_creation_of_a_corpus_of_emotional_data_for_the_system_of_emotion-related_states_assessment_of_a_dialogue_with_the_call_center_operator
Lim, W., Jang, D., & Lee, T. (2016). Speech emotion recognition using convolutional and recurrent neural networks. Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA), 2016, 1–4. https://doi.org/10.1109/APSIPA.2016.7820699.
Makarova, V., & Petrushin, V. A. (2002). Ruslana: a Database of Russian Emotional Utterances. In 7th International Conference on Spoken Language Processing (ICSLP02) (pp. 2041–2044).
Meier, M., Borsky, M., Magnusdottir, E. H., Johannsdottir, K. R., & Gudnason, J. (2016). Vocal tract and voice source features for monitoring cognitive workload. In 2016 7th IEEE International Conference on Cognitive Infocommunications (CogInfoCom) (pp. 000097–000102). IEEE. https://doi.org/10.1109/CogInfoCom.2016.7804532
Mu, Y., Hernández Gómez, L. A., Cano Montes, A., Martínez, C. A., Wang, X., & Gao, H. (2017). Speech emotion recognition using convolutional-recurrent neural networks with attention model. Information Science and Internet Technology, 15, 341–350.
Pathak, S., & Kolhe, V. (2016). A survey on emotion recognition from speech signal. International Journal of Advanced Research in Computer and Communication Engineering, 5(7), 447–450.
Peng, Z., Zhu, Z., Unoki, M., Dang, J., & Akagi, M. (2017). Speech emotion recognition using multichannel parallel convolutional recurrent neural networks based on gammatone auditory filterbank. In 2017 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC) (pp. 1750–1755). IEEE. https://doi.org/10.1109/APSIPA.2017.8282316
Poria, S., Majumder, N., Mihalcea, R., & Hovy, E. (2019). Emotion recognition in conversation: research challenges, datasets, and recent advances. IEEE Access. https://doi.org/10.1109/access.2019.2929050.
Prasomphan, S. (2015). Detecting human emotion via speech recognition by using speech spectrogram. In: 2015 IEEE International Conference on Data Science and Advanced Analytics (DSAA) (pp. 1–10). IEEE. https://doi.org/10.1109/DSAA.2015.7344793
Sainath, T. N., Kingsbury, B., Mohamed, A. R., & Ramabhadran, B. (2013). Learning filter banks within a deep neural network framework. 2013 IEEE Workshop on Automatic Speech Recognition and Understanding, ASRU 2013 - Proceedings, 297–302. https://doi.org/10.1109/ASRU.2013.6707746
Savargiv, M., & Bastanfard, A. (2016). Real-time speech emotion recognition by minimum number of features. In 2016 Artificial Intelligence and Robotics, IRANOPEN 2016 (pp. 72–76). IEEE. https://doi.org/10.1109/RIOS.2016.7529493
Seehapoch, T., & Wongthanavasu, S. (2013). Speech Emotion Recognition Using Support Vector Machines. In 2013 5th International Conference on Knowledge and Smart Technology (KST), 86–91. https://doi.org/10.1109/KST.2013.6512793
Shiota, S., Villavicencio, F., Yamagishi, J., Ono, N., Echizen, I., & Matsui, T. (2015). Voice liveness detection algorithms based on pop noise caused by human breath for automatic speaker verification. In Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH 2015 (pp. 239–243).
Siegert, I., & Wendemuth, A. (2017). IKANNOTATE2: A tool supporting annotation of emotions in audio-visual data. Elektronische Sprachsignalverarbeitung 2017. Tagungsband der 28. Konferenz, 86, 17–14.
Tabibi, S., Kegel, A., Lai, W. K., & Dillier, N. (2017). Investigating the use of a Gammatonefilterbank for a cochlear implant coding strategy. Journal of Neuroscience Methods, 277, 63–74.
Vapnik, V. N. (1998). Statistical learning theory. New York: Wiley-Interscience.
Voicemail detection powered by AI | Voximplant.com. (n.d.). Retrieved October 19, 2019, from https://voximplant.com/blog/voicemail-detection-powered-by-ai.
Wen, Z., Shi, J., Li, Q., He, B., & Chen, J. (2018). ThunderSVM: A fast SVM library on GPUs and CPUs. Journal of Machine Learning Research, 19, 1–5.
Zhang, L., Tan, S., & Yang, J. (2017). Hearing your voice is not enough: An Articulatory gesture based liveness detection for voice authentication. Proceedings of the ACM Conference on Computer and Communications Security. https://doi.org/10.1145/3133956.3133962.
Zhang, Y., & Abdulla, W. H. (2006). Gammatone auditory filterbank and independent component analysis for speaker identification. In INTERSPEECH 2006 and 9th International Conference on Spoken Language Processing, INTERSPEECH 2006 - ICSLP (pp. 2098–2101).
Acknowledgements
The results of the research project are published with the financial support of Tula State University within the framework of the scientific project NIR_2018_20.
Cite this article
Kopylov, A., Seredin, O., Filin, A. et al. Detection of interactive voice response (IVR) in phone call records. Int J Speech Technol 23, 907–915 (2020). https://doi.org/10.1007/s10772-020-09754-3