Abstract
In voice recognition, the two main problems are speech recognition (what was said), and speaker recognition (who was speaking). The usual method for speaker recognition is to postulate a model where the speaker identity corresponds to the parameters of the model, which estimation could be time-consuming when the number of candidate speakers is large. In this paper, we model the speaker as a high dimensional point cloud of entropy-based features, extracted from the speech signal. The method allows indexing, and hence it can manage large databases. We experimentally assessed the quality of the identification with a publicly available database formed by extracting audio from a collection of YouTube videos of 1,000 different speakers. With 20 second audio excerpts, we were able to identify a speaker with 97% accuracy when the recording environment is not controlled, and with 99% accuracy for controlled recording environments.
Similar content being viewed by others
References
Beltrán J, Chávez E, Favela J (2015) Scalable identification of mixed environmental sounds, recorded from heterogeneous sources. Pattern Recogn Lett 68:153–160
Bernhardsson E Annoy: approximate nearest neighbors in C++/Python optimized for memory usage and loading/saving to disk. https://github.com/spotify/annoy
Davis SB, Mermelstein P (1980) Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences. In: IEEE transactions on acoustics, speech, and signal processing, vol 28, pp 357–366
Dehak N, Kenny PJ, Dehak R, Dumouchel P, Ouellet P (2011) Front-end factor analysis for speaker verification. In: IEEE transactions on audio, speech and language processing, vol 19. pp 788–798
Greenberg C, Bansé D (2014) The NIST 2014 speaker recognition i-vector machine learning challenge. In: Proc the speaker and language recognition workshop, pp 224–230
Hansen JH, Hasan T (2015) Speaker recognition by machines and humans: a tutorial review. IEEE Signal Proc Mag 32(6):74–99
Kenny P (2005) Joint factor analysis of speaker and session variability: theory and algorithms. CRIM, Montreal,(Report) CRIM-06/08-13, pp 1–17
Kenny P, Mihoubi M, Dumouchel P (2003) New MAP estimators for speaker recognition. Interspeech, pp 1–4
Kinnunen T, Li H (2010) An overview of text-independent speaker recognition: from features to supervectors. Speech Comm 52:12–40
Kulis B, Grauman K (2012) Kernelized locality-sensitive hashing. IEEE Trans Pattern Anal Mach Intell 34(6):1092–1104
Liu Y, Nie L, Liu L, Rosenblum DS (2016) From action to activity: sensor-based activity recognition. Neurocomputing 181:108–115
Reynolds DA, Quatieri TF, Dunn RB (2000) Speaker verification using adapted Gaussian mixture models. Digital Signal Processing: A Review Journal 10(1):19–41
Schmidt L (2014) Large scale speaker identification. In: 2014 IEEE international conference on acoustic, speech and signal processing (ICASSP), pp 1669–1673
Shannon CE (2001) A mathematical theory of communication. ACM SIGMOBILE Mobile Computing and Communications Review 5(1):3
Snyder D, Garcia-romero D, Povey D (2015) Time delay deep neural network-based universal background models for speaker recognition. In: 2015 IEEE workshop on automatic speech recognition and understanding (ASRU), IEEE, pp 92–97
Uhlmann JK (1991) Satisfying general proximity / similarity queries with metric trees. Inf Process Lett 40:175–179
Yianilos PN (1993) Data structures and algorithms for nearest neighbor search in general metric spaces. Annual ACM-SIAM Symposium on Discrete Algorithms, pp 311–321
Author information
Authors and Affiliations
Corresponding author
Additional information
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
About this article
Cite this article
Luque-Suárez, F., Camarena-Ibarrola, A. & Chávez, E. Efficient speaker identification using spectral entropy. Multimed Tools Appl 78, 16803–16815 (2019). https://doi.org/10.1007/s11042-018-7035-9
Received:
Revised:
Accepted:
Published:
Issue Date:
DOI: https://doi.org/10.1007/s11042-018-7035-9