A review on Deep Learning approaches in Speaker Identification

Authors:

Sreenivas Sremath Tirumala,

Seyed Reza Shahamiri

ICSPS 2016: Proceedings of the 8th International Conference on Signal Processing Systems

Pages 142 - 147

https://doi.org/10.1145/3015166.3015210

Published: 21 November 2016

Abstract

Deep learning (DL) is becoming an increasingly interesting and powerful machine learning method with successful applications in many domains, such as natural language processing, image recognition, handwritten character recognition, and computer vision. Despite its eminent success, limitations of the traditional learning approach may still prevent deep learning from achieving a wide range of realistic learning tasks. DL approaches have shown success in speech recognition and speaker identification over traditional approaches such as those that use Mel Frequency Cepstral Coefficients for feature extraction with Gaussian Mixture Models. However, the speaker identification research community is not fully aware of the DL process and its application to speaker identification. This paper aims to reduce this knowledge gap and to promote research on implementing deep learning techniques for speaker identification. We present a review of the DL methodologies used for speaker identification and survey important DL algorithms that could be explored in future work. We categorise the applications of DL for speaker identification according to the stages of the speaker identification process and review these implementations.

A review on Deep Learning approaches in Speaker Identification
1. Computing methodologies
  1. Artificial intelligence
    1. Natural language processing
  2. Machine learning
    1. Machine learning approaches

Published In

ICSPS 2016: Proceedings of the 8th International Conference on Signal Processing Systems

November 2016

235 pages

ISBN: 9781450347907

DOI: 10.1145/3015166

Copyright © 2016 ACM.

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article
Research
Refereed limited

Conference

ICSPS 2016

ICSPS 2016: 8th International Conference on Signal Processing Systems

November 21 - 24, 2016

Auckland, New Zealand

Acceptance Rates

ICSPS 2016 Paper Acceptance Rate: 46 of 83 submissions, 55%

Overall Acceptance Rate 46 of 83 submissions, 55%
