Speaker Recognition on Low Power Device Using Fully Convolutional QuartzNet

Authors:

Blessius Sheldo Putra Laksono,

Barlian Henryranu Prasetio

SIET '23: Proceedings of the 8th International Conference on Sustainable Information Engineering and Technology

Pages 619 - 624

https://doi.org/10.1145/3626641.3626946

Published: 27 December 2023

Abstract

Demand is growing for small, lightweight speaker recognition algorithms that can run on low-power devices, driven mainly by users' security and privacy concerns about how their personal and biometric data are used. Because speaker recognition is mainly used for biometric authentication, the model must also be accurate. Previous methods use feature engineering to extract features from raw audio files and rely heavily on the training data, so a mismatch between the training data and real-world conditions significantly reduces their accuracy. We propose a Fully Convolutional QuartzNet as a deep learning approach to this problem. The model achieves 84.6% accuracy on a small subset of the DR-VCTK dataset with 30 classes and 56.40% accuracy on a small subset of the VoxCeleb dataset with 125 classes and fewer files per class. It was also tested on binary speaker recognition, where it achieves an equal error rate (EER) of 5.07%. The model needs only 33K parameters without a significant loss in performance, and it reaches its highest accuracy with only 53K parameters.
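
The abstract reports an EER for the binary (verification) task without defining the metric. As a rough illustration only, and not the authors' evaluation code, the equal error rate is the operating point at which the false acceptance rate (impostor trials accepted) equals the false rejection rate (genuine trials rejected). The function name `equal_error_rate` and the toy scores below are hypothetical, chosen purely for the example.

```python
def equal_error_rate(genuine_scores, impostor_scores):
    """Approximate the EER by sweeping a threshold over all observed scores.

    A trial is accepted when its score is >= the threshold.
    Returns (eer, threshold) at the point where FAR and FRR are closest.
    """
    thresholds = sorted(set(genuine_scores) | set(impostor_scores))
    best = None  # (gap between FAR and FRR, EER estimate, threshold)
    for t in thresholds:
        # False acceptance rate: impostor trials that pass the threshold.
        far = sum(s >= t for s in impostor_scores) / len(impostor_scores)
        # False rejection rate: genuine trials that fail the threshold.
        frr = sum(s < t for s in genuine_scores) / len(genuine_scores)
        gap = abs(far - frr)
        if best is None or gap < best[0]:
            best = (gap, (far + frr) / 2, t)
    return best[1], best[2]


# Toy similarity scores (hypothetical, not taken from the paper).
genuine = [0.9, 0.8, 0.7, 0.4]
impostor = [0.6, 0.3, 0.2, 0.1]
eer, threshold = equal_error_rate(genuine, impostor)
print(f"EER = {eer:.2%} at threshold {threshold}")
```

With a finite set of trials the two error rates rarely cross exactly, so this sketch reports their average at the closest point; evaluation toolkits often interpolate instead.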

Index Terms

Speaker Recognition on Low Power Device Using Fully Convolutional QuartzNet
1. Computing methodologies
  1. Artificial intelligence
    1. Natural language processing
      1. Speech recognition

Published In

SIET '23: Proceedings of the 8th International Conference on Sustainable Information Engineering and Technology

October 2023

722 pages

ISBN:9798400708503

DOI:10.1145/3626641

Copyright © 2023 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article
Research
Refereed limited

Conference

SIET 2023

SIET 2023: International Conference on Sustainable Information Engineering and Technology

October 24 - 25, 2023

Badung, Bali, Indonesia

Acceptance Rates

Overall Acceptance Rate 45 of 57 submissions, 79%
