Abstract
Performance degradation caused by intraspeaker variability is an active research topic in speaker recognition, and accuracy loss over time is a widely observed phenomenon in the field. In China, many people move between their birthplace and their workplace, and differences in cultural environment and local customs affect their pronunciation. The ongoing work focuses on the time-varying and region-changed factors introduced by such population migration. This paper presents a time-varying and region-changed speech database (TRSD) collected from 55 university students over 3 years, comprising 3795 utterances in total. To study the impact of the time-varying and region-changed factors on speaker identification and to uncover hidden factors that may cause performance degradation, a series of experiments is conducted on the database. In these experiments, changes in characteristic parameters (pitch, intensity, formant and spectrogram) are analyzed and grouped by gender and birthplace. The Gaussian mixture model-universal background model (GMM-UBM), a deep neural network (DNN) model, i-vector/PLDA and x-vector/PLDA are evaluated on TRSD to provide reference performance. For the time-varying and region-changed factors, the paper also provides three corresponding compensation methods: speaker model adaptation, cepstral mean normalization and mel-frequency cepstral coefficient normalization.
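To make the compensation step concrete, the following is a minimal sketch of cepstral mean normalization (CMN), one of the three methods named above. It subtracts the per-utterance mean of each cepstral coefficient so that slowly varying channel and session offsets are removed. The sketch assumes MFCC features extracted with librosa and illustrates the general technique only, not the authors' implementation; the file name is a placeholder.

    import numpy as np
    import librosa

    def cmn(features: np.ndarray) -> np.ndarray:
        # Cepstral mean normalization: subtract the per-utterance mean of
        # each cepstral coefficient (rows = coefficients, columns = frames),
        # removing slowly varying channel and session offsets.
        return features - features.mean(axis=1, keepdims=True)

    # Hypothetical usage ("utterance.wav" is a placeholder file name):
    y, sr = librosa.load("utterance.wav", sr=16000)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=20)
    mfcc_cmn = cmn(mfcc)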
Data availability
Data will be made available on reasonable request.
Acknowledgements
This work is supported by the Natural Science Foundation of China under Grants No. 61806078, No. 62076094 and No. 61976091, and by the Shanghai Science and Technology Program "Distributed and generative few-shot algorithm and theory research" under Grant No. 20511100600.
Ethics declarations
Conflict of interest
The manuscript has been approved by all authors for publication, and no conflict of interest exists in its submission.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Cite this article
Li, D., Liu, J., Wang, Z. et al. TRSD: A Time-Varying and Region-Changed Speech Database for Speaker Recognition. Circuits Syst Signal Process 41, 3931–3956 (2022). https://doi.org/10.1007/s00034-022-01964-1