Abstract
Performance degradation caused by intraspeaker variability is an active research topic in speaker recognition, and accuracy loss over time is a widely observed phenomenon in the field. In China, many people move between their birthplace and their workplace, and differences in cultural environment and local customs affect their pronunciation. The ongoing work focuses on the time-varying and region-changed factors introduced by such population migration. This paper presents a time-varying and region-changed speech database (TRSD) collected from 55 university students over 3 years, comprising 3795 utterances in total. To study the impact of the time-varying and region-changed factors on speaker identification and to uncover hidden factors that may cause performance degradation, a series of experiments is conducted on the database. In these experiments, changes in characteristic parameters (pitch, intensity, formant and spectrogram) are analyzed and grouped by gender and birthplace. The Gaussian mixture model-universal background model (GMM-UBM), a deep neural network (DNN) model, i-vector/PLDA and x-vector/PLDA are evaluated on TRSD to provide reference performance. For the time-varying and region-changed factors, the paper also provides three corresponding compensation methods: speaker model adaptation, cepstral mean normalization and mel-frequency cepstral coefficient normalization.
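To make the compensation step concrete, the following is a minimal sketch of cepstral mean normalization (CMN), one of the three methods named above. It subtracts the per-utterance mean of each cepstral coefficient so that slowly varying channel and session offsets are removed. The sketch assumes MFCC features extracted with librosa and illustrates the general technique only, not the authors' implementation; the file name is a placeholder.

    import numpy as np
    import librosa

    def cmn(features: np.ndarray) -> np.ndarray:
        # Cepstral mean normalization: subtract the per-utterance mean of
        # each cepstral coefficient (rows = coefficients, columns = frames),
        # removing slowly varying channel and session offsets.
        return features - features.mean(axis=1, keepdims=True)

    # Hypothetical usage ("utterance.wav" is a placeholder file name):
    y, sr = librosa.load("utterance.wav", sr=16000)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=20)
    mfcc_cmn = cmn(mfcc)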
Data availability
Data will be made available on reasonable request.
Acknowledgements
This work is supported by the Natural Science Foundation of China under Grants No. 61806078, No. 62076094 and No. 61976091, and by the Shanghai Science and Technology Program "Distributed and generative few-shot algorithm and theory research" under Grant No. 20511100600.
Ethics declarations
Conflict of interest
The manuscript has been approved by all authors for publication, and no conflict of interest exists in its submission.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Cite this article
Li, D., Liu, J., Wang, Z. et al. TRSD: A Time-Varying and Region-Changed Speech Database for Speaker Recognition. Circuits Syst Signal Process 41, 3931–3956 (2022). https://doi.org/10.1007/s00034-022-01964-1