Abstract
This work explores the scope of duration modification for speaker verification (SV) under mismatched speech tempo conditions. SV performance is found to depend on the speaking rate of a speaker. A mismatch in speaking rate between train and test speech can degrade system performance, which makes it a crucial issue for deployable systems. In this work, SV performance is analyzed by varying the speaking rate of the train and test speech. Based on these studies, a framework is proposed to compensate for the mismatch in speech tempo. The framework modifies the duration of the test speech according to a mismatch factor derived from the speaking rates of the train and test speech. This in turn matches the tempo of the test speech to that of the claimed speaker model. The proposed approach is found to have a significant impact on SV performance when compared with performance under mismatched conditions. A set of practical data with speech tempo mismatch is also used to cross-validate the framework.
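The abstract does not state how the mismatch factor is derived or which duration modification method is used, so the following is only a minimal sketch of the general idea under stated assumptions. It assumes that a speaking rate estimate (for example, syllables per second) is already available for both the train and the test speech, takes the mismatch factor to be the ratio of the two rates, and uses a naive overlap-add (OLA) time stretch as a stand-in for the duration modification step. The function names `mismatch_factor` and `ola_time_stretch`, the frame length, and the hop size are all hypothetical choices made for illustration.

```python
import math


def mismatch_factor(train_rate, test_rate):
    """Duration scaling factor for the test speech (assumed definition).

    Rates are in units per second (e.g. syllables/s). A test utterance
    spoken faster than the training speech gives a factor above 1, i.e.
    the test speech has to be lengthened to match the training tempo.
    """
    if train_rate <= 0 or test_rate <= 0:
        raise ValueError("speaking rates must be positive")
    return test_rate / train_rate


def ola_time_stretch(signal, factor, frame_len=400, synth_hop=100):
    """Naive overlap-add time stretch (illustrative stand-in only).

    The output duration is roughly `factor` times the input duration.
    Frames are read with an analysis hop of synth_hop / factor and
    written with a fixed synthesis hop, then normalized by the summed
    window so that overlapping frames do not change the amplitude.
    """
    if factor <= 0:
        raise ValueError("factor must be positive")
    n = len(signal)
    if n < frame_len:
        return list(signal)  # too short to frame; return unchanged

    # Hann window
    window = [0.5 - 0.5 * math.cos(2 * math.pi * i / frame_len)
              for i in range(frame_len)]
    ana_hop = synth_hop / factor
    n_frames = int((n - frame_len) / ana_hop) + 1
    out_len = (n_frames - 1) * synth_hop + frame_len
    out = [0.0] * out_len
    norm = [0.0] * out_len

    for k in range(n_frames):
        a = min(int(round(k * ana_hop)), n - frame_len)  # read position
        s = k * synth_hop                                # write position
        for i in range(frame_len):
            w = window[i]
            out[s + i] += w * signal[a + i]
            norm[s + i] += w

    return [o / w if w > 1e-8 else 0.0 for o, w in zip(out, norm)]


# Usage: test speech at 6 units/s against a model trained at 4 units/s.
factor = mismatch_factor(train_rate=4.0, test_rate=6.0)
test_speech = [math.sin(2 * math.pi * 200 * t / 8000) for t in range(8000)]
matched = ola_time_stretch(test_speech, factor)
```

Plain OLA does not align frames to the waveform, so it can introduce audible phase artifacts in voiced speech. Methods that account for waveform similarity or glottal epochs are generally preferred for real speech; the sketch only shows how a rate ratio can drive the change in duration.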
Cite this article
Das, R.K., Sharma, B. & Prasanna, S.R.M. Significance of duration modification for speaker verification under mismatch speech tempo condition. Int J Speech Technol 21, 401–408 (2018). https://doi.org/10.1007/s10772-017-9474-5
DOI: https://doi.org/10.1007/s10772-017-9474-5