A Robust Speaking Rate Estimator Using a CNN-BLSTM Network

Circuits, Systems, and Signal Processing

Abstract

Direct acoustic feature-based speaking rate estimation is useful in applications such as pronunciation assessment, dysarthria detection and automatic speech recognition. Most existing speaking rate estimators contain heuristically designed steps. In contrast, this work proposes a data-driven approach based on a convolutional neural network-bidirectional long short-term memory (CNN-BLSTM) network that jointly optimizes all steps of speaking rate estimation within a single framework. Unlike existing deep learning-based methods, the proposed approach estimates the speaking rate of an entire speech utterance in a single pass rather than processing fixed-duration segments. The traditional 19 sub-band energy (SBE) contours, on which the state-of-the-art direct acoustic feature-based techniques are also built, serve as the low-level input features of the proposed CNN-BLSTM network. Experiments are performed separately on three native English speech corpora (Switchboard, TIMIT and CTIMIT) and one non-native English speech corpus (ISLE). TIMIT and Switchboard are used for training the network, while testing is carried out on all four corpora as well as on TIMIT and Switchboard with additive white, car, high-frequency-channel, cockpit and babble noise at 20, 10 and 0 dB signal-to-noise ratios. The proposed CNN-BLSTM approach outperforms the best existing techniques in both clean and noisy conditions on all four corpora.
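The abstract specifies the model's interface (19 SBE contours in, one utterance-level rate out) but not its internal configuration. Below is a minimal PyTorch sketch of such a CNN-BLSTM regressor; the layer sizes, kernel widths and temporal pooling are illustrative assumptions, not the authors' published configuration, which is available in the repository linked in the Notes below.

```python
# Hypothetical sketch of a CNN-BLSTM speaking-rate estimator matching the
# interface described in the abstract: 19 sub-band energy (SBE) contours in,
# one speaking-rate value per utterance out. All hyperparameters below are
# illustrative assumptions, not the authors' published settings.
import torch
import torch.nn as nn

class SpeakingRateEstimator(nn.Module):
    def __init__(self, n_subbands=19, conv_channels=64, lstm_hidden=128):
        super().__init__()
        # 1-D convolutions over time learn local patterns (e.g. syllable-scale
        # energy modulations) from the stacked SBE contours.
        self.cnn = nn.Sequential(
            nn.Conv1d(n_subbands, conv_channels, kernel_size=5, padding=2),
            nn.BatchNorm1d(conv_channels),
            nn.ReLU(),
            nn.Conv1d(conv_channels, conv_channels, kernel_size=5, padding=2),
            nn.BatchNorm1d(conv_channels),
            nn.ReLU(),
        )
        # A bidirectional LSTM reads the whole utterance, so sequences of
        # arbitrary length are handled in a single pass.
        self.blstm = nn.LSTM(conv_channels, lstm_hidden,
                             batch_first=True, bidirectional=True)
        # Regress a single speaking-rate value (e.g. syllables per second).
        self.head = nn.Linear(2 * lstm_hidden, 1)

    def forward(self, sbe):                 # sbe: (batch, n_subbands, time)
        feats = self.cnn(sbe)               # (batch, channels, time)
        feats = feats.transpose(1, 2)       # (batch, time, channels)
        out, _ = self.blstm(feats)          # (batch, time, 2 * lstm_hidden)
        utt = out.mean(dim=1)               # pool over time -> utterance level
        return self.head(utt).squeeze(-1)   # (batch,) predicted rate

# Example: one 3-second utterance at 10 ms frames (300 frames, 19 SBE contours).
model = SpeakingRateEstimator()
rate = model(torch.randn(1, 19, 300))
print(rate.shape)  # torch.Size([1])
```

Mean-pooling the BLSTM outputs over time is one simple way to map a variable-length utterance to a single scalar, which is what lets such a network estimate the rate of an entire utterance in one pass instead of over fixed-duration segments.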

Notes

  1. https://github.com/diviya97/CNN-BLSTM-Speaking-Rate-Estimator.

Author information

Corresponding author

Correspondence to Chiranjeevi Yarra.

Ethics declarations

Data availability

Data sharing was not applicable to this article, as no datasets were generated during the current study.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Srinivasan, A., Singh, D., Yarra, C. et al. A Robust Speaking Rate Estimator Using a CNN-BLSTM Network. Circuits Syst Signal Process 40, 6098–6120 (2021). https://doi.org/10.1007/s00034-021-01754-1
