Abstract
Direct acoustic feature-based speaking rate estimation is useful in applications including pronunciation assessment, dysarthria detection, and automatic speech recognition. Most existing approaches to speaking rate estimation rely on heuristically designed steps. In contrast, this work proposes a data-driven approach based on a convolutional neural network-bidirectional long short-term memory (CNN-BLSTM) network that jointly optimizes all steps of speaking rate estimation within a single framework. Moreover, unlike existing deep learning-based methods for speaking rate estimation, the proposed approach estimates the speaking rate for an entire speech utterance in one pass rather than operating on segments of a fixed duration. The traditional 19 sub-band energy (SBE) contours serve as the low-level input features of the proposed CNN-BLSTM network; the state-of-the-art direct acoustic feature-based speaking rate estimation techniques are likewise built on these 19 SBEs. Experiments are performed separately on three native English speech corpora (Switchboard, TIMIT and CTIMIT) and a non-native English speech corpus (ISLE). Of these, TIMIT and Switchboard are used to train the network, while testing is carried out on all four corpora as well as on TIMIT and Switchboard with additive noise, namely white, car, high-frequency-channel, cockpit, and babble, at 20, 10 and 0 dB signal-to-noise ratios. The proposed CNN-BLSTM approach outperforms the best of the existing techniques in both clean and noisy conditions on all four corpora.
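To make the architecture concrete, the following is a minimal, hypothetical PyTorch sketch of a CNN-BLSTM regressor that maps 19 SBE contours of a whole utterance to a single speaking-rate value. The layer sizes, kernel width, mean-pooling step, and class name are illustrative assumptions, not the authors' configuration.

```python
import torch
import torch.nn as nn

class CNNBLSTMRateEstimator(nn.Module):
    """Illustrative CNN-BLSTM speaking-rate estimator (not the paper's exact model).

    Input: 19 sub-band energy (SBE) contours, shaped (batch, 19, frames),
    with arbitrary utterance length. Output: one speaking-rate value
    (e.g. syllables/second) per utterance.
    """

    def __init__(self, n_bands=19, conv_channels=32, lstm_units=64):
        super().__init__()
        # 1-D convolution scans the SBE contours along the time axis
        self.cnn = nn.Sequential(
            nn.Conv1d(n_bands, conv_channels, kernel_size=5, padding=2),
            nn.BatchNorm1d(conv_channels),
            nn.ReLU(),
        )
        # Bidirectional LSTM summarizes the entire utterance in one pass
        self.blstm = nn.LSTM(conv_channels, lstm_units,
                             batch_first=True, bidirectional=True)
        # Single regression output: the utterance-level speaking rate
        self.head = nn.Linear(2 * lstm_units, 1)

    def forward(self, sbe):                   # sbe: (batch, 19, frames)
        h = self.cnn(sbe)                     # (batch, channels, frames)
        h = h.transpose(1, 2)                 # (batch, frames, channels)
        out, _ = self.blstm(h)                # (batch, frames, 2*units)
        pooled = out.mean(dim=1)              # average over all frames
        return self.head(pooled).squeeze(-1)  # (batch,)

model = CNNBLSTMRateEstimator()
utterances = torch.randn(2, 19, 300)  # two utterances, 300 frames each
rates = model(utterances)
print(rates.shape)  # torch.Size([2])
```

Because the BLSTM consumes the full frame sequence and the output is pooled over time, utterances of any length yield a single estimate, matching the whole-utterance (rather than fixed-segment) formulation described above.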
Data availability
Data sharing was not applicable to this article, as no datasets were generated during the current study.
Cite this article
Srinivasan, A., Singh, D., Yarra, C. et al. A Robust Speaking Rate Estimator Using a CNN-BLSTM Network. Circuits Syst Signal Process 40, 6098–6120 (2021). https://doi.org/10.1007/s00034-021-01754-1