Abstract
Direct acoustic feature-based speaking rate estimation is useful in applications including pronunciation assessment, dysarthria detection, and automatic speech recognition. Most existing approaches to speaking rate estimation rely on heuristically designed steps. In contrast, this work proposes a data-driven approach based on a convolutional neural network-bidirectional long short-term memory (CNN-BLSTM) network that jointly optimizes all steps of speaking rate estimation within a single framework. Moreover, unlike existing deep learning-based methods for speaking rate estimation, the proposed approach estimates the speaking rate for an entire speech utterance in one pass rather than operating on segments of a fixed duration. The traditional 19 sub-band energy (SBE) contours serve as the low-level input features of the proposed CNN-BLSTM network; the state-of-the-art direct acoustic feature-based speaking rate estimation techniques are likewise built on these 19 SBEs. Experiments are performed separately on three native English speech corpora (Switchboard, TIMIT and CTIMIT) and a non-native English speech corpus (ISLE). Of these, TIMIT and Switchboard are used to train the network, while testing is carried out on all four corpora as well as on TIMIT and Switchboard with additive noise, namely white, car, high-frequency-channel, cockpit, and babble, at 20, 10 and 0 dB signal-to-noise ratios. The proposed CNN-BLSTM approach outperforms the best of the existing techniques in both clean and noisy conditions on all four corpora.
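To make the architecture concrete, the following is a minimal, hypothetical PyTorch sketch of a CNN-BLSTM regressor that maps 19 SBE contours of a whole utterance to a single speaking-rate value. The layer sizes, kernel width, mean-pooling step, and class name are illustrative assumptions, not the authors' configuration.

```python
import torch
import torch.nn as nn

class CNNBLSTMRateEstimator(nn.Module):
    """Illustrative CNN-BLSTM speaking-rate estimator (not the paper's exact model).

    Input: 19 sub-band energy (SBE) contours, shaped (batch, 19, frames),
    with arbitrary utterance length. Output: one speaking-rate value
    (e.g. syllables/second) per utterance.
    """

    def __init__(self, n_bands=19, conv_channels=32, lstm_units=64):
        super().__init__()
        # 1-D convolution scans the SBE contours along the time axis
        self.cnn = nn.Sequential(
            nn.Conv1d(n_bands, conv_channels, kernel_size=5, padding=2),
            nn.BatchNorm1d(conv_channels),
            nn.ReLU(),
        )
        # Bidirectional LSTM summarizes the entire utterance in one pass
        self.blstm = nn.LSTM(conv_channels, lstm_units,
                             batch_first=True, bidirectional=True)
        # Single regression output: the utterance-level speaking rate
        self.head = nn.Linear(2 * lstm_units, 1)

    def forward(self, sbe):                   # sbe: (batch, 19, frames)
        h = self.cnn(sbe)                     # (batch, channels, frames)
        h = h.transpose(1, 2)                 # (batch, frames, channels)
        out, _ = self.blstm(h)                # (batch, frames, 2*units)
        pooled = out.mean(dim=1)              # average over all frames
        return self.head(pooled).squeeze(-1)  # (batch,)

model = CNNBLSTMRateEstimator()
utterances = torch.randn(2, 19, 300)  # two utterances, 300 frames each
rates = model(utterances)
print(rates.shape)  # torch.Size([2])
```

Because the BLSTM consumes the full frame sequence and the output is pooled over time, utterances of any length yield a single estimate, matching the whole-utterance (rather than fixed-segment) formulation described above.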
Data availability
Data sharing was not applicable to this article, as no datasets were generated during the current study.
Cite this article
Srinivasan, A., Singh, D., Yarra, C. et al. A Robust Speaking Rate Estimator Using a CNN-BLSTM Network. Circuits Syst Signal Process 40, 6098–6120 (2021). https://doi.org/10.1007/s00034-021-01754-1