Mixed Bangla-English Spoken Digit Classification Using Convolutional Neural Network

Das, Shuvro; Yasmin, Mst. Rubayat; Arefin, Musfikul; Taher, Kazi Abu; Uddin, Md Nasir; Rahman, Muhammad Arifur

doi:10.1007/978-3-030-82269-9_29

Part of the book series: Communications in Computer and Information Science (CCIS, volume 1435)

Included in the following conference series:

International Conference on Applied Intelligence and Informatics

A correction to this publication is available online at https://doi.org/10.1007/978-3-030-82269-9_31

Abstract

Speech recognition is an important field in the current era of scientific and digital progress. Technology connects people around the world, and as people move from one country to another they share their cultures and languages. Speech recognition eases this exchange by converting most languages into a readable format. Yet despite advances in automatic speech recognition (ASR), research on Bangla speech recognition remains rudimentary. In Bangladesh, mixed Bangla-English speech is often needed in different use cases, mostly in educational institutions and hospital environments. Because most research focuses on speech recognition in English, we were motivated to develop a classifier that transcribes isolated spoken digits in mixed Bangla-English. For English, we used an open-source dataset; for Bangla, we created a dataset recorded in a noisy environment by speakers of different ages, genders, and dialects. For the mixed dataset, we used Mel Frequency Cepstral Coefficients (MFCC) for feature extraction and a Convolutional Neural Network (CNN) classifier to train, test, and analyze the data. In two different experiments, we found promising results.
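The MFCC feature extraction step named in the abstract can be illustrated with a small from-scratch sketch. This is not the authors' implementation, and libraries such as librosa provide optimized versions. It follows the textbook pipeline: framing, Hamming window, power spectrum, triangular mel filterbank, log, and DCT-II. The frame length (256), hop (128), number of mel filters (20), number of coefficients (13), the 8 kHz sample rate, and the synthetic 440 Hz test tone are all illustrative assumptions, not values taken from the paper.

```python
import math

def hz_to_mel(f):
    # Common mel-scale formula (O'Shaughnessy variant).
    return 2595.0 * math.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def power_spectrum(frame):
    """Naive DFT power spectrum of one frame (bins 0..N/2). Slow but dependency-free."""
    n = len(frame)
    spec = []
    for k in range(n // 2 + 1):
        re = sum(x * math.cos(2 * math.pi * k * i / n) for i, x in enumerate(frame))
        im = -sum(x * math.sin(2 * math.pi * k * i / n) for i, x in enumerate(frame))
        spec.append((re * re + im * im) / n)
    return spec

def mel_filterbank(n_filters, n_fft, sample_rate):
    """Triangular filters whose edges are evenly spaced on the mel scale."""
    mel_max = hz_to_mel(sample_rate / 2)
    edges_hz = [mel_to_hz(mel_max * i / (n_filters + 1)) for i in range(n_filters + 2)]
    bin_hz = [k * sample_rate / n_fft for k in range(n_fft // 2 + 1)]
    bank = []
    for m in range(1, n_filters + 1):
        lo, mid, hi = edges_hz[m - 1], edges_hz[m], edges_hz[m + 1]
        row = []
        for f in bin_hz:
            if lo < f <= mid:
                row.append((f - lo) / (mid - lo))   # rising slope
            elif mid < f < hi:
                row.append((hi - f) / (hi - mid))   # falling slope
            else:
                row.append(0.0)
        bank.append(row)
    return bank

def mfcc(signal, sample_rate, frame_len=256, hop=128, n_filters=20, n_coeffs=13):
    """Return one list of n_coeffs cepstral coefficients per frame."""
    window = [0.54 - 0.46 * math.cos(2 * math.pi * i / (frame_len - 1))
              for i in range(frame_len)]            # Hamming window
    bank = mel_filterbank(n_filters, frame_len, sample_rate)
    features = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = [signal[start + i] * window[i] for i in range(frame_len)]
        spec = power_spectrum(frame)
        # Log mel energies; the floor avoids log(0) for empty or silent bands.
        log_e = [math.log(max(sum(w * p for w, p in zip(row, spec)), 1e-10))
                 for row in bank]
        # Unnormalized DCT-II decorrelates the log energies into cepstral coefficients.
        coeffs = [sum(e * math.cos(math.pi * c * (j + 0.5) / n_filters)
                      for j, e in enumerate(log_e))
                  for c in range(n_coeffs)]
        features.append(coeffs)
    return features

# Hypothetical input: a synthetic 440 Hz tone standing in for a spoken-digit clip.
sample_rate = 8000
tone = [math.sin(2 * math.pi * 440 * t / sample_rate) for t in range(1024)]
feats = mfcc(tone, sample_rate)
print(len(feats), len(feats[0]))  # 7 13
```

The result is a frames-by-coefficients matrix (here 7 by 13). A matrix of this kind is what a 2D CNN classifier would typically take as its input, in the same way it would take an image.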

Change history

26 July 2021
The caption of figure 9 in the original version of chapter 29 contained erroneous data and typographical errors. The wrong value and typographical errors have been corrected.

Author information

Authors and Affiliations

Department of ICT, Bangladesh University of Professionals, Dhaka, Bangladesh
Shuvro Das, Mst. Rubayat Yasmin, Musfikul Arefin & Kazi Abu Taher
Department of Neurology, University of Rochester, Rochester, NY, 14642, USA
Md Nasir Uddin
Department of Physics, Jahangirnagar University, Dhaka, Bangladesh
Muhammad Arifur Rahman

Corresponding author

Correspondence to Shuvro Das.

Editor information

Editors and Affiliations

Nottingham Trent University, Nottingham, UK
Mufti Mahmud
Jahangirnagar University, Savar, Dhaka, Bangladesh
M. Shamim Kaiser
Auckland University of Technology, Auckland, New Zealand
Nikola Kasabov
Old Dominion University, Norfolk, VA, USA
Khan Iftekharuddin
Maebashi Institute of Technology, Maebashi, Japan
Ning Zhong

About this paper

Cite this paper

Das, S., Yasmin, M.R., Arefin, M., Taher, K.A., Uddin, M.N., Rahman, M.A. (2021). Mixed Bangla-English Spoken Digit Classification Using Convolutional Neural Network. In: Mahmud, M., Kaiser, M.S., Kasabov, N., Iftekharuddin, K., Zhong, N. (eds) Applied Intelligence and Informatics. AII 2021. Communications in Computer and Information Science, vol 1435. Springer, Cham. https://doi.org/10.1007/978-3-030-82269-9_29

DOI: https://doi.org/10.1007/978-3-030-82269-9_29
Published: 26 July 2021
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-82268-2
Online ISBN: 978-3-030-82269-9
eBook Packages: Computer Science, Computer Science (R0)
