Abstract
Feature extraction is an essential phase in the development of Automatic Speech Recognition (ASR) systems. This study examines the performance of different deep neural network architectures, namely Convolutional Neural Networks (CNNs), Long Short-Term Memory networks (LSTMs), and bidirectional LSTM (Bi-LSTM) models, for Amazigh speech recognition. We apply several feature extraction techniques, specifically Mel-Frequency Cepstral Coefficients (MFCC), spectrograms, and Mel-spectrograms, and measure their effect on each architecture. The results show that the Bi-LSTM with spectrograms achieved the best performance in our Amazigh ASR study, with a maximum accuracy of 85%. We also show that each feature type offers specific advantages depending on the particular neural network architecture employed.
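For readers who want a concrete starting point, the sketch below shows one common way to compute the three feature types compared in this study. It uses the librosa library, and the parameter choices (16 kHz sampling rate, 13 MFCCs, 40 mel bands) are illustrative assumptions, not the authors' exact configuration.

```python
# Minimal sketch (not the authors' exact pipeline) of the three feature
# types compared in the paper. librosa and all parameter values here are
# assumptions for illustration.
import librosa
import numpy as np

def extract_features(wav_path, sr=16000, n_mfcc=13, n_mels=40):
    """Return MFCC, linear spectrogram, and mel-spectrogram for one utterance."""
    y, sr = librosa.load(wav_path, sr=sr)

    # MFCC: cepstral coefficients summarizing the spectral envelope.
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)

    # Spectrogram: magnitude of the short-time Fourier transform, in dB.
    stft = np.abs(librosa.stft(y))
    spectrogram = librosa.amplitude_to_db(stft, ref=np.max)

    # Mel-spectrogram: STFT energies warped onto the perceptual mel scale.
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels)
    mel_spectrogram = librosa.power_to_db(mel, ref=np.max)

    return mfcc, spectrogram, mel_spectrogram
```

Each call returns a 2-D time-frequency array that can then be fed to a CNN, LSTM, or Bi-LSTM classifier as in the comparison above.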








Data availability
No datasets were generated or analysed during the current study.
Funding
Not applicable.
Author information
Contributions
Meryam Telmem: ABCDEF. Naouar Laaidi: ABCDEF. Hassan Satori: ABCDEF. All authors reviewed the manuscript.
Ethics declarations
Competing interests
The authors declare no competing interests.
Additional information
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Telmem, M., Laaidi, N. & Satori, H. The impact of MFCC, spectrogram, and Mel-Spectrogram on deep learning models for Amazigh speech recognition system. Int J Speech Technol 28, 299–312 (2025). https://doi.org/10.1007/s10772-025-10183-3