Robustness of Whisper Features for Infant Cry Classification

  • Conference paper
Speech and Computer (SPECOM 2023)

Abstract

Early and correct identification of pathology from infant cries is an important and socially relevant research problem, as timely intervention can save the lives of many infants and improve their quality of life. This study proposes using the pre-trained encoder module (WEM) of Web-scale Supervised Pretraining for Speech Recognition (WSPSR), also known as Whisper, for the infant cry classification task. The WEM features are compared with the state-of-the-art Mel Frequency Cepstral Coefficients (MFCC) feature set for classifying normal vs. pathological infant cries. Additionally, we introduce a multi-class classification approach for pathological infant cries using Convolutional Neural Network (CNN) and Bidirectional Long Short-Term Memory (Bi-LSTM) networks. Our study concludes that WEM features combined with Deep Neural Network (DNN) classifiers, such as CNN and Bi-LSTM, outperform the MFCC feature set by a significant margin. In addition, a series of comprehensive experiments was conducted to assess noise robustness, and the results indicate that WEM features are relatively more robust than MFCC. The experiments were performed using 10-fold cross-validation on the standard and statistically meaningful Baby Chilanto dataset, the in-house DA-IICT corpus, and a combined dataset derived from these two.
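The evaluation protocol named in the abstract (noise-robustness experiments and 10-fold cross-validation) can be illustrated with a minimal sketch. It is not the authors' code. The helper names `mix_at_snr` and `k_fold_indices`, the SNR-based mixing rule, and the shuffling seed are assumptions made for illustration, and the abstract does not state which noise types or noise levels were used.

```python
import math
import random


def mix_at_snr(signal, noise, snr_db):
    """Return signal plus scaled noise so the mixture has the requested SNR (dB).

    Hypothetical helper: the abstract does not describe how noise was added.
    """
    if len(signal) != len(noise):
        raise ValueError("signal and noise must have the same length")
    p_signal = sum(x * x for x in signal) / len(signal)
    p_noise = sum(n * n for n in noise) / len(noise)
    if p_noise == 0:
        raise ValueError("noise must have non-zero power")
    # Pick the gain so that p_signal / (gain**2 * p_noise) == 10 ** (snr_db / 10).
    gain = math.sqrt(p_signal / (p_noise * 10 ** (snr_db / 10)))
    return [x + gain * n for x, n in zip(signal, noise)]


def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train_idx, test_idx) pairs; each sample is held out exactly once."""
    order = list(range(n_samples))
    random.Random(seed).shuffle(order)
    # Striding keeps fold sizes within one sample of each other.
    folds = [order[i::k] for i in range(k)]
    for i, test_idx in enumerate(folds):
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, test_idx
```

In such a setup, each feature set (WEM or MFCC) would be extracted from the clean and noise-mixed recordings, and a classifier would be trained and scored once per fold, with the per-fold scores averaged.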

Acknowledgements

The authors thank the Ministry of Electronics and Information Technology (MeitY), New Delhi, Government of India, for sponsoring the project National Language Translation Mission (NLTM): BHASHINI, which has the objective of Building Assistive Speech Technologies for the Challenged (Grant ID: 11(1)2022-HCC (TDIL)). They also thank the organizers, namely the National Institute of Astrophysics and Optical Electronics, CONACYT Mexico, for the statistically meaningful Baby Chilanto Database.

Author information

Correspondence to Monil Charola, Siddharth Rathod or Hemant A. Patil.

Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Charola, M., Rathod, S., Patil, H.A. (2023). Robustness of Whisper Features for Infant Cry Classification. In: Karpov, A., Samudravijaya, K., Deepak, K.T., Hegde, R.M., Agrawal, S.S., Prasanna, S.R.M. (eds) Speech and Computer. SPECOM 2023. Lecture Notes in Computer Science, vol 14339. Springer, Cham. https://doi.org/10.1007/978-3-031-48312-7_34

  • DOI: https://doi.org/10.1007/978-3-031-48312-7_34

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-48311-0

  • Online ISBN: 978-3-031-48312-7

  • eBook Packages: Computer Science, Computer Science (R0)
