Deep Learning for Image and Sound Data: An Overview

  • Conference paper

Part of the book series: Lecture Notes on Data Engineering and Communications Technologies ((LNDECT,volume 164))

Abstract

The growth of deep learning (DL) in recent years has been astonishing, and nearly every domain of daily life is often touched, and sometimes consumed, by DL breakthroughs. This motivated the present state-of-the-art review, which surveys the algorithms used in DL, their architectures and designs, and their application domains, organized by the type of input data: image and sound. For image data, the domains covered are image classification and object detection; for sound data, they are automatic speech recognition, automatic spoken language identification, and speech emotion recognition. We hope this paper makes it easier for new researchers to find the latest techniques in DL.

Author information

Corresponding author

Correspondence to Hilali Manal.

Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Manal, H., Abdellah, E., Said, B.A. (2023). Deep Learning for Image and Sound Data: An Overview. In: Hassanien, A.E., et al. The 3rd International Conference on Artificial Intelligence and Computer Vision (AICV2023), March 5–7, 2023. AICV 2023. Lecture Notes on Data Engineering and Communications Technologies, vol 164. Springer, Cham. https://doi.org/10.1007/978-3-031-27762-7_27
