Abstract
The growth of deep learning (DL) in recent years has been astonishing, and nearly every domain of daily life has been touched, and sometimes transformed, by DL breakthroughs. This motivated the present state-of-the-art review, which surveys the algorithms used in DL, their architectures and designs, and their application domains, organized by type of input data, namely image and sound data. For image data, it covers image classification and object detection; for sound data, it covers automatic speech recognition, automatic spoken language identification, and speech emotion recognition. We hope this paper makes it easier for new researchers to find the latest techniques in DL.
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Manal, H., Abdellah, E., Said, B.A. (2023). Deep Learning for Image and Sound Data: An Overview. In: Hassanien, A.E., et al. The 3rd International Conference on Artificial Intelligence and Computer Vision (AICV2023), March 5–7, 2023. AICV 2023. Lecture Notes on Data Engineering and Communications Technologies, vol 164. Springer, Cham. https://doi.org/10.1007/978-3-031-27762-7_27
DOI: https://doi.org/10.1007/978-3-031-27762-7_27
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-27761-0
Online ISBN: 978-3-031-27762-7
eBook Packages: Intelligent Technologies and Robotics, Intelligent Technologies and Robotics (R0)