Multi-modal active learning with deep reinforcement learning for target feature extraction in multi-media image processing applications

  • 1178: Pattern Recognition for Adaptive User Interfaces
Multimedia Tools and Applications

Abstract

Advances in on-demand Multimedia Streaming Applications (MAS) enable faster video transmission in response to user requests across many fields. Such systems nevertheless suffer from poor speed, flexibility and efficiency when accessing and presenting multimedia content from an archive, and data delivery is often affected by delay, packet loss and congestion. Manual annotation is therefore required for access and retrieval, but it yields poor retrieval accuracy over large databases. Automatic annotation in MAS can increase retrieval accuracy in similar-image retrieval systems that rely on various low-level features, and thereby closes the gap between high-level semantics and low-level feature representation. The quality of automated image annotation depends on how accurately a model detects edges, color, texture, shape and spatial information. In this paper, we develop an automated annotation model that retrieves visually similar images from online multimedia streams with optimal feature extraction. The model is built on Multi-modal Active Learning (MAL), which uses a Convolutional Recurrent Neural Network (CRNN) to annotate labels automatically based on visually similar content or features such as edges, color, texture, shape and spatial information. A Deep Reinforcement Learning (DRL) algorithm then improves the performance of the retrieval engine by validating the visually extracted features. MAL-CNN is simulated over large online streaming databases and then validated by DRL on online real-time streaming. Performance is evaluated in terms of retrieval accuracy, sensitivity, specificity, f-measure, geometric mean and mean absolute percentage error (MAPE). The results confirm the accuracy of the proposed MAL-DRL model against conventional machine learning, reinforcement learning and deep learning automatic annotation models.
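
The evaluation measures named in the abstract can be illustrated with a minimal Python sketch. It assumes the common textbook definitions for a binary relevant/not-relevant retrieval decision, with the geometric mean taken as the square root of sensitivity times specificity; the abstract does not state the exact formulas used in the paper, so these definitions and the function names `confusion_counts`, `retrieval_metrics` and `mape` are illustrative assumptions, not the authors' implementation.

```python
import math


def confusion_counts(y_true, y_pred):
    """Count TP, TN, FP, FN for binary labels (1 = relevant, 0 = not relevant)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn


def retrieval_metrics(y_true, y_pred):
    """Accuracy, sensitivity, specificity, f-measure and geometric mean.

    Assumed (standard) definitions; a zero denominator yields 0.0.
    """
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total if total else 0.0
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0  # recall on relevant items
    specificity = tn / (tn + fp) if (tn + fp) else 0.0  # recall on non-relevant items
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    pr_sum = precision + sensitivity
    f_measure = 2 * precision * sensitivity / pr_sum if pr_sum else 0.0
    g_mean = math.sqrt(sensitivity * specificity)
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "f_measure": f_measure,
        "g_mean": g_mean,
    }


def mape(actual, predicted):
    """Mean absolute percentage error in percent; pairs with actual == 0 are skipped."""
    pairs = [(a, p) for a, p in zip(actual, predicted) if a != 0]
    if not pairs:
        raise ValueError("MAPE is undefined when every actual value is zero")
    return 100.0 * sum(abs((a - p) / a) for a, p in pairs) / len(pairs)


if __name__ == "__main__":
    # Toy labels, made up for illustration only (not data from the paper).
    truth = [1, 1, 1, 1, 0, 0, 0, 0]
    guess = [1, 1, 1, 0, 0, 0, 0, 1]
    print(retrieval_metrics(truth, guess))
    print(mape([100.0, 200.0], [110.0, 180.0]))
```

Sensitivity and specificity are reported separately here because accuracy alone can look high on imbalanced retrieval sets; the geometric mean combines the two into one figure that drops to zero if either class is missed entirely.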

Author information

Corresponding author

Correspondence to Gaurav Dhiman.

About this article

Cite this article

Dhiman, G., Kumar, A.V., Nirmalan, R. et al. Multi-modal active learning with deep reinforcement learning for target feature extraction in multi-media image processing applications. Multimed Tools Appl 82, 5343–5367 (2023). https://doi.org/10.1007/s11042-022-12178-7

  • DOI: https://doi.org/10.1007/s11042-022-12178-7
