Abstract
Advances in on-demand Multimedia Streaming Applications (MAS) enable faster video transmission in response to user requests across various fields. Such systems, however, suffer from poor speed, flexibility and efficiency when accessing and presenting multimedia content from an archive, and data delivery is frequently affected by delay, packet loss and congestion. Manual annotation is therefore needed for access and retrieval, but it yields poor retrieval accuracy over large databases. Automatic annotation in MAS improves the accuracy of similar-image retrieval systems that rely on various low-level features, and thereby bridges the gap between high-level semantics and low-level feature representation. The quality of automated image annotation depends on how accurately a model detects edges, color, texture, shape and spatial information. In this paper, we develop an automated annotation model that retrieves visually similar images from online multimedia streams with optimal feature extraction. The model is built on Multi-modal Active Learning (MAL), which uses a Convolutional Recurrent Neural Network (CRNN) to annotate labels automatically based on visually similar content or features such as edges, color, texture, shape and spatial information. A Deep Reinforcement Learning (DRL) algorithm then improves the performance of the retrieval engine by validating the visually extracted features. MAL-CNN is simulated over large online streaming databases and then validated by DRL on online real-time streaming. Performance is evaluated in terms of retrieval accuracy, sensitivity, specificity, f-measure, geometric mean and mean absolute percentage error (MAPE). The results confirm the accuracy of the proposed MAL-DRL model against conventional machine learning, reinforcement learning and deep learning automatic annotation models.
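The evaluation measures named in the abstract can be illustrated with a minimal sketch. The abstract does not define them, so the formulas below are the standard textbook ones and should be read as assumptions: in particular, the geometric mean is taken here as the square root of sensitivity times specificity, and the function names `confusion_counts`, `retrieval_metrics` and `mape` are hypothetical, not taken from the paper.

```python
import math


def confusion_counts(y_true, y_pred, positive=1):
    """Count TP, TN, FP, FN for a binary relevant / not-relevant labelling."""
    tp = tn = fp = fn = 0
    for t, p in zip(y_true, y_pred):
        if p == positive and t == positive:
            tp += 1
        elif p == positive:
            fp += 1
        elif t == positive:
            fn += 1
        else:
            tn += 1
    return tp, tn, fp, fn


def _safe_div(num, den):
    # Return 0.0 instead of raising when a denominator is empty.
    return num / den if den else 0.0


def retrieval_metrics(y_true, y_pred, positive=1):
    """Accuracy, sensitivity, specificity, f-measure and geometric mean.

    Assumption: geometric mean = sqrt(sensitivity * specificity).
    """
    tp, tn, fp, fn = confusion_counts(y_true, y_pred, positive)
    sensitivity = _safe_div(tp, tp + fn)   # true positive rate (recall)
    specificity = _safe_div(tn, tn + fp)   # true negative rate
    precision = _safe_div(tp, tp + fp)
    return {
        "accuracy": _safe_div(tp + tn, tp + tn + fp + fn),
        "sensitivity": sensitivity,
        "specificity": specificity,
        "f_measure": _safe_div(2 * precision * sensitivity,
                               precision + sensitivity),
        "g_mean": math.sqrt(sensitivity * specificity),
    }


def mape(actual, predicted):
    """Mean absolute percentage error, skipping zero-valued actuals."""
    pairs = [(a, p) for a, p in zip(actual, predicted) if a != 0]
    if not pairs:
        raise ValueError("MAPE is undefined when every actual value is zero")
    return 100.0 * sum(abs((a - p) / a) for a, p in pairs) / len(pairs)


if __name__ == "__main__":
    # Toy labels: 1 = image is relevant to the query, 0 = not relevant.
    y_true = [1, 1, 1, 1, 0, 0, 0, 0]
    y_pred = [1, 1, 1, 0, 0, 0, 0, 1]
    print(retrieval_metrics(y_true, y_pred))
    print(mape([100, 200, 400], [110, 180, 400]))
```

The toy labels are illustrative only and are not results from the paper. Skipping zero-valued actuals in `mape` is one common convention for avoiding division by zero; other treatments are possible.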














Cite this article
Dhiman, G., Kumar, A.V., Nirmalan, R. et al. Multi-modal active learning with deep reinforcement learning for target feature extraction in multi-media image processing applications. Multimed Tools Appl 82, 5343–5367 (2023). https://doi.org/10.1007/s11042-022-12178-7
DOI: https://doi.org/10.1007/s11042-022-12178-7