Multi-modal multi-concept-based deep neural network for automatic image annotation

Xu, Haijiao; Huang, Changqin; Huang, Xiaodi; Huang, Muxiong

doi:10.1007/s11042-018-6555-7

Published: 24 August 2018

Volume 78, pages 30651–30675, (2019)
Multimedia Tools and Applications

Haijiao Xu¹,
Changqin Huang¹,² (ORCID: orcid.org/0000-0003-1371-2608),
Xiaodi Huang³ &
Muxiong Huang¹

Abstract

Automatic Image Annotation (AIA) remains a challenge in computer vision and its real-world applications because of the semantic gap between high-level semantic concepts and low-level visual appearances. Contextual tags attached to visual images and context semantics among semantic concepts can provide further semantic information to bridge this gap. To capture these semantic correlations effectively, we present a novel approach called Multi-modal Multi-concept-based Deep Neural Network (M2-DNN), which models the correlations of visual images, contextual tags, and multi-concept semantics. Unlike traditional AIA methods, our M2-DNN approach takes into account not only single-concept context semantics, but also multi-concept context semantics with abstract scenes. In our model, a multi-concept such as {"plane", "buildings"} is viewed as one holistic scene concept for concept learning. Specifically, we first construct a multi-modal Deep Neural Network (DNN) as a concept classifier for visual images and contextual tags, and then employ it to annotate unlabeled images. Second, real-world databases commonly include many difficult concepts that are hard to recognize, such as concepts with similar appearances, concepts with abstract scenes, and rare concepts. To recognize them effectively, we utilize multi-concept semantics inference and multi-modal correlation learning to refine semantic annotations. Finally, we estimate the most relevant labels for each unlabeled image through a new label decision strategy. The results of our comprehensive experiments on two publicly available datasets show that our method performs favourably compared with several other state-of-the-art methods.
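The idea of viewing a multi-concept as one holistic scene concept can be illustrated with a small sketch. The abstract does not say how multi-concepts are chosen, so the co-occurrence counting, the `min_support` threshold, and the helper names `mine_multi_concepts` and `expand_labels` below are assumptions made purely for illustration, not the authors' procedure.

```python
from collections import Counter
from itertools import combinations


def mine_multi_concepts(annotations, min_support=2, size=2):
    """Return concept sets of the given size that co-occur in at least
    `min_support` images (hypothetical selection rule, assumed here)."""
    counts = Counter()
    for labels in annotations:
        # Sort so that each concept set has one canonical tuple form.
        for combo in combinations(sorted(set(labels)), size):
            counts[combo] += 1
    return {combo for combo, n in counts.items() if n >= min_support}


def expand_labels(labels, multi_concepts):
    """Append every mined multi-concept fully contained in an image's labels,
    so that it can be learned as one extra scene-level label."""
    present = set(labels)
    scenes = sorted(mc for mc in multi_concepts if present.issuperset(mc))
    return sorted(present) + scenes


# Toy annotations; the concept names are illustrative only.
annotations = [
    ["plane", "buildings", "sky"],
    ["plane", "buildings"],
    ["sky", "clouds"],
    ["plane", "sky"],
]
multi_concepts = mine_multi_concepts(annotations, min_support=2)
expanded = [expand_labels(labels, multi_concepts) for labels in annotations]
```

In this toy data, the pair ("buildings", "plane") occurs in two images and is therefore kept as a scene-level label alongside the single concepts. A classifier trained on the expanded label sets would then score the scene as a unit rather than inferring it from its parts.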

Acknowledgments

This work was supported by the National Natural Science Foundation of China (No. 61877020), the GDUPS (2015), the CSC (No. 201706755023) and the China Postdoctoral Science Foundation (No. 2016M600657 and 2017T100637).

Author information

Authors and Affiliations

School of Information Technology in Education, South China Normal University, Guangzhou, China
Haijiao Xu, Changqin Huang & Muxiong Huang
Guangdong Engineering Research Center for Smart Learning, South China Normal University, Guangzhou, China
Changqin Huang
School of Computing and Mathematics, Charles Sturt University, Albury, Australia
Xiaodi Huang

Corresponding author

Correspondence to Changqin Huang.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Xu, H., Huang, C., Huang, X. et al. Multi-modal multi-concept-based deep neural network for automatic image annotation. Multimed Tools Appl 78, 30651–30675 (2019). https://doi.org/10.1007/s11042-018-6555-7

Received: 01 June 2018
Revised: 09 August 2018
Accepted: 15 August 2018
Published: 24 August 2018
Issue Date: November 2019
DOI: https://doi.org/10.1007/s11042-018-6555-7
