Abstract
Inspired by the success of the CNN-RNN framework in image captioning, several works have explored it for multi-label image annotation, with the hope that an RNN following a CNN would encode inter-label dependencies better than a CNN alone. To do so, earlier methods converted the ground-truth label-set of each training sample into a sequence of labels based on label frequencies (e.g., rare-to-frequent) for training the RNN. However, since the ground-truth is an unordered set of labels, imposing a fixed, predefined sequence on it does not naturally align with this task. To address this, some recent papers have proposed techniques capable of training the RNN without feeding the ground-truth labels in any particular sequence/order. However, most of these techniques leave it to the RNN to implicitly choose one sequence for the ground-truth labels of each sample during training, making them inherently biased. In this paper, we address this limitation and propose a novel approach in which the RNN is explicitly forced to learn multiple relevant inter-label dependencies, without requiring the ground-truth labels in any particular order. Through thorough empirical comparisons, we demonstrate that our approach outperforms several state-of-the-art techniques on two popular datasets (MS-COCO and NUS-WIDE). Additionally, it provides a new perspective of viewing an unordered set of labels as equivalent to a collection of different permutations (sequences) of those labels, thus naturally aligning with the image annotation task. Our code is available at: https://github.com/ayushidutta/multi-order-rnn.
A. Dutta—The author did most of this work while she was a student at IIIT Hyderabad, India.
Acknowledgement
YV would like to thank the Department of Science and Technology (India) for the INSPIRE Faculty Award 2017.
© 2020 Springer Nature Switzerland AG
Cite this paper
Dutta, A., Verma, Y., Jawahar, C.V. (2020). Recurrent Image Annotation with Explicit Inter-label Dependencies. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, JM. (eds) Computer Vision – ECCV 2020. ECCV 2020. Lecture Notes in Computer Science(), vol 12374. Springer, Cham. https://doi.org/10.1007/978-3-030-58526-6_12
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-58525-9
Online ISBN: 978-3-030-58526-6