
Recurrent Image Annotation with Explicit Inter-label Dependencies

  • Conference paper
In: Computer Vision – ECCV 2020 (ECCV 2020)

Abstract

Inspired by the success of the CNN-RNN framework in the image captioning task, several works have explored it for multi-label image annotation, with the hope that an RNN following a CNN would encode inter-label dependencies better than a CNN alone. To do so, for each training sample, the earlier methods converted the ground-truth label-set into a sequence of labels based on their frequencies (e.g., rare-to-frequent) for training the RNN. However, since the ground-truth is an unordered set of labels, imposing a fixed and predefined sequence on it does not naturally align with this task. To address this, some recent papers have proposed techniques capable of training the RNN without feeding the ground-truth labels in any particular sequence/order. However, most of these techniques leave it to the RNN to implicitly choose one sequence for the ground-truth labels of each sample during training, thus making it inherently biased. In this paper, we address this limitation and propose a novel approach in which the RNN is explicitly forced to learn multiple relevant inter-label dependencies, without the need to feed the ground-truth in any particular order. Through thorough empirical comparisons, we demonstrate that our approach outperforms several state-of-the-art techniques on two popular datasets (MS-COCO and NUS-WIDE). Additionally, it provides a new perspective of looking at an unordered set of labels as equivalent to a collection of different permutations (sequences) of those labels, thus naturally aligning with the image annotation task. Our code is available at: https://github.com/ayushidutta/multi-order-rnn.
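The abstract's central idea, viewing an unordered label set as the collection of all its permutations (sequences), can be illustrated with a minimal sketch. This is not the authors' implementation (see their repository for that); it assumes a hypothetical `step_probs` structure holding the per-step label distributions an RNN decoder would produce, and scores every permutation of the ground-truth set by its negative log-likelihood:

```python
# Sketch: treat an unordered label set as a collection of label sequences.
# step_probs[t][label] stands in for an RNN decoder's distribution at step t.
import itertools
import math

def best_permutation_nll(step_probs, label_set):
    """Return (order, nll): the permutation of label_set with the lowest
    negative log-likelihood under the per-step distributions."""
    best_order, best_nll = None, float("inf")
    for order in itertools.permutations(sorted(label_set)):
        nll = -sum(math.log(step_probs[t][lab]) for t, lab in enumerate(order))
        if nll < best_nll:
            best_order, best_nll = order, nll
    return best_order, best_nll

# Toy example: two labels, two decoding steps.
probs = [{"cat": 0.7, "dog": 0.3}, {"cat": 0.2, "dog": 0.8}]
order, nll = best_permutation_nll(probs, {"cat", "dog"})
print(order)  # ('cat', 'dog') has higher likelihood than ('dog', 'cat')
```

A training scheme in this spirit could, for example, back-propagate through the best-scoring sequence (or through several of them) instead of a fixed frequency-based order; enumeration is exponential in the label count, so real systems restrict or sample the permutations considered.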

A. Dutta—The author did most of this work while she was a student at IIIT Hyderabad, India.



Acknowledgement

YV would like to thank the Department of Science and Technology (India) for the INSPIRE Faculty Award 2017.

Author information

Corresponding author

Correspondence to Ayushi Dutta.

Copyright information

© 2020 Springer Nature Switzerland AG

About this paper

Cite this paper

Dutta, A., Verma, Y., Jawahar, C.V. (2020). Recurrent Image Annotation with Explicit Inter-label Dependencies. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.M. (eds.) Computer Vision – ECCV 2020. Lecture Notes in Computer Science, vol. 12374. Springer, Cham. https://doi.org/10.1007/978-3-030-58526-6_12

Download citation

  • DOI: https://doi.org/10.1007/978-3-030-58526-6_12

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-58525-9

  • Online ISBN: 978-3-030-58526-6

  • eBook Packages: Computer Science (R0)
