
Multi-label image classification with recurrently learning semantic dependencies

The Visual Computer

Abstract

Recognizing multi-label images is a significant but challenging task in high-level visual understanding. Remarkable success has been achieved by models based on the CNN–RNN design, which capture the underlying semantic dependencies among labels and predict label distributions over the global-level features output by CNNs. However, such global-level features often fuse the information of multiple objects, making it difficult to recognize small objects and to capture label correlations. To better solve this problem, in this paper we propose a novel multi-label image classification framework that improves on the CNN–RNN design pattern. By introducing an attention network module into the CNN–RNN architecture, object features are separated into per-channel attention maps, which are then fed to an LSTM network to capture label dependencies and predict labels sequentially. A category-wise max-pooling operation then integrates these sequential predictions into the final prediction. Experimental results on the PASCAL VOC 2007 and MS-COCO datasets demonstrate that our model can effectively exploit the correlations between labels to improve classification performance and to better recognize small targets.
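The pipeline described in the abstract (per-channel attention maps pooling CNN features into region descriptors, an LSTM scoring labels at each step, and category-wise max-pooling over the steps) can be sketched as follows. This is a minimal illustration, not the authors' implementation: all sizes, the 1×1-convolution attention weights, and the single-layer LSTM cell are assumptions chosen for brevity.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def lstm_step(x, h, c, Wx, Wh, b):
    """One step of a standard LSTM cell; the four gates are stacked in one matrix."""
    z = Wx @ x + Wh @ h + b
    i, f, g, o = np.split(z, 4)
    sig = lambda t: 1.0 / (1.0 + np.exp(-t))
    c = sig(f) * c + sig(i) * np.tanh(g)
    h = sig(o) * np.tanh(c)
    return h, c

# Hypothetical sizes: C feature channels on an H x W grid, K attention
# channels (one per attended object), L labels, D LSTM hidden units.
C, H, W, K, L, D = 512, 14, 14, 5, 20, 64

feat = rng.standard_normal((C, H, W))        # CNN feature map
W_att = 0.01 * rng.standard_normal((K, C))   # 1x1 conv producing K attention maps

# One spatial attention map per channel, normalized over locations.
att = softmax(W_att @ feat.reshape(C, -1), axis=-1)   # (K, H*W)

# Channel-separated object features: each map pools the grid into a C-dim vector.
regions = att @ feat.reshape(C, -1).T                 # (K, C)

# Run the LSTM over the K region descriptors, scoring every label at each step.
Wx = 0.01 * rng.standard_normal((4 * D, C))
Wh = 0.01 * rng.standard_normal((4 * D, D))
b = np.zeros(4 * D)
W_cls = 0.01 * rng.standard_normal((L, D))

h, c_state = np.zeros(D), np.zeros(D)
step_scores = []
for k in range(K):
    h, c_state = lstm_step(regions[k], h, c_state, Wx, Wh, b)
    step_scores.append(W_cls @ h)            # per-step label scores
step_scores = np.stack(step_scores)          # (K, L)

# Category-wise max-pooling: the final score for each label is its maximum
# over the K sequential predictions.
final_scores = step_scores.max(axis=0)       # (L,)
print(final_scores.shape)
```

The max-pooling step means a label needs to be predicted confidently at only one of the K attended regions, which is what allows a small object's dedicated attention channel to contribute to the final prediction without being drowned out by the global feature.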





Funding

Funding was provided by the National Natural Science Foundation of China (Grant No. 61672202).

Author information

Correspondence to Juan Yang.


About this article


Cite this article

Chen, L., Wang, R., Yang, J. et al. Multi-label image classification with recurrently learning semantic dependencies. Vis Comput 35, 1361–1371 (2019). https://doi.org/10.1007/s00371-018-01615-0

