DOI: 10.1145/3240508.3240649

Attentive Recurrent Neural Network for Weak-supervised Multi-label Image Classification

Published: 15 October 2018

Abstract

Multi-label image classification is a fundamental and challenging task in computer vision, and it has recently seen significant progress from exploiting semantic relations among labels. However, the spatial positions of the labels in multi-label images are usually not provided in real scenarios, which poses an insuperable barrier to conventional models. In this paper, we propose an end-to-end attentive recurrent neural network for multi-label image classification under only image-level supervision, which learns discriminative feature representations and models the label relations simultaneously. First, inspired by the attention mechanism, we propose a recurrent highlight network (RHN) that focuses on the most relevant regions of the image to learn discriminative feature representations for different objects in an iterative manner. Second, we develop a gated recurrent relation extractor (GRRE) that models the label relations using multiplicative gates in a recurrent fashion, learning to decide how the multiple labels of an image influence the relation extraction. Extensive experiments on three benchmark datasets show that our model outperforms state-of-the-art methods, and that it performs better on small-object categories and in scenarios with a large number of labels.
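The abstract gives no equations for the RHN or the GRRE, so the following is only a rough illustration of the two ideas it names: iteratively attending to image regions, and updating a recurrent state through multiplicative gates. It is not the authors' implementation. As assumptions made for this sketch, plain soft attention stands in for the region-selection step, a GRU-style update stands in for the gated relation extractor, all function names (`attend`, `gated_update`, `classify`) are hypothetical, and the weights are random and untrained.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def matvec(W, v):
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def attend(regions, state):
    """Soft attention (an assumption, not the paper's RHN): score each region
    feature by its dot product with the current state, then average."""
    scores = [sum(r * s for r, s in zip(region, state)) for region in regions]
    weights = softmax(scores)
    dim = len(regions[0])
    glimpse = [sum(w * region[d] for w, region in zip(weights, regions))
               for d in range(dim)]
    return glimpse, weights

def gated_update(state, glimpse, Wz, Wh):
    """GRU-style step (an assumption, not the paper's GRRE): a multiplicative
    gate z decides how much of the candidate replaces the old state."""
    joint = state + glimpse  # list concatenation
    z = [sigmoid(a) for a in matvec(Wz, joint)]
    cand = [math.tanh(a) for a in matvec(Wh, joint)]
    return [(1.0 - zi) * si + zi * ci for zi, si, ci in zip(z, state, cand)]

def classify(regions, Wz, Wh, Wout, steps=3):
    """Alternate attention and gated updates, then score every label
    independently with a sigmoid (multi-label output)."""
    state = [0.0] * len(regions[0])
    for _ in range(steps):
        glimpse, _ = attend(regions, state)
        state = gated_update(state, glimpse, Wz, Wh)
    return [sigmoid(a) for a in matvec(Wout, state)]

def random_matrix(rows, cols, rng):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

# Toy sizes chosen only for illustration.
rng = random.Random(0)
DIM, NUM_REGIONS, NUM_LABELS = 4, 6, 3
regions = random_matrix(NUM_REGIONS, DIM, rng)  # stand-in for CNN region features
Wz = random_matrix(DIM, 2 * DIM, rng)
Wh = random_matrix(DIM, 2 * DIM, rng)
Wout = random_matrix(NUM_LABELS, DIM, rng)

probs = classify(regions, Wz, Wh, Wout)
print(probs)  # one independent probability per label
```

Each label receives its own sigmoid score instead of competing in a softmax, which is what allows several labels to be predicted for the same image. Training with image-level labels only, as the abstract describes, would need a loss over these scores and is outside this sketch.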

Published In

MM '18: Proceedings of the 26th ACM international conference on Multimedia
October 2018
2167 pages
ISBN: 9781450356657
DOI: 10.1145/3240508
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. attentive recurrent neural network
  2. label relations modeling
  3. multi-label image classification
  4. reinforcement learning

Qualifiers

  • Research-article

Funding Sources

  • National Natural Science Foundation of China
  • National Basic Research Program of China
  • Key Research Program of Frontier Sciences

Conference

MM '18: ACM Multimedia Conference
October 22 - 26, 2018
Seoul, Republic of Korea

Acceptance Rates

MM '18 Paper Acceptance Rate 209 of 757 submissions, 28%;
Overall Acceptance Rate 2,145 of 8,556 submissions, 25%

Article Metrics

  • Downloads (Last 12 months): 13
  • Downloads (Last 6 weeks): 0
Reflects downloads up to 28 Feb 2025.

Cited By

  • (2025) Semantic Abstractions for Multi-label Classification. Artificial Intelligence Logic and Applications, 143-151. DOI: 10.1007/978-981-96-0354-1_12. Online publication date: 31-Jan-2025
  • (2024) Learning Domain Invariant Features for Unsupervised Indoor Depth Estimation Adaptation. ACM Transactions on Multimedia Computing, Communications, and Applications 20:9, 1-23. DOI: 10.1145/3672397. Online publication date: 13-Jun-2024
  • (2024) TFAD: An Image Multi-Label Recognition Method with Image-Text Powered Attention. 2024 International Joint Conference on Neural Networks (IJCNN), 1-8. DOI: 10.1109/IJCNN60899.2024.10650309. Online publication date: 30-Jun-2024
  • (2024) Active learning based on multi-enhanced views for classification of multiple patterns in lung ultrasound images. Computerized Medical Imaging and Graphics 118, 102454. DOI: 10.1016/j.compmedimag.2024.102454. Online publication date: Dec-2024
  • (2023) Hierarchical Visual Attribute Learning in the Wild. Proceedings of the 31st ACM International Conference on Multimedia, 3415-3423. DOI: 10.1145/3581783.3612274. Online publication date: 26-Oct-2023
  • (2023) Graph Attention Transformer Network for Multi-label Image Classification. ACM Transactions on Multimedia Computing, Communications, and Applications 19:4, 1-16. DOI: 10.1145/3578518. Online publication date: 27-Feb-2023
  • (2023) Bidirectional Relationship Inferring Network for Referring Image Localization and Segmentation. IEEE Transactions on Neural Networks and Learning Systems 34:5, 2246-2258. DOI: 10.1109/TNNLS.2021.3106153. Online publication date: May-2023
  • (2023) An End-to-End Blind Image Quality Assessment Method Using a Recurrent Network and Self-Attention. IEEE Transactions on Broadcasting 69:2, 369-377. DOI: 10.1109/TBC.2022.3215249. Online publication date: Jun-2023
  • (2022) Fine-grained Image Classification via Multi-scale Selective Hierarchical Biquadratic Pooling. ACM Transactions on Multimedia Computing, Communications, and Applications 18:1s, 1-23. DOI: 10.1145/3492221. Online publication date: 25-Jan-2022
  • (2022) Semantic Supplementary Network With Prior Information for Multi-Label Image Classification. IEEE Transactions on Circuits and Systems for Video Technology 32:4, 1848-1859. DOI: 10.1109/TCSVT.2021.3083978. Online publication date: Apr-2022
