Language-aware weak supervision for salient object detection
Introduction
Saliency detection [1], [2], which aims to capture the important instances or regions in an image, has received much attention in recent years, driven by deep neural networks [3]. Many supervised saliency methods can efficiently highlight a distinct object with accurate boundaries using pixel-level ground truth. However, annotating each pixel is time-consuming and arduous, requiring a great deal of effort and labor to create a large-scale dataset. To alleviate this situation, there has been keen recent interest in weak supervision using image-level tags, such as labels or phrases. Most existing weakly supervised detection methods treat high-level convolutional features as the key saliency detectors and integrate the semantic feature maps to extract class-aware visual representations. Using these class-aware representations, which distill information down to the salient objects, is one of the effective solutions in saliency detection [4]. However, such tags carry limited information and must rely on the agnostic semantic meanings learned by DNNs, resulting in uncontrollable object predictions and incomplete coverage of foreground areas.
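Class-aware representations of this kind are typically obtained CAM-style: the final convolutional feature maps are collapsed with the classifier weights of a predicted class to form a coarse saliency prior. A minimal NumPy sketch of that collapse (the function name and shapes are illustrative, not from the paper):

```python
import numpy as np

def class_activation_map(feature_maps, class_weights):
    """Collapse C feature maps (C, H, W) into one saliency prior (H, W)
    by weighting each channel with the classifier weight for one class."""
    cam = np.tensordot(class_weights, feature_maps, axes=(0, 0))  # (H, W)
    cam = np.maximum(cam, 0.0)        # keep only positive class evidence
    if cam.max() > 0:
        cam = cam / cam.max()         # normalize to [0, 1]
    return cam
```

Such a prior locates class-discriminative regions, but, as noted above, it tends to cover the foreground only partially.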
Although image-level tags indicate the presence or absence of object categories in an image, they cannot effectively help the network predict a detailed, full-extent saliency map. Rather than being fixed on image categories, natural language describing an image (i.e., captions) is a high-level global concept and provides rich saliency cues, including location and appearance. Some deep captioning models [5] have succeeded in learning visual representations that translate an input image into natural language, but they do not further explore the potential relation with visual saliency. Therefore, inspired by weakly supervised structures that use tag knowledge to handle pixel-level information, we aim to exploit the contextual information in natural language to measure the dominant visual content of an image, supervising the detection network for better performance. Ramanishka et al. [6] have already explored caption-guided saliency detection: their pioneering end-to-end model addresses video saliency, but only produces spatial or spatiotemporal heat maps for each input caption. In contrast, we aim to extract a highlighted salient region with finely detailed boundaries by exploring the potential relationship between the feature representations of a static visual image and its corresponding natural language description (shown in Fig. 1).
To bridge the gap between different modalities, previous approaches try to find a good metric that represents data from different modalities as low-dimensional vectors whose distances/similarities reflect their semantic relations. Sound source localization is handled by learning the correspondence between a visual scene and its sound, while cross-modal retrieval finds a low-dimensional latent common space where multi-modal data can be compared directly. Although these methods take similar views on aligning features of two modalities, they often let the global representation dominate, describing only superficial information about where and what a document or source contains. More importantly, our method goes a step further: we learn to detect salient objects from limited textual information and generate a finer foreground saliency map with detailed edges. Instead of addressing this difficult task with a simple feedforward network, our approach adopts a steady strategy: first finding the visual content that matches the linguistic description, and then refining it with local contextual information. The proposed approach contains two sub-networks: a Feature Matching Network (FMN) and a Recurrent Fine-tune Network (RFN). By transforming the input image and the corresponding caption into a latent feature space, the FMN discovers a semantic matching that establishes textual-visual pairwise affinities. This pairwise matching is measured by an objective function requiring that visual and linguistic features belonging to the same identity have similar feature distributions, thus yielding an initial estimate of the saliency map. In this feedforward processing, the coarse map already succeeds in locating the objects described in the language sentence, but fails to preserve enough low-level boundary or texture information.
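The pairwise-affinity idea behind the FMN can be illustrated with a toy cosine-similarity sketch: per-location visual embeddings are scored against a sentence embedding in a shared latent space, and locations that match the caption score high. This is only a hedged illustration in NumPy; the names and shapes are ours, and the paper's actual objective and architecture differ:

```python
import numpy as np

def textual_visual_affinity(visual_feats, text_feat):
    """visual_feats: (H*W, D) per-location visual embeddings;
    text_feat: (D,) caption embedding in the shared latent space.
    Returns an (H*W,) affinity map of cosine similarities."""
    v = visual_feats / (np.linalg.norm(visual_feats, axis=1, keepdims=True) + 1e-8)
    t = text_feat / (np.linalg.norm(text_feat) + 1e-8)
    return v @ t  # one similarity score per spatial location
```

Reshaping the affinity vector back to (H, W) gives a coarse, caption-conditioned saliency estimate of the kind the FMN produces.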
Instead of using common post-processing or handcrafted optimization, we construct a recurrent structure, the RFN, to recover more details of the estimated map; it uses a refinement module that learns by self-supervision. We compare our approach with most existing unsupervised and supervised saliency methods on large-scale datasets, and the results indicate that our approach captures more accurate regions and detailed boundaries at a fast speed of 72 FPS. The flexibility of our framework also makes it possible to transform it into dense models and achieve better performance in the future.
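The self-supervised refinement idea can be caricatured in a few lines: each step treats a binarized version of the current map as its own pseudo-label, pulls the map toward it, and modulates the result with low-level image cues so boundaries are respected. This is purely an illustrative sketch, not the RFN's actual refinement module; all names and the update rule are our assumptions:

```python
import numpy as np

def refine_recurrently(coarse_map, image_edges, steps=3, alpha=0.5):
    """coarse_map, image_edges: (H, W) arrays in [0, 1].
    Each step sharpens the map toward its own binarized output
    (self-supervision), weighted by edge evidence from the image."""
    m = coarse_map.copy()
    for _ in range(steps):
        pseudo = (m > m.mean()).astype(float)   # self-generated pseudo-label
        m = (1 - alpha) * m + alpha * pseudo    # pull toward confident regions
        m = np.clip(m * (0.5 + 0.5 * image_edges), 0.0, 1.0)  # keep boundaries
    return m
```

Iterating this loop progressively polarizes the map, which mimics (in a very crude way) how recurrent refinement recovers crisper foreground boundaries without ground-truth masks.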
The contributions of this paper are summarized as follows:
- We design the first language-aware saliency detection framework and clearly demonstrate that, with textual information from natural language, the network can robustly and accurately describe the visual object and generate a finely detailed saliency map without any pixel-level annotations.
- We propose a novel Feature Matching Network to establish textual-visual pairwise affinities that explain the internal relation between language and image, providing an important saliency prior for detection.
- We leverage a self-supervision mechanism to progressively refine the fine-tune network, and the results demonstrate strong competitiveness against existing supervised methods.
Section snippets
Saliency detection
Detection research has been going on for many years, and some traditional algorithms [7], [8] have been successfully applied to detect generic salient objects. However, the breakthrough in saliency performance came after deep learning models were widely employed. Early methods such as MDF [9], MCDL [10], and LEGS [2] mainly focus on aggregating low-level local features with high-level semantic meanings to achieve visible improvement. They act on small patches and incorporate multiple…
Proposed method
Language-based saliency detection is a high-level matching problem in which two essential issues must be considered: finding semantic content that matches the corresponding linguistic concept, and recovering fine details without any pixel-level annotations. In this work, we address these two questions by proposing a weakly supervised approach, which leverages a cross-modal textual-visual matching structure to describe the salient objects and enhances the prediction accuracy of each…
Experimental results
We train the FMN on the Microsoft COCO caption evaluation dataset, where only the captions are used as supervision. To reduce the complexity of the COCO training data, we also train the RFN on the images of the DUTS-TR dataset [4] without any ground truth. In the test stage, for fair comparison, we generate the saliency maps from the refined RFN without post-processing. All the proposed algorithms are implemented in Caffe and MATLAB (Caffe is used for network training of the FMN and…
Conclusion
In this paper, we propose a weakly supervised saliency detection method that constructs a matching-relation network between the visual image and natural language and preserves more informative details with a recurrent network. We demonstrate that the rich textual information learned from captions has a complete concept that covers the dominant visual attention in high-level semantic patterns, revealing an internal relation between language and image. By establishing the textual-visual pairwise…
References (47)
- et al., Saliency detection via conditional adversarial image-to-image network, Neurocomputing (2018)
- et al., Deep visual tracking: review and experimental comparison, Pattern Recognit. (2018)
- et al., Salient object detection via multi-scale attention CNN, Neurocomputing (2018)
- et al., Unsupervised image saliency detection with gestalt-laws guided optimization and visual attention based refinement, Pattern Recognit. (2018)
- et al., Multi-modal fusion network with multi-scale multi-path and cross-modal interactions for RGB-D salient object detection, Pattern Recognit. (2019)
- et al., Graph model-based salient object detection using objectness and multiple saliency cues, Neurocomputing (2019)
- et al., Deep networks for saliency detection via local estimation and global search, Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (2015)
- et al., Learning to detect salient objects with image-level supervision, Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (2017)
- et al., Show and tell: a neural image caption generator, Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (2015)
- et al., Top-down visual saliency guided by captions, Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (2017)
- Ranking saliency, IEEE Trans. Pattern Anal. Mach. Intell.
- Saliency detection via absorbing Markov chain with learnt transition probability, IEEE Trans. Image Process.
- Visual saliency based on multiscale deep features, Proceedings of IEEE Conference on Computer Vision and Pattern Recognition
- Saliency detection by multi-context deep learning, Proceedings of IEEE Conference on Computer Vision and Pattern Recognition
- Deeply supervised salient object detection with short connections, Proceedings of IEEE Conference on Computer Vision and Pattern Recognition
- Instance-level salient object segmentation, Proceedings of IEEE Conference on Computer Vision and Pattern Recognition
- Saliency detection with recurrent fully convolutional networks, Proceedings of European Conference on Computer Vision
- S3MKL: scalable semi-supervised multiple kernel learning for real-world image applications, IEEE Trans. Multimed.
- Weakly-supervised image annotation and segmentation with objects and attributes, IEEE Trans. Pattern Anal. Mach. Intell.
- Weakly supervised instance segmentation using class peak response, Proceedings of IEEE Conference on Computer Vision and Pattern Recognition
- Weakly supervised attention learning for textual phrases grounding, CoRR
- Weakly supervised learning for salient object detection, CoRR
- Joint global and co-attentive representation learning for image-sentence retrieval, ACM Multimedia Conference
Mingyang Qian received her B.E. degree in electrical and information engineering from Dalian University of Technology (DUT), China, in 2018. She is currently a master's student in Signal and Information Processing at DUT. Her research interests are saliency detection and video object segmentation.
Mengyang Feng received the B.E. degree in electrical and information engineering from the Dalian University of Technology in 2015, where he is currently pursuing the Ph.D. degree under the supervision of Prof. H. Lu.
Lihe Zhang received the Ph.D. degree in signal and information processing from Beijing University of Posts and Telecommunications, Beijing, China, in 2004. He is currently an Associate Professor with the School of Information and Communication Engineering, Dalian University of Technology. His research interests include pattern recognition and computer vision.
Jinqing Qi received the Ph.D. degree in communication and integrated systems from Tokyo Institute of Technology, Tokyo, Japan, in 2004. He is currently an Associate Professor of Information and Communication Engineering at Dalian University of Technology (DUT), Dalian, China. His recent research interests focus on computer vision, pattern recognition, and machine learning. He is a member of IEEE.
Huchuan Lu received the M.S. degree from the Department of Electrical Engineering, Dalian University of Technology (DUT), China, in 1998, and the Ph.D. degree in System Engineering from DUT in 2008. Since 1998, he has been on the faculty of the School of Electronic and Information Engineering of DUT, where he has been an associate professor since 2006. He visited Ritsumeikan University from Oct. 2007 to Jan. 2008. His recent research interests focus on computer vision, artificial intelligence, pattern recognition, and machine learning. He is a member of IEEE and IEICE.