
Pattern Recognition

Volume 96, December 2019, 106955

Language-aware weak supervision for salient object detection

https://doi.org/10.1016/j.patcog.2019.06.021

Highlights

  • We learn textual information from natural language to detect salient objects.

  • We establish the textual-visual pairwise affinities to explore saliency priors.

  • We leverage a recurrent self-supervision mechanism for saliency refinement.

  • The proposed algorithm performs competitively against existing saliency methods.

Abstract

Natural language processing has achieved remarkable performance on a multitude of tasks, but the potential of textual information has not been fully explored in visual saliency detection. In this paper, we learn to detect salient objects from natural language by addressing two essential issues: finding semantic content that matches the corresponding linguistic concept, and recovering fine details without any pixel-level annotations. We first propose the Feature Matching Network (FMN) to explore the internal relation between the linguistic concept and the visual image in the semantic space. The FMN simultaneously establishes the textual-visual pairwise affinities and generates a language-aware coarse saliency map. To refine the coarse map, the Recurrent Fine-tune Network (RFN) is proposed to progressively enhance its prediction through self-supervision. Our approach leverages only the caption to provide important cues about the salient object, yet generates a fine-detailed foreground map at 72 FPS without any post-processing. Extensive experiments demonstrate that our method takes full advantage of the textual information in natural language for saliency detection and performs favorably against state-of-the-art approaches on most existing datasets.

Introduction

Saliency detection [1], [2], which aims to capture the important instances or regions in an image, has received much attention in recent years, driven by deep neural networks [3]. Many supervised saliency methods can efficiently highlight a distinct object with accurate boundaries using pixel-level ground truth. However, annotating every pixel is time-consuming and arduous, and building a large-scale dataset this way demands a great deal of effort and labor. To alleviate this situation, there has been keen recent interest in weak supervision using image-level tags, such as labels or phrases. Most existing weakly supervised detection methods treat high-level convolutional features as important saliency detectors and integrate the semantic feature maps to extract class-aware visual representations. Using these class-aware representations, which distill information down to the salient objects, is one of the effective solutions in saliency detection [4]. However, such tags carry a limited amount of information and must rely on the semantic meanings learned implicitly by DNNs, resulting in uncontrollable object predictions and incomplete coverage of foreground areas.

Although image-level tags indicate the presence or absence of object categories, they cannot effectively help the network predict a full-extent, detailed saliency map. Rather than being fixed on image categories, natural language descriptions of images (i.e., captions) express a high-level global concept and provide rich saliency cues, including location and appearance. Some deep captioning models [5] have succeeded in learning visual representations that translate an input image into natural language, but they do not further explore the potential relation to visual saliency. Therefore, inspired by weakly supervised structures that use tag knowledge to deal with pixel-level information, we aim to utilize the contextual information in natural language to measure the dominant visual content of an image, supervising the detection network for better performance. Ramanishka et al. [6] have already explored caption-guided saliency detection, pioneering video saliency detection with an end-to-end model, but they only produce spatial or spatiotemporal heat maps for each input caption. Going further, we aim to extract a highlighted salient region with fine-detailed boundaries by exploring the potential relationship between the feature representations of a static visual image and its corresponding natural language description (shown in Fig. 1).

To bridge the cross-modal gap between different modalities, previous approaches try to find a good metric that accurately represents each modality with low-dimensional vectors whose distances/similarities reflect semantic relations. Sound source localization is handled by learning the correspondence between the visual scene and the sound, while cross-modal retrieval finds a low-dimensional latent common space in which multi-modal data can be compared directly. Although these methods take similar views on aligning features of two modalities, they often rely on a global representation that describes only superficial information about where and what a document or source contains. More importantly, our method goes a step further: we learn to detect salient objects from limited textual information and generate a finer foreground saliency map with detailed edges. Instead of addressing this difficult task with a single feedforward network, our approach adopts a steady two-step strategy: it first finds visual content matching the linguistic description and then refines it using local contextual information. The proposed approach contains two sub-networks: the Feature Matching Network (FMN) and the Recurrent Fine-tune Network (RFN). By transforming the input image and the corresponding caption into a latent feature space, the FMN discovers a semantic matching that establishes the textual-visual pairwise affinities. This pairwise matching is measured by an objective function which encourages visual and linguistic features belonging to the same identity to have similar feature distributions, thus yielding an initial estimate of the saliency map. After the feedforward pass, the coarse map already succeeds in locating the objects described in the sentence, but fails to preserve enough low-level boundary and texture information. Instead of using common post-processing or handcrafted optimization, we construct a recurrent structure, the RFN, to recover more details of the estimated map; it uses a refinement module that learns by self-supervision. We compare our approach with most existing unsupervised and supervised saliency methods on large-scale datasets, and the results indicate that our approach captures more accurate regions and detailed boundaries at a fast speed of 72 FPS. The flexibility of our framework also makes it possible to extend it to dense models and achieve better performance in the future.
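As a concrete illustration of the first (matching) stage, the sketch below projects per-location visual features and a caption embedding into a shared space and scores each spatial location by cosine affinity to produce a coarse, language-aware map. This is our own simplification, not the authors' released code; the function and variable names are hypothetical, and we assume both modalities have already been embedded into the same latent space.

```python
import numpy as np

def coarse_saliency(visual_feats, text_feat, eps=1e-8):
    """Hypothetical sketch of a textual-visual affinity map.

    visual_feats: (H, W, D) per-location image features, assumed already
                  projected into the shared latent space.
    text_feat:    (D,) caption embedding in the same space.
    Returns an (H, W) coarse saliency map rescaled to [0, 1].
    """
    # L2-normalize both modalities so the dot product is a cosine affinity.
    v = visual_feats / (np.linalg.norm(visual_feats, axis=-1, keepdims=True) + eps)
    t = text_feat / (np.linalg.norm(text_feat) + eps)
    affinity = v @ t  # (H, W): cosine similarity between each location and the caption
    # Min-max rescale to form the coarse map.
    return (affinity - affinity.min()) / (affinity.max() - affinity.min() + eps)
```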

The contributions of this paper are summarized as follows:

  • We first design a language-aware saliency detection framework and clearly demonstrate that, with textual information from natural language alone, the network can robustly and accurately describe the visual object and generate a fine-detailed saliency map without any pixel-level annotations.

  • We propose a novel Feature Matching Network to establish the textual-visual pairwise affinities for explaining the internal relation between language and image, which provides an important saliency prior for detection.

  • We leverage a self-supervision mechanism to progressively refine the fine-tune network (see the sketch after this list), and the results demonstrate strong competitiveness against existing supervised methods.
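To make the self-supervision idea in the last contribution concrete, here is a minimal sketch of the recurrent refinement loop as we read it. The `refine_step` callable is hypothetical (standing in for the RFN's refinement module); the key point is that each refined prediction becomes the pseudo label supervising the next pass, so no human-annotated masks are required.

```python
def recurrent_refinement(image, coarse_map, refine_step, num_iters=3):
    """Hypothetical recurrent self-supervision loop (not the authors' code).

    refine_step(image, pseudo_label) -> refined saliency map. The output of
    each pass is reused as the pseudo label for the next one, progressively
    sharpening the map without any pixel-level ground truth.
    """
    pseudo_label = coarse_map
    for _ in range(num_iters):
        pseudo_label = refine_step(image, pseudo_label)
    return pseudo_label
```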

Section snippets

Saliency detection

Saliency detection has been studied for many years, and some traditional algorithms [7], [8] have been successfully applied to detect generic salient objects. However, the breakthrough in saliency performance came after deep learning models were widely adopted. Early methods such as MDF [9], MCDL [10], and LEGS [2] mainly focus on aggregating low-level localized features with high-level semantic meanings to obtain visible improvement. They act on small patches and incorporate multiply

Proposed method

Language-based saliency detection is a high-level matching problem in which two essential issues must be considered: finding semantic content that matches the corresponding linguistic concept, and recovering fine details without any pixel-level annotations. In this work, we address these two questions with a weakly supervised approach, which leverages a cross-modal textual-visual matching structure to describe the salient objects, and enhance the prediction accuracy of each
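The snippet above describes the matching objective only at a high level; the paper's exact loss is not reproduced here. As an illustrative stand-in, one common way to encourage the two features of the same image-caption pair to share a distribution is a symmetric KL divergence over softmax-normalized features:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def distribution_matching_loss(visual_feat, text_feat, eps=1e-8):
    """Illustrative stand-in for the textual-visual matching objective:
    features belonging to the same identity should induce similar
    distributions. Here we penalize their symmetric KL divergence."""
    p, q = softmax(visual_feat), softmax(text_feat)
    kl = lambda a, b: np.sum(a * np.log((a + eps) / (b + eps)))
    return 0.5 * (kl(p, q) + kl(q, p))
```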

Experimental results

We train the FMN on the Microsoft COCO caption evaluation dataset, using only the captions as supervision. To reduce the complexity of the training data in COCO, we also train the RFN on the images of the DUTS-TR dataset [4] without any ground truth. At test time, for fair comparison, we generate the saliency maps from the refined RFN without post-processing. All the proposed algorithms are implemented in Caffe and MATLAB (Caffe is used for network training of FMN and
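For reference, the two metrics most commonly reported in this literature can be computed as follows. This is a generic sketch of the standard definitions, not the paper's evaluation code; the adaptive threshold of twice the mean saliency and the weighting beta² = 0.3 are common conventions in saliency detection.

```python
import numpy as np

def mae(pred, gt):
    """Mean absolute error between a saliency map and a binary mask, both in [0, 1]."""
    return np.abs(pred.astype(np.float64) - gt.astype(np.float64)).mean()

def f_measure(pred, gt, beta2=0.3, thresh=None, eps=1e-8):
    """Standard F-measure with beta^2 = 0.3; by convention the map is
    binarized at twice its mean saliency value (capped at 1)."""
    if thresh is None:
        thresh = min(2.0 * pred.mean(), 1.0)
    binary = pred >= thresh
    tp = np.logical_and(binary, gt > 0.5).sum()
    precision = tp / (binary.sum() + eps)
    recall = tp / ((gt > 0.5).sum() + eps)
    return (1 + beta2) * precision * recall / (beta2 * precision + recall + eps)
```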

Conclusion

In this paper, we propose a weakly supervised saliency detection method that constructs a matching-relation network between the visual image and natural language, and preserves more informative details with a recurrent network. We demonstrate that the rich textual information learned from captions forms a complete concept that covers the dominant visual attention in high-level semantic patterns, revealing an internal relation between language and image. By establishing the textual-visual pairwise

References (47)

  • L. Zhang et al.

    Ranking saliency

    IEEE Trans. Pattern Anal. Mach. Intell.

    (2017)
  • L. Zhang et al.

    Saliency detection via absorbing Markov chain with learnt transition probability

    IEEE Trans. Image Process.

    (2018)
  • G. Li et al.

    Visual saliency based on multiscale deep features

    Proceedings of IEEE Conference on Computer Vision and Pattern Recognition

    (2015)
  • R. Zhao et al.

    Saliency detection by multi-context deep learning

    Proceedings of IEEE Conference on Computer Vision and Pattern Recognition

    (2015)
  • Q. Hou et al.

    Deeply supervised salient object detection with short connections

    Proceedings of IEEE Conference on Computer Vision and Pattern Recognition

    (2017)
  • G. Li et al.

    Instance-level salient object segmentation

    Proceedings of IEEE Conference on Computer Vision and Pattern Recognition

    (2017)
  • L. Wang et al.

    Saliency detection with recurrent fully convolutional networks

    Proceedings of European Conference on Computer Vision

    (2016)
  • S. Wang et al.

    S³MKL: scalable semi-supervised multiple kernel learning for real-world image applications

    IEEE Trans. Multimed.

    (2012)
  • Z. Shi et al.

    Weakly-supervised image annotation and segmentation with objects and attributes

    IEEE Trans. Pattern Anal. Mach. Intell.

    (2017)
  • Y. Zhou et al.

    Weakly supervised instance segmentation using class peak response

    Proceedings of IEEE Conference on Computer Vision and Pattern Recognition

    (2018)
  • Z. Fang et al.

    Weakly supervised attention learning for textual phrases grounding

    CoRR

    (2018)
  • H. Jiang

    Weakly supervised learning for salient object detection

    CoRR

    (2015)
  • S. Wang et al.

    Joint global and co-attentive representation learning for image-sentence retrieval

    ACM Multimedia Conference

    (2018)

Mingyang Qian received her B.E. degree in electrical and information engineering from Dalian University of Technology (DUT), China, in 2018. She is currently a master's student in signal and information processing at DUT. Her research interests are saliency detection and video object segmentation.

Mengyang Feng received the B.E. degree in electrical and information engineering from the Dalian University of Technology in 2015, where he is currently pursuing the Ph.D. degree under the supervision of Prof. H. Lu.

Lihe Zhang received the Ph.D. degree in signal and information processing from Beijing University of Posts and Telecommunications, Beijing, China, in 2004. He is currently an Associate Professor with the School of Information and Communication Engineering, Dalian University of Technology. His research interests include pattern recognition and computer vision.

Jinqing Qi received the Ph.D. degree in communication and integrated systems from the Tokyo Institute of Technology, Tokyo, Japan, in 2004. He is currently an Associate Professor of Information and Communication Engineering at Dalian University of Technology (DUT), Dalian, China. His recent research interests focus on computer vision, pattern recognition, and machine learning. He is a member of IEEE.

Huchuan Lu received the M.S. degree from the Department of Electrical Engineering, Dalian University of Technology (DUT), China, in 1998 and the Ph.D. degree in system engineering from DUT in 2008. Since 1998, he has been on the faculty of the School of Electronic and Information Engineering at DUT, where he has been an associate professor since 2006. He visited Ritsumeikan University from Oct. 2007 to Jan. 2008. His recent research interests focus on computer vision, artificial intelligence, pattern recognition, and machine learning. He is a member of IEEE and IEIC.
