Elsevier

Neurocomputing

Volume 447, 4 August 2021, Pages 257-271

Weakly supervised object-aware convolutional neural networks for semantic feature matching

https://doi.org/10.1016/j.neucom.2021.03.052

Abstract

We address the task of establishing visual correspondences between two images depicting main objects of the same semantic category. This task faces various challenges such as background clutter, intra-class variation, and viewpoint variation. Existing works are dominated by end-to-end training methods that rely on redundant computation or large amounts of manual annotation, and cannot generalize to unseen object categories. In this paper, we propose a weakly supervised object-aware convolutional neural network architecture for semantic feature matching that is trainable end-to-end without the requirement for manual annotations. The main component of this architecture is a similarity filter module containing a trainable neural nearest neighbors network. Since training data for semantic feature matching is rather limited, we introduce a simple and effective foreground selection strategy to produce foreground masks. These masks serve as a form of weak supervision signal for the correspondence task and help tackle background clutter. Extensive experiments illustrate that the proposed approach outperforms state-of-the-art methods for semantic feature matching on multiple public standard benchmark datasets.

Graphical abstract

We address the task of establishing visual correspondences between two images depicting main objects of the same semantic category. This task faces various challenges such as background clutter, intra-class variation, and viewpoint variation. Existing works are dominated by end-to-end training methods that rely on redundant computation or large amounts of manual annotation, and cannot generalize to unseen object categories. In this paper, we propose a weakly supervised object-aware convolutional neural network architecture for semantic feature matching that is trainable end-to-end without the requirement for manual annotations. The main component of this architecture is a similarity filter module containing a trainable neural nearest neighbors network. Since training data for semantic feature matching is rather limited, we develop an object-aware matching mechanism to enable weakly supervised learning and tackle background clutter. Extensive experiments illustrate that the proposed approach outperforms state-of-the-art methods for semantic feature matching on multiple public standard benchmark datasets.

Introduction

Establishing correspondences, traditionally defined as computing the associations among multiple images depicting the same scene or object, is one of the fundamental problems in computer vision and graphics. It has been widely used in a variety of graphics fields such as image stitching [1], [5], 3D reconstruction [3], [4], and stereo matching [2]. These methods search for correspondences with different handcrafted features, typically the Scale-Invariant Feature Transform (SIFT) [10], Histograms of Oriented Gradients (HOG) [11], Speeded Up Robust Features (SURF) [36], and some improved descriptors [49]. Some researchers have also sought better matching techniques [51], [52], [54], [55]. With the breakthrough in the representation capabilities of Convolutional Neural Networks (CNNs), many excellent matching algorithms have been proposed [47], [48], [50], [53], and semantic understanding-based matching has also been developed in recent years [13], [14]. Essentially, such matching is the basis for some rising fields such as semantic object segmentation [43], object detection [6], [37], and re-identification [7].

Semantic feature matching is concerned with estimating the correspondences between two objects of the same semantic category in different images. Existing methods can be roughly divided into two branches. The first branch aims to construct a post-processor [21], [14], [22]. The extracted handcrafted features [10], [22] or the learned CNN features are taken as inputs to the designed processor [14], [15]. Matching constraints are used to minimize the appearance matching cost and enforce geometric consistency between all candidate feature pairs. However, this branch generally achieves accuracy that is too low for further applications, so it is rarely used for semantic feature matching. The other branch is based on a correlation filter and CNNs. A similarity filter is trained by encoding the spatial consistency and semantic associations between intra-class objects. Existing methods develop different convolutional neural network architectures for the correspondence task that are trainable end-to-end to improve accuracy [18], [19], [20]. However, they generally estimate the parameters of the geometric transformation relating the input images instead of the matched features, which narrows their applicability. Meanwhile, they are sensitive to interference factors present in the images, such as background clutter and intra-class variation. Besides, the shallow neural network and large-kernel convolutions in [18] increase the computational complexity, and [19] relies strongly on synthetic datasets, which reduces the generalization capability of the model for unseen object categories.

In this work, we establish sparse feature associations between intra-class semantic objects, as shown in Fig. 1. This is challenging because of background clutter, intra-class variation, changes in viewpoint and illumination, and non-overlapping scenes or objects. Inspired by the state-of-the-art semantic feature matching method NCNet [18], we construct a weakly supervised convolutional matching network for the correspondence task. The key is to search for sufficient salient features and estimate the correspondences between two objects by fully exploiting their similar semantics. In contrast to the original version [18], we design an object-aware matching mechanism to alleviate background clutter. Essentially, our approach adopts a salient foreground selection strategy to produce foreground masks, which provide a form of weak supervision signal to train a re-ranking convolutional neural network. This mechanism can effectively constrain the nearest-neighbor search scope and perceive the main semantic regions. Specifically, we introduce a common 2-D re-ranking network instead of a complicated 4-D neighbourhood consensus module.
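To illustrate how a foreground mask can constrain the nearest-neighbor search scope, the following is a minimal Python sketch. It assumes that a matrix of pairwise feature similarities and binary foreground masks are already available; the function name `masked_nearest_neighbors` and the toy values are hypothetical. The sketch applies the mask as a hard filter at search time and does not reproduce the foreground selection strategy or the re-ranking network described here.

```python
import numpy as np


def masked_nearest_neighbors(sim, mask_a, mask_b):
    """Nearest-neighbor search restricted to foreground positions.

    sim    : (Na, Nb) similarity between positions of image A and image B.
    mask_a : (Na,) boolean, True where a position of A is foreground.
    mask_b : (Nb,) boolean, True where a position of B is foreground.
    Returns an (M, 2) integer array of (index_in_A, index_in_B) pairs.
    """
    sim = np.asarray(sim, dtype=float)
    mask_a = np.asarray(mask_a, dtype=bool).ravel()
    mask_b = np.asarray(mask_b, dtype=bool).ravel()
    if not mask_a.any() or not mask_b.any():
        return np.empty((0, 2), dtype=int)
    # Background columns can never win the argmax.
    masked = np.where(mask_b[None, :], sim, -np.inf)
    # Only foreground positions of A issue a query.
    idx_a = np.flatnonzero(mask_a)
    best_b = masked[idx_a].argmax(axis=1)
    return np.stack([idx_a, best_b], axis=1)


# Toy example (hypothetical values): position 1 is background in both images.
sim = np.array([[0.2, 0.9, 0.4],
                [0.8, 0.1, 0.3],
                [0.5, 0.6, 0.7]])
pairs = masked_nearest_neighbors(sim, [True, False, True], [True, False, True])
print(pairs.tolist())  # [[0, 2], [2, 2]]
```

Without the mask, position 0 of image A would be matched to the background position 1 of image B (similarity 0.9); with the mask, the search is confined to the foreground.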

We propose a weakly supervised object-aware convolutional neural network architecture for semantic feature matching, consisting of three main modules: feature extraction, similarity measurer, and similarity filter, as shown in Fig. 2. Given two input images depicting main objects of the same semantic category, we first adopt very weak supervision in the form of ImageNet pre-trained feature representations [28] for each image, which are analogous to dense local descriptors and readily available. These provide object-specific attribute representations and low-level context such as colors and edges. Then we implement an attribute transfer process to eliminate the interference caused by differences in color space. The purpose of this process is to alleviate the confusion caused by low-level visual features and, at the same time, to provide normalized data for further filtering. Further, a common correlation layer is used to match the feature representations across images into tentative correlation maps, namely the initial correlation maps.
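As a rough illustration of what a correlation layer computes, the Python sketch below correlates every position of one feature map with every position of the other. The L2 normalization of the descriptors (which makes the output a cosine similarity), the function name `correlation_map`, and the random stand-in features are assumptions for illustration; the paper's exact layer may differ.

```python
import numpy as np


def correlation_map(feat_a, feat_b, eps=1e-8):
    """Dense correlation between two feature maps.

    feat_a : (C, Ha, Wa) features of image A.
    feat_b : (C, Hb, Wb) features of image B.
    Returns a (Ha*Wa, Hb*Wb) matrix of cosine similarities.
    """
    channels = feat_a.shape[0]
    a = feat_a.reshape(channels, -1)  # (C, Ha*Wa)
    b = feat_b.reshape(channels, -1)  # (C, Hb*Wb)
    # Assumption: descriptors are L2-normalized along the channel axis.
    a = a / (np.linalg.norm(a, axis=0, keepdims=True) + eps)
    b = b / (np.linalg.norm(b, axis=0, keepdims=True) + eps)
    return a.T @ b


# Random features stand in for the pre-trained CNN outputs.
rng = np.random.default_rng(0)
feat_a = rng.standard_normal((16, 4, 5))
feat_b = rng.standard_normal((16, 3, 6))
corr = correlation_map(feat_a, feat_b)
print(corr.shape)  # (20, 18)
```

Each row of the result is the tentative correlation map of one position in image A over all positions in image B.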

Finally, a similarity filter module is introduced to produce the resulting correspondences. We first introduce a cycle consistency constraint to weight the initial correlation maps. This can initially distinguish between the inliers and outliers in the collection of correlated features, encourage one-to-one matching, and reduce the computational load of the filter network. Then we develop an object-aware matching mechanism. A neural nearest neighbors network (3N-Network) is driven by a specially designed semantic perception loss function. Motivated by the classical k-nearest neighbor matching strategy, this module enforces the nearest-neighbor searching process under a confident saliency constraint, which effectively mitigates the interference caused by background clutter. Specifically, the filter module is used to accelerate calculations and to detect the positive correspondences by fully exploiting the local associations between objects. Analogously to a mutual nearest-neighbor matching process, this module can parse more local nonrepresentational features from images. The main contributions of this work are threefold:

  • We propose to construct a weakly supervised object-aware convolutional matching network architecture, while being trainable end-to-end without the requirement for manual annotations.

  • We develop an object-aware matching mechanism. A simple and effective foreground selection strategy is incorporated into a semantic perception loss to enable weakly-supervised learning. This enforces the nearest-neighbor searching process in the main semantic regions, reduces the computational load, and enhances the capability of extracting the salient features.

  • Extensive experiments thoroughly validate the effectiveness of the proposed approach on multiple public standard benchmark datasets, where it also outperforms state-of-the-art methods for semantic feature matching.
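The cycle consistency weighting of the initial correlation maps mentioned above can be sketched in Python as follows. This sketch uses the soft mutual nearest-neighbor weighting of NCNet [18] as a stand-in; whether the proposed constraint takes exactly this form is an assumption, as are the function names and the toy values. Negative correlations are clipped to zero so that the ratios stay in [0, 1].

```python
import numpy as np


def soft_mutual_weighting(corr, eps=1e-8):
    """Down-weight correlations that are not mutually strong.

    corr : (Na, Nb) correlation matrix between positions of A and B.
    Each score is multiplied by its ratio to the best score in its row
    and to the best score in its column.
    """
    corr = np.maximum(np.asarray(corr, dtype=float), 0.0)
    row_max = corr.max(axis=1, keepdims=True)  # best match of each A position
    col_max = corr.max(axis=0, keepdims=True)  # best match of each B position
    return corr * (corr / (row_max + eps)) * (corr / (col_max + eps))


def mutual_nearest_neighbors(corr):
    """Hard version: keep (i, j) only if i and j pick each other."""
    corr = np.asarray(corr, dtype=float)
    best_b = corr.argmax(axis=1)  # A -> B
    best_a = corr.argmax(axis=0)  # B -> A
    idx_a = np.arange(corr.shape[0])
    keep = best_a[best_b] == idx_a  # the cycle A -> B -> A returns to start
    return np.stack([idx_a[keep], best_b[keep]], axis=1)


# Toy example (hypothetical values).
corr = np.array([[0.9, 0.1, 0.2],
                 [0.8, 0.3, 0.1],
                 [0.1, 0.2, 0.7]])
weighted = soft_mutual_weighting(corr)
matches = mutual_nearest_neighbors(corr)
print(matches.tolist())  # [[0, 0], [2, 2]]
```

In the toy example, position 1 of image A also prefers position 0 of image B, but that position prefers position 0 of image A, so the pair fails the cycle check and its soft weight drops below its raw score. This is the sense in which the constraint encourages one-to-one matching.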

Section snippets

Related work

Semantic feature matching has gained rising attention in the last several years. Recent works are concerned with learning-based matching, and continue to make new advances.

Proposed approach

In this section, we describe the proposed framework for semantic feature matching in detail. As shown in Fig. 2, given a pair of images depicting main objects of the same semantic category, a pre-trained CNN model is first used for feature extraction. Then we introduce a color space homogenization method to perform an attribute transfer process, and a correlation layer is adopted to produce the initial tentative correspondence maps across images. Furthermore, a cycle consistency constraint is

Implementation and evaluation

In this section, we evaluate the performance of the proposed approach on several publicly available benchmark datasets for semantic feature matching. The implementation details, results, analyses, and comparisons to state-of-the-art methods are also provided in detail.

Applications

Having established a sparse set of correspondences between a pair of images depicting the main objects of the same semantic category, we can generalize them to guide the alignment of two overlapping images containing the same object, as well as to estimate a dense correspondence field between the two images. In fact, high-level semantics are more robust than low-level visual features for matching. This can facilitate a variety of graphics applications, one of which is discussed below.
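As a simple illustration of how a sparse set of correspondences can guide alignment, the Python sketch below fits a 2-D affine transformation to matched points by least squares. The affine model, the function name `fit_affine`, and the toy points are assumptions for illustration only; the alignment model used in the paper is not specified in this excerpt.

```python
import numpy as np


def fit_affine(src, dst):
    """Least-squares 2-D affine transform mapping src points onto dst.

    src, dst : (N, 2) arrays of matched (x, y) coordinates, N >= 3.
    Returns a (2, 3) matrix [A | t] such that dst ~= src @ A.T + t.
    """
    src = np.asarray(src, dtype=float)
    dst = np.asarray(dst, dtype=float)
    if src.shape[0] < 3 or src.shape != dst.shape:
        raise ValueError("need at least three matched 2-D point pairs")
    # Homogeneous source coordinates: one row [x, y, 1] per match.
    design = np.hstack([src, np.ones((src.shape[0], 1))])
    params, *_ = np.linalg.lstsq(design, dst, rcond=None)  # shape (3, 2)
    return params.T


# Toy check (hypothetical values): recover a known transform from 4 matches.
true_affine = np.array([[1.1, 0.2, 5.0],
                        [-0.1, 0.9, -3.0]])
src = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [2.0, 3.0]])
dst = src @ true_affine[:, :2].T + true_affine[:, 2]
estimated = fit_affine(src, dst)
print(np.allclose(estimated, true_affine))  # True
```

With noisy or partly wrong matches, a robust estimator would normally wrap such a fit; the plain least-squares version is shown only to make the idea concrete.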

Conclusions

We have developed a semantic feature matching network framework that is trainable end-to-end without the requirement for annotations. Our approach is based on an object-aware convolutional neural network architecture. The framework is simple and effective, and achieves superior performance. Experiments have clearly shown that our approach outperforms most state-of-the-art methods for semantic feature matching on several standard benchmark datasets. Meanwhile, extensive experiments

CRediT authorship contribution statement

Wei Lyu: Writing - original draft, Conceptualization, Writing - review & editing, Investigation, Methodology. Lang Chen: Software, Validation, Conceptualization. Zhong Zhou: Conceptualization, Supervision, Funding acquisition. Wei Wu: Conceptualization, Supervision, Funding acquisition.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgment

This work was supported in part by the National Key R&D Program of China (Grant No. 2018YFB2100601), and in part by the National Natural Science Foundation of China (Grant No. 61872023).

References (55)

  • T. Yu et al.

    Multi-view harmonized bilinear network for 3D object recognition

  • J. Si et al.

    Dual attention matching network for context-aware feature sequence based person re-identification

  • C. Liu et al.

    Sift flow: dense correspondence across scenes and its applications

    IEEE Transactions on Pattern Analysis and Machine Intelligence

    (2011)
  • D.G. Lowe

    Distinctive image features from scale-invariant keypoints

    International Journal of Computer Vision

    (2004)
  • N. Dalal et al.

    Histograms of oriented gradients for human detection

  • P. Fischer, A. Dosovitskiy, T. Brox, Descriptor matching with convolutional neural networks: a comparison to sift,...
  • J. Long et al.

    Do convnets learn correspondence?

  • N. Ufer et al.

    Deep semantic feature matching

  • K. Aberman et al.

    Neural best-buddies: sparse cross-domain correspondence

    ACM Transactions on Graphics (SIGGRAPH Proceedings)

    (2018)
  • J. Liao et al.

    Visual attribute transfer through deep image analogy

    ACM Transactions on Graphics

    (2017)
  • C.B. Choy et al.

    Universal correspondence network

  • I. Rocco et al.

    NCNet: neighbourhood consensus networks for estimating image correspondences

    IEEE Transactions on Pattern Analysis and Machine Intelligence

    (2020)
  • I. Rocco et al.

    End-to-end weakly-supervised semantic alignment

  • I. Rocco et al.

    Convolutional neural network architecture for geometric matching

    IEEE Transactions on Pattern Analysis and Machine Intelligence

    (2018)
  • J. Kim et al.

    Deformable spatial pyramid matching for fast dense correspondence

  • B. Ham et al.

    Proposal flow: semantic correspondences from object proposals

    IEEE Transactions on Pattern Analysis and Machine Intelligence

    (2018)
  • F. Yang et al.

    Object-aware dense semantic correspondence

Wei Lyu received the B.S. degree in computer science from Sichuan Agricultural University, Yaan, China, in 2011, and the M.E. degree in computer science from Guizhou University, Guiyang, China, in 2015. He is currently pursuing the Ph.D. degree with the State Key Laboratory of Virtual Reality Technology and Systems, Beihang University, Beijing, China. His research interests include semantic matching, semantic segmentation, geometric modeling, and virtual reality.

Lang Chen received the B.S. degree in computer science from Beijing Institute of Technology, Beijing, China, in 2017. He is currently pursuing the M.S. degree with the State Key Laboratory of Virtual Reality Technology and Systems, Beihang University, Beijing, China. His main research interest is computer vision and image processing, including image instance matching and image semantic matching.

Zhong Zhou received the B.S. degree from Nanjing University, Nanjing, China, in 1999, and the Ph.D. degree from Beihang University, Beijing, China, in 2005. He is currently a professor and Ph.D. adviser with the State Key Lab of Virtual Reality Technology and Systems, Beihang University, Beijing, China. His main research interests include Virtual Reality/Augmented Reality/Mixed Reality, Computer Vision and Artificial Intelligence.

Wei Wu received the PhD degree from Harbin Institute of Technology, Harbin, China, in 1995. He is currently a professor with the State Key Laboratory of Virtual Reality Technology and Systems with Beihang University. He is chair of the Technical Committee on Virtual Reality and Visualization, China Computer Federation. His current research interests include virtual reality, wireless networking, and distributed interactive system.
