Geometric-aware dense matching network for 6D pose estimation of objects from RGB-D images

https://doi.org/10.1016/j.patcog.2022.109293

Highlights

  • We propose a geometric-aware dense matching network for 6D pose estimation that is robust to occlusion and truncation. The method leverages the information of the 3D model to match the visible points from the RGB-D image.

  • We design a neighbor-constraint triplet loss for dense correspondence matching that improves the matching stability and reduces the correspondence matching error.

  • We propose to explicitly consider the symmetric property of the 3D model in the training stage. We experimentally show that the distribution of the vertex features is more reasonable with symmetric consistency.

Abstract

6D pose estimation for certain targets from RGB-D images is a fundamental problem in computer vision. Current methods emphasize learning an overall expression of the targets, which leads to poor performance under occlusion and truncation. In this paper, we propose a geometric-aware dense matching network that obtains visible dense correspondences between an RGB-D image and a 3D model, thereby avoiding difficult predictions for unseen keypoints. Two geometric structures are considered for dense matching. (1) The neighbor area of each correspondence is treated as a set of suboptimal matches in addition to the correspondence itself, which reduces the influence of errors caused by ground-truth calibration. (2) The distance consistency of the correspondences is leveraged to eliminate the ambiguity of symmetric objects. Experiments on the LM-O dataset (77.1% ADD(S)-0.1d) and the YCB-V dataset (97.6% ADD(S)) show the effectiveness and advantages of our proposed method.

Introduction

Pose estimation for certain targets with visual sensors is a fundamental task in computer vision. The application fields include automatic driving [46], target assembling [29], robotic grasping [5] and augmented reality [19] for digital twins [33]. It has been proven to be a challenging problem due to the complex working conditions, e.g., varying lights, occlusion of targets, light reflection of special surfaces and sensor noise.

Traditional methods tackle the pose estimation problem as a feature matching problem. Different kinds of local descriptors such as SIFT [14], ORB [27], point-pair features (PPFs) [6], histograms of color gradients [28] and norm vectors of local surfaces [12] are manually designed to extract local invariants from the visual information. Then, such descriptors are used to extract features from both the 3D model of the object and the visual information. Features with the highest similarities between the 3D model and the visual information are chosen as the correspondences. The relative transformation from the 3D model to the visual information is finally calculated through PnP [17] for the 2D image or through least square fitting for the 3D point cloud.
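As a minimal sketch of the matching step described above (an illustration, not the implementation of any particular cited method), nearest-neighbor matching of descriptors with Lowe's ratio test can be written as:

```python
import numpy as np

def match_descriptors(desc_model, desc_image, ratio=0.8):
    """Match each model descriptor to its nearest image descriptor.

    A candidate is kept only if its distance is clearly smaller than
    the second-best distance (Lowe's ratio test), which rejects
    ambiguous matches.
    """
    # pairwise Euclidean distances, shape (n_model, n_image)
    d = np.linalg.norm(desc_model[:, None, :] - desc_image[None, :, :], axis=-1)
    matches = []
    for i in range(d.shape[0]):
        order = np.argsort(d[i])
        best, second = order[0], order[1]
        if d[i, best] < ratio * d[i, second]:
            matches.append((i, best))
    return matches
```

The surviving pairs are then fed to PnP (for 2D images) or to least-squares rigid fitting (for 3D point clouds) to recover the relative transformation.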

With the rapid development of deep learning technologies in recent years, many researchers have approached the problem of 6D pose estimation through deep neural networks (DNNs). The direct solution is to fully utilize the powerful generalization ability of DNNs to regress the rotation and translation from the given visual information [35], [42]. However, this kind of method cannot deal with occlusion. Regression tends to find a mapping between the visual clues and object pose, but the occlusion states are so complex that the networks fail to generalize all the circumstances.

Because the constraint of the object pose is insufficient for deep neural networks with such a large number of parameters, some researchers have proposed training dense supervised signals for the 6D pose estimation task. This kind of method tries to regress some kinds of dense features (direction of the key points, xyz coordinates in 3D space or UV maps) to add more constraints to the neural network, but they do not make full use of the information for the 3D model of the objects.

It can be concluded that the development of 6D pose estimation technology in deep learning adds more reasonable and dense geometric constraints to DNNs to fit the precise pose. Inspired by the conventional idea of local descriptors that are rotation- and translation-invariant, such as SIFT [14] and ORB [27], we leverage the excellent feature extraction ability of DNNs to embed each point in the point cloud into a high-dimensional deep descriptor. Traditional manually designed descriptors take advantage of the neighbor color gradient variation to format high-dimensional descriptors for every pixel. Due to the inevitable dependency on image gradients, these descriptors perform poorly on textureless objects whose color gradients are usually invariable. DNNs can not only implicitly extract the features from the neighborhood of each input pixel or 3D point but also learn the geometric structures of the given object. Thus, the descriptive ability of the learned features will be more robust than that of hand-crafted descriptors and have the potential to identify textureless targets. As shown in Fig. 1, compared to the dense regression methods that need to predict unseen point positions, the dense matching strategy only gathers information from the present part, which leads to more stable performance under occlusion, light variation and reflective conditions.

In this paper, we extend the method of [41], which learns pixelwise deep 2D-3D correspondences, to 3D-3D dense matching. Geometric constraints are further considered in addition to the pointwise matching scheme. The main contributions of this paper are summarized as follows:

  • 1.

    A neighbor-constraint triplet loss is proposed to reduce the ambiguity of correspondences caused by ground truth pose errors. The proposed loss function improves the stability of correspondence matching and reduces dense matching errors.

  • 2.

    Geometric consistency is enforced for symmetric objects to prevent incorrect correspondence matching. The experimental results show that geometric consistency helps learn a better implicit neural representation of a symmetric 3D model.

  • 3.

    We show the state-of-the-art 6D pose estimation performance of our proposed method compared to the performances of the baselines on the YCB-V [42] and Occlusion LineMOD [12] datasets.
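The exact formulation of the neighbor-constraint triplet loss is not given in this overview; a hypothetical sketch, in which neighboring vertices are treated as suboptimal positives with a reduced margin, might look like:

```python
import numpy as np

def neighbor_triplet_loss(anchor, positive, neighbors, negative,
                          margin=1.0, neighbor_margin=0.5):
    """Hypothetical neighbor-constrained triplet loss (illustrative only).

    The exact correspondence is pulled toward the anchor with the full
    margin; vertices in its neighborhood act as suboptimal positives
    with a smaller margin, so a match to a nearby vertex is penalized
    less than a match to a distant one.
    """
    def d(a, b):
        return np.linalg.norm(a - b)
    # standard triplet term on the exact correspondence
    loss = max(0.0, d(anchor, positive) - d(anchor, negative) + margin)
    # relaxed terms for the neighborhood: smaller margin
    for nb in neighbors:
        loss += max(0.0, d(anchor, nb) - d(anchor, negative) + neighbor_margin)
    return loss
```

Under this formulation, small calibration errors in the ground-truth pose shift the "true" correspondence only within the tolerated neighborhood, which is the intuition behind the improved matching stability.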

Section snippets

Related works

We briefly introduce the deep learning-based 6D pose estimation methods in this section.

Methodology

Given an RGB-D image I and a set of N 3D object models M = {M_i | i = 1, …, N}, our goal is to estimate the 6D pose P = [R | t] w.r.t. the camera coordinate frame for each object present in image I. R and t are the 3D rotation and translation, respectively. To calculate P, the core step is to find enough correspondences C = {(p_i, v_j) | i ∈ {1, …, n}, j ∈ {1, …, m}} that fulfill p_i = P v_j, where p_i is a point in I that belongs to the object and v_j is a vertex of M_i. Then, P can be obtained through least square fitting
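Given the correspondences C, the least-squares fitting step can be sketched with the standard Kabsch algorithm (an assumption; the snippet does not name the exact solver used):

```python
import numpy as np

def fit_pose(model_pts, cam_pts):
    """Least-squares rigid fit (Kabsch algorithm).

    Finds R, t minimizing ||R @ m_i + t - c_i||^2 over corresponding
    model points m_i and camera-frame points c_i (at least 3
    non-degenerate pairs are required).
    """
    mu_m = model_pts.mean(axis=0)
    mu_c = cam_pts.mean(axis=0)
    # cross-covariance of the centered point sets
    H = (model_pts - mu_m).T @ (cam_pts - mu_c)
    U, _, Vt = np.linalg.svd(H)
    # reflection correction keeps R a proper rotation (det R = +1)
    S = np.diag([1.0, 1.0, np.sign(np.linalg.det(Vt.T @ U.T))])
    R = Vt.T @ S @ U.T
    t = mu_c - R @ mu_m
    return R, t
```

In practice the fit is usually wrapped in RANSAC so that a handful of wrong correspondences cannot corrupt the estimated pose.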

Implementation details

Datasets We conduct our experiments on two datasets: LineMod Occlusion (LM-O) [2] and YCB-Video (YCB-V) [42]. For both datasets, we use both real and synthetic data for training and testing as in He et al. [10].

Object detector We use the same detection model as CosyPose [16], a Mask R-CNN with an FPN and a ResNet-50 backbone, to detect the objects in RGB images. Each detected object is cropped and resized to 256×256. A total of 4096 points are randomly selected from the image with a

Limitations and future work

Although we achieve state-of-the-art performance on the two datasets, the main advantage comes from nonsymmetric objects. Even with the proposed geometric consistency, the performance on symmetric objects is still inferior to that of some competitive methods. Further study of the representation of symmetric objects is therefore essential for our method.

Conclusion

In this paper, we propose a geometric-aware dense matching network for 6D pose estimation of objects. The neighbor-constrained triplet loss, designed to eliminate the matching error caused by noisy ground truth labels, improves the matching accuracy of the network. The geometric consistency for symmetric objects is shown to be an effective choice for explicitly learning the deep features of the points and vertices. Experiments on the YCB-V and LM-O datasets demonstrate the superiority and

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Chenrui Wu received his B.Sc. and Ph.D. degrees in mechanical engineering from Zhejiang University, China, in 2012 and 2019, respectively. He is currently a lecturer in the School of Mechanical Engineering, University of Shanghai for Science and Technology, China. His research interests include object pose estimation, robot visual servoing and intelligent manufacturing.

References (46)

  • M. Fey et al.

    SplineCNN: fast geometric deep learning with continuous B-spline kernels

    Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition

    (2018)
  • R.L. Haugaard et al.

    SurfEmb: dense and continuous correspondence distributions for object pose estimation with learnt surface embeddings

    Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

    (2022)
  • Y. He et al.

    FFB6D: a full flow bidirectional fusion network for 6D pose estimation

    2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)

    (2021)
  • Y. He et al.

    Pvn3d: a deep point-wise 3D keypoints voting network for 6DoF pose estimation

    Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

    (2020)
  • S. Hinterstoisser et al.

    Multimodal templates for real-time detection of texture-less objects in heavily cluttered scenes

    2011 International Conference on Computer Vision

    (2011)
  • L. Huang et al.

    Neural correspondence field for object pose estimation

    European Conference on Computer Vision

    (2022)
  • Y. Ke et al.

    PCA-SIFT: a more distinctive representation for local image descriptors

    Proceedings of the 2004 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2004. CVPR 2004.

    (2004)
  • A. Kendall et al.

    PoseNet: a convolutional network for real-time 6-DoF camera relocalization

    IEEE International Conference on Computer Vision (ICCV)

    (2015)
  • Y. Labbé, J. Carpentier, M. Aubry, J. Sivic, CosyPose: consistent multi-view multi-object 6D pose estimation,...
  • V. Lepetit et al.

    EPnP: an accurate O(n) solution to the PnP problem

    Int. J. Comput. Vis.

    (2009)
  • H. Li et al.

    DCL-Net: deep correspondence learning network for 6D pose estimation

  • S. Li et al.

    An AR-assisted deep learning-based approach for automatic inspection of aviation connectors

    IEEE Trans. Ind. Inf.

    (2020)
    Long Chen received a B.Sc. degree in mechanical engineering from Wuhan University of Technology, China, in 2002. He received a Ph.D. degree in mechanical engineering from Zhejiang University, China, in 2008. He is currently a professor in the School of Mechanical Engineering, University of Shanghai for Science and Technology, China. His research interests include product intelligent computing design, machine vision and robotics.

    Shenlong Wang received his B.Sc. and Ph.D. degrees in Engineering Mechanics from Wuhan University of Technology and Zhejiang University, China, in 2010 and 2015, respectively. He is currently an associate professor in the School of Mechanical Engineering, University of Shanghai for Science and Technology, China. His research interests include stochastic dynamics and control, robot visual perception and flexible operation, and soft robotics.

    Han Yang received a B.Sc. degree in mechanical engineering from the Applied Technology College of Soochow University, China, in 2020. He is currently a master's degree candidate in the School of Mechanical Engineering, University of Shanghai for Science and Technology, China. His research interests include object pose estimation and action recognition of the human body.

    Junjie Jiang received the B.Sc. degree in process equipment and control engineering from Nanjing Tech University, China, in 2017. He is currently a Ph.D. candidate in the Department of Mechanical Engineering, Zhejiang University, China. His research interests include scene analysis, pose estimation and image understanding.
