Geometric-aware dense matching network for 6D pose estimation of objects from RGB-D images
Introduction
Pose estimation of target objects with visual sensors is a fundamental task in computer vision. Its application fields include autonomous driving [46], target assembly [29], robotic grasping [5] and augmented reality [19] for digital twins [33]. It has proven to be a challenging problem due to complex working conditions, e.g., varying illumination, occlusion of targets, light reflection from specular surfaces and sensor noise.
Traditional methods tackle pose estimation as a feature-matching problem. Various local descriptors, such as SIFT [14], ORB [27], point-pair features (PPFs) [6], histograms of color gradients [28] and normal vectors of local surfaces [12], are manually designed to extract local invariants from the visual information. Such descriptors are applied to both the 3D model of the object and the observed visual information, and the feature pairs with the highest similarities are chosen as correspondences. The relative transformation from the 3D model to the observation is then calculated through PnP [17] for a 2D image or through least-squares fitting for a 3D point cloud.
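To make the "highest similarity wins" step concrete, the following is a minimal brute-force descriptor matcher with Lowe's ratio test, a standard filter used with SIFT-style descriptors. The descriptor dimensionality and the 0.8 ratio threshold are illustrative choices, not values taken from this paper:

```python
import numpy as np

def match_descriptors(desc_model, desc_scene, ratio=0.8):
    """Brute-force matching with Lowe's ratio test.

    desc_model: (N, D) descriptors extracted from the 3D model.
    desc_scene: (M, D) descriptors extracted from the observation.
    Returns a list of (model_idx, scene_idx) correspondences.
    """
    # Pairwise Euclidean distances between every descriptor pair.
    d = np.linalg.norm(desc_model[:, None, :] - desc_scene[None, :, :], axis=-1)
    matches = []
    for i in range(d.shape[0]):
        order = np.argsort(d[i])
        best, second = order[0], order[1]
        # Accept only if the best match clearly beats the runner-up.
        if d[i, best] < ratio * d[i, second]:
            matches.append((i, best))
    return matches
```

The ratio test rejects ambiguous matches, which is exactly where hand-crafted descriptors fail on textureless surfaces: many nearly identical descriptors make the best and second-best distances indistinguishable.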
With the rapid development of deep learning technologies in recent years, many researchers have approached 6D pose estimation with deep neural networks (DNNs). The direct solution is to exploit the powerful generalization ability of DNNs to regress the rotation and translation from the given visual information [35], [42]. However, this kind of method copes poorly with occlusion: regression seeks a mapping from visual cues to object pose, but occlusion patterns are so varied that the networks fail to generalize to all circumstances.
Because the object pose alone provides insufficient constraint for deep neural networks with such a large number of parameters, some researchers have proposed dense supervision signals for the 6D pose estimation task. These methods regress dense features (directions to keypoints, 3D coordinates or UV maps) to add more constraints to the neural network, but they do not make full use of the information in the 3D model of the objects.
It can be concluded that the development of deep learning-based 6D pose estimation consists in adding more reasonable and dense geometric constraints to DNNs to fit the precise pose. Inspired by the conventional idea of rotation- and translation-invariant local descriptors, such as SIFT [14] and ORB [27], we leverage the excellent feature extraction ability of DNNs to embed each point in the point cloud into a high-dimensional deep descriptor. Traditional manually designed descriptors exploit neighboring color-gradient variation to form high-dimensional descriptors for every pixel. Due to this inevitable dependency on image gradients, they perform poorly on textureless objects, whose color gradients are nearly uniform. DNNs can not only implicitly extract features from the neighborhood of each input pixel or 3D point but also learn the geometric structure of the given object. Thus, the learned features are more robust descriptors than hand-crafted ones and have the potential to identify textureless targets. As shown in Fig. 1, compared to dense regression methods that must predict the positions of unseen points, the dense matching strategy only gathers information from the visible part, which leads to more stable performance under occlusion, light variation and reflective conditions.
In this paper, we extend the method [41] that learns pixelwise deep 2D-3D correspondences to 3D-3D dense matching. Geometric constraints are further considered in addition to the pointwise matching scheme. The main contributions of this paper are summarized as follows:
1. A neighbor-constrained triplet loss is proposed to reduce the correspondence ambiguity caused by ground-truth pose errors. The proposed loss function improves the stability of correspondence matching and reduces dense matching errors.
2. Geometric consistency is considered for symmetric objects to prevent incorrect correspondence matching. The experimental results show that geometric consistency helps to learn a better implicit neural representation of the symmetric 3D model.
3. We show the state-of-the-art 6D pose estimation performance of our proposed method compared to the baselines on the YCB-V [42] and Occlusion LineMOD [12] datasets.
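One reading of contribution 1 can be sketched as a triplet loss in which model vertices spatially close to the ground-truth match are excluded from the negative pool, so that a slightly wrong ground-truth pose does not push apart features that should agree. The exact formulation in the paper may differ; the margin, radius and hinge form below are illustrative assumptions:

```python
import numpy as np

def neighbor_triplet_loss(f_scene, f_model, gt_idx, model_pts,
                          margin=0.2, radius=0.05):
    """Sketch of a neighbor-tolerant triplet loss (illustrative form).

    f_scene:   (N, D) features of observed points (anchors).
    f_model:   (V, D) features of model vertices.
    gt_idx:    (N,) index of the ground-truth matching vertex per anchor.
    model_pts: (V, 3) vertex coordinates, used to find spatial neighbors.
    Vertices within `radius` of the ground-truth vertex are treated as
    acceptable matches, so noisy pose labels do not create false negatives.
    """
    total = 0.0
    for a in range(f_scene.shape[0]):
        pos_v = gt_idx[a]
        d_pos = np.linalg.norm(f_scene[a] - f_model[pos_v])
        # Mask out the ground-truth vertex and its spatial neighbors.
        near = np.linalg.norm(model_pts - model_pts[pos_v], axis=1) <= radius
        d_all = np.linalg.norm(f_model - f_scene[a], axis=1)
        d_neg = d_all[~near].min()  # hardest remaining negative
        total += max(0.0, d_pos - d_neg + margin)
    return total / f_scene.shape[0]
```

With a perfect feature for the anchor, the hinge is inactive; when the anchor feature collapses onto a distant vertex, the loss grows by the feature gap plus the margin.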
Related works
We briefly introduce the deep learning-based 6D pose estimation methods in this section.
Methodology
Given an RGB-D image I and a set of 3D object models {M_k}, our goal is to estimate the 6D pose P = [R | t] w.r.t. the camera coordinate frame for each object present in image I, where R and t are the 3D rotation and translation, respectively. To calculate P, the core step is to find enough correspondences that fulfill p = Rv + t, where p is a point in the scene point cloud that belongs to the object and v is a vertex of its model M. Then, P can be obtained through least-squares fitting.
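Given matched 3D-3D pairs, the least-squares fit has the well-known closed-form SVD solution of Arun et al. (1987), cited in the references. A minimal sketch:

```python
import numpy as np

def fit_rigid_transform(src, dst):
    """Closed-form least-squares rigid fit (Arun et al., 1987).

    Finds R, t minimizing sum ||R @ src_i + t - dst_i||^2 over
    matched point pairs. src, dst: (N, 3) corresponding points.
    """
    c_src, c_dst = src.mean(axis=0), dst.mean(axis=0)
    # SVD of the cross-covariance of the centred point sets.
    H = (src - c_src).T @ (dst - c_dst)
    U, _, Vt = np.linalg.svd(H)
    R = Vt.T @ U.T
    # Guard against a reflection (det = -1) in degenerate cases.
    if np.linalg.det(R) < 0:
        Vt[-1] *= -1
        R = Vt.T @ U.T
    t = c_dst - R @ c_src
    return R, t
```

In practice this solver is usually wrapped in an outlier-rejection loop (e.g., RANSAC over the predicted correspondences), since a few wrong matches can bias the least-squares estimate.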
Implementation details
Datasets. We conduct our experiments on two datasets: LineMod Occlusion (LM-O) [2] and YCB-Video (YCB-V) [42]. For both datasets, we use both real and synthetic data for training and testing, as in He et al. [10].
Object detector. We use the same detection model as CosyPose [16], a Mask R-CNN with an FPN and ResNet50 backbone, to detect the objects in RGB images. Each detected object is cropped and resized to a fixed input resolution. A total of 4096 points are randomly selected from the image with a
Limitations and future work
Although we achieve state-of-the-art performance on the two datasets, the main gains come from nonsymmetric objects. Despite the improvement brought by the proposed geometric consistency, performance on symmetric objects still lags behind some competitive methods. Further study of the representation of symmetric objects is an essential direction for our method.
Conclusion
In this paper, we propose a geometric-aware dense matching network for 6D pose estimation of objects. The neighbor-constrained triplet loss, designed to eliminate the matching error caused by noisy ground-truth labels, improves the matching accuracy of the network. Geometric consistency for symmetric objects is shown to be an effective choice for explicitly learning the deep features of points and vertices. Experiments on the YCB-V and LM-O datasets demonstrate the superiority and effectiveness of the proposed method.
Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Chenrui Wu received his B.Sc. and Ph.D. degrees in mechanical engineering from Zhejiang University, China, in 2012 and 2019, respectively. He is currently a lecturer in the School of Mechanical Engineering, University of Shanghai for Science and Technology, China. His research interests include object pose estimation, robot visual servoing and intelligent manufacturing.
References (46)
- Active garment recognition and target grasping point detection using deep learning, Pattern Recognit., 2018.
- S. Peng, Y. Liu, Q. Huang, X. Zhou, H. Bao, PVNet: pixel-wise voting network for 6DoF pose estimation, 2019.
- Cyclical learning rates for training neural networks, 2017 IEEE Winter Conference on Applications of Computer Vision (WACV), 2017.
- Pseudo-siamese graph matching network for textureless objects' 6-D pose estimation, IEEE Trans. Ind. Electron., 2022.
- Y. Xiang, T. Schmidt, V. Narayanan, D. Fox, PoseCNN: a convolutional neural network for 6D object pose estimation in...
- Least-squares fitting of two 3-D point sets, IEEE Trans. Pattern Anal. Mach. Intell., 1987.
- Learning 6D object pose estimation using 3D object coordinates, European Conference on Computer Vision, 2014.
- Deep global registration, Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020.
- Fully convolutional geometric features, Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019.
- Model globally, match locally: efficient and robust 3D object recognition, 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2010.
- SplineCNN: fast geometric deep learning with continuous B-spline kernels, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
- SurfEmb: dense and continuous correspondence distributions for object pose estimation with learnt surface embeddings, Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
- FFB6D: a full flow bidirectional fusion network for 6D pose estimation, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2021.
- PVN3D: a deep point-wise 3D keypoints voting network for 6DoF pose estimation, Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
- Multimodal templates for real-time detection of texture-less objects in heavily cluttered scenes, 2011 International Conference on Computer Vision, 2011.
- Neural correspondence field for object pose estimation, European Conference on Computer Vision.
- PCA-SIFT: a more distinctive representation for local image descriptors, Proceedings of the 2004 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2004), 2004.
- PoseNet: a convolutional network for real-time 6-DoF camera relocalization, IEEE International Conference on Computer Vision (ICCV).
- EPnP: an accurate solution to the PnP problem, Int. J. Comput. Vis.
- DCL-Net: deep correspondence learning network for 6D pose estimation.
- An AR-assisted deep learning-based approach for automatic inspection of aviation connectors, IEEE Trans. Ind. Inf.
Long Chen received a B.Sc. degree in mechanical engineering from Wuhan University of Technology, China, in 2002, and a Ph.D. degree in mechanical engineering from Zhejiang University, China, in 2008. He is currently a professor in the School of Mechanical Engineering, University of Shanghai for Science and Technology, China. His research interests include product intelligent computing design, machine vision and robotics.
Shenlong Wang received his B.Sc. and Ph.D. degrees in engineering mechanics from Wuhan University of Technology and Zhejiang University, China, in 2010 and 2015, respectively. He is currently an associate professor in the School of Mechanical Engineering, University of Shanghai for Science and Technology, China. His research interests include stochastic dynamics and control, robot visual perception and flexible operation, and soft robotics.
Han Yang received a B.Sc. degree in mechanical engineering from the Applied Technology College of Soochow University, China, in 2020. He is currently a master's degree candidate in the School of Mechanical Engineering, University of Shanghai for Science and Technology, China. His research interests include object pose estimation and action recognition of the human body.
Junjie Jiang received a B.Sc. degree in process equipment and control engineering from Nanjing Tech University, China, in 2017. He is currently a Ph.D. candidate in the Department of Mechanical Engineering, Zhejiang University, China. His research interests include scene analysis, pose estimation and image understanding.
1. The source code will soon be available at https://github.com/Ray0089/geometric-aware-dense-matching.