Neurocomputing

Volume 530, 14 April 2023, Pages 11-22

KVNet: An iterative 3D keypoints voting network for real-time 6-DoF object pose estimation

https://doi.org/10.1016/j.neucom.2023.01.036

Abstract

Accurate and efficient object pose estimation is an indispensable part of virtual/augmented reality (VR/AR) and many other applications. While previous works focus on directly regressing the 6D pose from RGB and depth images, and thus suffer from the non-linearity of the rotation space, we propose an iterative 3D keypoints voting network, named KVNet. Specifically, our method decouples the pose into separate translation and rotation branches, both estimated by a Hough voting scheme. By treating the uncertainty of a keypoint vote as a Lipschitz-continuous function of the seed points’ fused embedding features, our method adaptively selects the optimal keypoint votes. In this way, we argue that KVNet bridges the gap between the non-linear rotation space and the linear Euclidean space, which introduces an inductive bias that helps our network learn the intrinsic pattern and infer the 6D pose from RGB and depth images. Furthermore, our model refines the initial keypoint localization in an iterative fashion. Experiments on three challenging benchmark datasets (LineMOD, YCB-Video and Occlusion LineMOD) show that our method exhibits excellent performance.

Introduction

This paper focuses on 6-DoF (degrees of freedom) pose estimation, i.e. estimating the 3D position and orientation of objects in a canonical frame. Many real-world applications require accurate, efficient and robust pose estimation of specific objects in 3D space, such as virtual and augmented reality (VR/AR) [1], [2], [3], autonomous driving [4], robotic manipulation [5], [6] and so on. However, this problem has proven quite challenging due to sensor noise, occlusion, variations in shape and texture, and the demand for low latency.

Traditionally, most existing methods [7], [8], [9], [10] rely on template matching, establishing correspondences between the object point clouds (or images) and the object’s 3D mesh model. The statistical properties encoded in these handcrafted features are not robust enough, especially when the point clouds or pixels exhibit significant changes in occlusion, illumination, noise corruption and so on. Recent work has therefore focused on data-driven methods that feed RGB or RGB-D images to a deep network for 6-DoF pose estimation. Given a single RGB image, PVNet [11] and other RGB-based methods [12], [13], [14], [15] recover the pose of each object instance with a two-stage pipeline: 2D keypoint localization in image coordinates followed by a standard or improved perspective-n-point (PnP) algorithm. However, it is not trivial to accurately predict 2D keypoint positions in low-resolution, poorly illuminated or overlapping inputs, which incurs a noisy set of 2D-to-3D correspondences. Moreover, the projection from 3D space to the 2D image plane may partly lose the object’s intrinsic geometric pattern, which is essential for 6D pose estimation. To overcome these limitations, the advent of cheap RGB-D cameras enables research that infers the 6D pose of objects directly in 3D space. Assisted by additional depth information, a popular approach is to regress the 3D rotation and translation with a deep network. But these holistic methods [16], [17], [18] lag behind in estimating rotation due to the non-linearity of the rotation space. Alternatively, as robust and discriminative point sets, 3D keypoints can be treated as a compact and abstract representation and used to establish 3D-3D correspondences for 6D pose estimation.
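Concretely, once 3D keypoints have been localized in both the object (model) frame and the camera frame, the pose follows in closed form from the classical least-squares (Kabsch) alignment of the two point sets. The sketch below is a minimal illustration of that standard step, not code from the paper; the function name and array layout are our own.

```python
import numpy as np

def kabsch(model_kps: np.ndarray, cam_kps: np.ndarray):
    """Least-squares rigid transform (R, t) such that cam_kps ~= R @ model_kps + t.

    model_kps, cam_kps: (N, 3) arrays of corresponding 3D keypoints.
    """
    mu_m = model_kps.mean(axis=0)
    mu_c = cam_kps.mean(axis=0)
    # Cross-covariance of the centered point sets.
    H = (model_kps - mu_m).T @ (cam_kps - mu_c)
    U, _, Vt = np.linalg.svd(H)
    # Reflection guard: force det(R) = +1 so R is a proper rotation.
    d = np.sign(np.linalg.det(Vt.T @ U.T))
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = mu_c - R @ mu_m
    return R, t
```

The quality of the resulting pose hinges entirely on how accurately the 3D keypoints are localized, which is what motivates the voting scheme below.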

In this work, we propose a simple yet highly efficient 3D keypoint voting approach, named KVNet, for real-time 6D pose estimation of known rigid objects in an iterative fashion. We decouple the 6D pose in $\mathrm{SE}(3)$ into a translation $t \in \mathbb{R}^3$ and a rotation $R \in \mathrm{SO}(3)$, each predicted by regressing 3D keypoints in a separate network branch. Our method adopts DenseFusion [18] as the basic feature embedding learner, encoding texture appearance and geometric information at the per-pixel level. The deep network optimizes a set of criteria to learn pointwise 3D offsets and vote for 3D keypoints, which bridge the gap between the linear Euclidean space and the non-linear rotation space.
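To make the voting branch concrete, the following is a minimal PyTorch sketch, with hypothetical names and layer sizes, of the general scheme: each seed point’s fused embedding regresses a 3D offset to every keypoint, the offsets are added to the seed coordinates to form votes, and the votes are aggregated. It illustrates the idea rather than the paper’s exact architecture.

```python
import torch
import torch.nn as nn

class OffsetVotingHead(nn.Module):
    """Illustrative voting head: each seed regresses a 3D offset to every
    keypoint; votes are averaged into keypoint locations."""

    def __init__(self, feat_dim: int = 128, num_kps: int = 8):
        super().__init__()
        self.num_kps = num_kps
        self.mlp = nn.Sequential(
            nn.Linear(feat_dim, 128), nn.ReLU(),
            nn.Linear(128, num_kps * 3),
        )

    def forward(self, seed_xyz, seed_feat):
        # seed_xyz: (B, N, 3) point coordinates; seed_feat: (B, N, C) fused
        # RGB-D embeddings (e.g., from a DenseFusion-style encoder).
        B, N, _ = seed_xyz.shape
        offsets = self.mlp(seed_feat).view(B, N, self.num_kps, 3)
        votes = seed_xyz.unsqueeze(2) + offsets   # (B, N, K, 3) per-seed votes
        keypoints = votes.mean(dim=1)             # (B, K, 3) aggregated estimate
        return votes, keypoints
```

A plain mean is the simplest aggregation; a clustering step such as mean shift could replace it when the votes are multi-modal.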

One issue not yet considered in other works is the selection of the optimal keypoint vote. We argue that the confidence of the keypoint votes cast by point sets on the object surface, defined in our method as the uncertainty of the voting, varies with the spatial distribution of the point sets. Since the uncertainty is distributed across [0, 1], it can be treated as a Lipschitz-continuous function of the pointwise fused embedding. This continuous function can be approximated by a multi-layer perceptron (MLP), benefiting from its powerful mapping ability. By integrating the uncertainty into the criteria function, the deep network adaptively recovers the keypoint positions with the highest confidence in 3D space. To make progress towards our goals, we also design a post-processing step that refines the keypoint positions iteratively. Note that the refinement module is continuously differentiable and can thus be trained jointly with our main architecture.
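As an illustration of the uncertainty-driven selection, the sketch below (again with hypothetical names and sizes) attaches a small MLP with a sigmoid output that maps each seed’s fused embedding to a confidence in (0, 1) and uses it to weight the votes. Every operation is differentiable, so an iterative refinement loop wrapped around it can be trained end to end with the main network; this is a sketch of the general idea, not the paper’s implementation.

```python
import torch
import torch.nn as nn

class UncertaintyWeightedVoting(nn.Module):
    """Sketch of uncertainty-driven vote selection: an MLP maps each seed's
    fused embedding to a confidence in (0, 1); keypoints are the
    confidence-weighted mean of the seeds' votes."""

    def __init__(self, feat_dim: int = 128):
        super().__init__()
        self.conf_mlp = nn.Sequential(
            nn.Linear(feat_dim, 64), nn.ReLU(),
            nn.Linear(64, 1), nn.Sigmoid(),   # confidence in (0, 1)
        )

    def forward(self, votes, seed_feat):
        # votes: (B, N, K, 3) per-seed keypoint votes;
        # seed_feat: (B, N, C) per-seed fused embeddings.
        conf = self.conf_mlp(seed_feat)        # (B, N, 1)
        w = conf.unsqueeze(-1)                 # (B, N, 1, 1), broadcast over K
        # Weighted mean of votes; clamp avoids division by zero.
        keypoints = (w * votes).sum(dim=1) / w.sum(dim=1).clamp_min(1e-8)
        return keypoints, conf
```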

Experiments conducted on three widely used benchmark datasets (LineMOD [19], YCB-Video [20] and Occlusion LineMOD [46]) show that our method exhibits excellent performance while achieving real-time inference speed compared with other 6D pose estimation techniques based on RGB and RGB-D images.

In summary, the main contributions of this paper are threefold:

  • We propose a novel uncertainty-driven keypoint voting scheme to recover the 6D pose from RGB and depth images.

  • We introduce a keypoint refinement procedure that establishes the most precise 3D-3D correspondences at only a mild cost in inference speed.

  • We contribute in-depth experimental and mathematical analysis demonstrating the excellent performance of our method.

Since keypoint detection is an open and representative problem in 3D data processing and computing, we expect that our idea can transfer to many related domains, such as 3D object detection, 3D reconstruction and so on.

Section snippets

Holistic methods

Recently, some holistic methods [16], [17], [18], [36], [38], [40], [41], [47] directly regress the 3D rotation and translation parameters from RGB images, optionally with additional depth information, within deep learning frameworks. Wohlhart et al. [16] rely on the Euclidean distance to assess the similarity between descriptors trained by a Convolutional Neural Network (CNN), and then use scalable nearest-neighbor search to capture both the object identity and pose. PoseCNN [17] uses convolutional

Methods

Here, we introduce our proposed deep learning framework, KVNet, which consumes an RGB and a depth image as inputs and estimates the 6D poses of specific objects. Given the object coordinate frame $F_O$ and the camera coordinate frame $F_C$, the 6D pose is defined as the rigid transformation of $F_O$ with respect to $F_C$, represented by an element of $\mathrm{SE}(3)$, that is, a rotation matrix $R \in \mathrm{SO}(3)$ and a translation vector $T \in \mathbb{R}^3$.
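Written out, a point $p_O$ expressed in $F_O$ maps to $F_C$ as

$$p_C = R\, p_O + T, \qquad R \in \mathrm{SO}(3),\; T \in \mathbb{R}^3.$$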

We notice that the inconsistency between the non-linear rotation space and the linear translation space is

Experiments

In this section, we first introduce the experimental settings, including the benchmark datasets for evaluation (Section 4.1), the metrics (Section 4.2) and the implementation details of the training stage (Section 4.3). Next, we compare with state-of-the-art methods for 6D pose estimation on LineMOD [19] (Section 4.4), YCB-Video [20] (Section 4.5) and Occlusion LineMOD [46] (Section 4.6), and provide visualizations of detection results illustrating the effectiveness of our design. Then, we

Conclusion

In this work, we present KVNet, an iterative keypoint voting network for 6D pose estimation. This study pioneers the idea of treating the spatial uncertainty of keypoint votes as a Lipschitz-continuous function of the seed points’ fused embedding features. We then employ an iterative refinement procedure to refine the uncertainty-driven keypoint votes with minimal time consumption. Extensive experiments on three challenging benchmarks show that our method exhibits excellent performance.

CRediT authorship contribution statement

Fei Wang: Conceptualization. Xing Zhang: Methodology, Software. Tianyue Chen: Validation. Ze Shen: Investigation. Shangdong Liu: Data curation, Investigation. Zhenquan He: Visualization.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

This work was supported in part by the National Natural Science Foundation of China under Grants 61973065 and 52075531, the Fundamental Research Funds for the Central Universities of China under Grant N2104008, and the Central Government Guides the Local Science and Technology Development Special Fund under Grant 2021JH6/10500129.


References (51)

  • M. Oberweger et al. Generalized Feedback Loop for Joint Hand-Object Pose Estimation. IEEE Trans. Pattern Anal. Mach. Intell. (2020)
  • T.R. Luo et al. Dream-Experiment: A MR User Interface with Natural Multi-channel Interaction for Virtual Experiments. IEEE Trans. Visualization Comput. Graphics (2020)
  • A. Crivellaro et al. Robust 3D Object Tracking from Monocular Images Using Stable Parts. IEEE Trans. Pattern Anal. Mach. Intell. (2018)
  • D.F. Xu, D. Anguelov, A. Jain. PointFusion: Deep Sensor Fusion for 3D Bounding Box Estimation. In Proceedings of the...
  • X.K. Deng, Y. Xiang, A. Mousavian, C. Eppner, T. Bretl and D. Fox. Self-supervised 6D Object Pose Estimation for Robot...
  • S.H. Kasaei, N. Shafii, L.S. Lopes and A.M. Tomé. Interactive Open-Ended Object, Affordance and Grasp Learning for...
  • J.L. Yang, H.D. Li, and Y.D. Jia. Go-ICP: Solving 3D Registration Efficiently and Globally Optimally. In Proceedings of...
  • N. Mellado et al. Super 4PCS Fast Global Pointcloud Registration via Smart Indexing. Computer Graphics Forum (CGF) (2014)
  • S. Hinterstoisser et al. Gradient Response Maps for Real-Time Detection of Textureless Objects. IEEE Trans. Pattern Anal. Mach. Intell. (2012)
  • S. Hinterstoisser, V. Lepetit, S. Ilic, S. Holzer, G. Bradski, K. Konolige et al. Model Based Training, Detection and...
  • S.D. Peng, Y. Liu, Q.X. Huang, X.W. Zhou and H.J. Bao. PVNet: Pixel-Wise Voting Network for 6DoF Pose Estimation. In...
  • Y.L. Hu, J. Hugonot, P. Fua and M. Salzmann. Segmentation-Driven 6D Object Pose Estimation. In Proceedings of the...
  • M. Rad and V. Lepetit. BB8: A Scalable, Accurate, Robust to Partial Occlusion Method for Predicting the 3D Poses of...
  • B. Tekin, S.N. Sinha, and P. Fua. Real-Time Seamless Single Shot 6D Object Pose Prediction. In Proceedings of the IEEE...
  • J. Tremblay, T. To, B. Sundaralingam, Y. Xiang, D. Fox and S. Birchfield. Deep Object Pose Estimation for Semantic...
  • P. Wohlhart and V. Lepetit. Learning Descriptors for Object Recognition and 3D Pose Estimation. In Proceedings of the...
  • Y. Xiang, T. Schmidt, V. Narayanan and D. Fox. PoseCNN: A Convolutional Neural Network for 6D Object Pose Estimation in...
  • C. Wang, D.F. Xu, Y.K. Zhu, R. Martin-Martin, C.W. Lu, L. Fei-Fei, et al. DenseFusion: 6D Object Pose Estimation by...
  • S. Hinterstoisser, S. Holzer, C. Cagniart, S. Ilic, K. Konolige, N. Navab, et al. Multimodal Templates for Real-time...
  • B. Calli, A. Singh, A. Walsman, S. Srinivasa, P. Abbeel, and A.M. Dollar. The YCB Object and Model Set: Towards Common...
  • Y.S. He, W. Sun, H.B. Huang, J.R. Liu, H.Q. Fan and J. Sun. PVN3D: A Deep Point-Wise 3D Keypoints Voting Network for...
  • S. Li, L.X. Yang, J.Q. Huang, X.S. Hua and L. Zhang. Dynamic Anchor Feature Selection for Single-Shot Object Detection....
  • F. Sun et al. Feature Pyramid Reconfiguration with Consistent Loss for Object Detection. IEEE Trans. Image Process. (2019)
  • J. Redmon, S. Divvala, R. Girshick and A. Farhadi. You Only Look Once: Unified, Real-Time Object Detection. In...
  • P.J. Besl et al. A Method for Registration of 3-D Shapes. IEEE Trans. Pattern Anal. Mach. Intell. (1992)

Fei Wang received the B.S. and M.S. degrees in vehicle engineering from the Harbin Institute of Technology, China, in 1997 and 1999, and the Ph.D. degree in functional system engineering from the University of Tokushima, Japan, in 2004. From 2004 to 2005, he was a researcher with the Nissan Technical Center, Nissan Motor Co., Ltd. Since 2005, he has been an Associate Professor with the Faculty of Robot Science and Engineering, Northeastern University, China. He has authored or co-authored over 50 publications, primarily in robotics, pattern recognition and human-robot interaction. His research interests include multimode sensing and coordinated control of human-robot systems.

Xing Zhang received the Bachelor's degree in automation from Northeastern University, China, in 2018, and the Master's degree at the Faculty of Robot Science and Engineering, Northeastern University, China, in 2021. His research focuses on 3D vision perception, deep learning and computer graphics.

Tianyue Chen graduated with a bachelor's degree in aircraft design and engineering from the Beijing Institute of Technology in 2015 and a master's degree in aerospace science and technology from the Beijing Institute of Technology in 2018. Since 2018, she has been working as an engineer at the Shenyang Aircraft Design and Research Institute of the Aviation Industry Corporation of China, focusing on aircraft management system design.

Ze Shen is currently studying for a master's degree at the Faculty of Robot Science and Engineering, Northeastern University, China. His main research interests include computer vision and robotics.

Shangdong Liu began his master's degree at Northeastern University, China, in September 2020. He is currently working toward the Master's degree at the Faculty of Robot Science and Engineering, Northeastern University, China. His research interests include the fields of pattern recognition and 6D object pose estimation.

Zhenquan He began his master's degree at Northeastern University, China, in September 2019. He is currently working toward the Master's degree at the Faculty of Information Science and Engineering, Northeastern University, China. His research interests include the fields of pattern recognition and 3D model semantic segmentation.
