Selective Embedding with Gated Fusion for 6D Object Pose Estimation

Neural Processing Letters
Abstract

Deep learning methods for 6D object pose estimation from RGB and depth (RGB-D) images have been successfully applied to robot grasping. Fusing the RGB and depth modalities remains one of the main difficulties: previous works mostly concatenate the two types of features without considering their different contributions to pose estimation. We propose a selective embedding with gated fusion structure, called SEGate, which adaptively adjusts the weights of the RGB and depth features. Furthermore, we aggregate local point-cloud features according to the distances between points: nearby points contribute strongly to the local feature, while distant points contribute little. Experiments show that our approach achieves state-of-the-art performance on both the LineMOD and YCB-Video datasets, and it is more robust when estimating the poses of occluded objects.
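As a rough illustration of these two ideas (a minimal sketch, not the paper's implementation), the code below shows a per-point gated fusion block that learns how much to trust the colour versus the geometric embedding, and a distance-weighted neighbourhood aggregation in which nearer points receive larger weights. The class and function names, the feature dimension dim, and the neighbourhood size k are illustrative assumptions.

```python
import torch
import torch.nn as nn

class GatedFusion(nn.Module):
    """Sketch of a gated fusion block for per-point RGB and geometric features."""
    def __init__(self, dim):
        super().__init__()
        # The gate is predicted from both modalities jointly (assumed design).
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())

    def forward(self, rgb_feat, geo_feat):
        # rgb_feat, geo_feat: (num_points, dim) per-point embeddings
        g = self.gate(torch.cat([rgb_feat, geo_feat], dim=-1))
        # g re-weights the colour branch, (1 - g) the geometric branch.
        return g * rgb_feat + (1.0 - g) * geo_feat


def aggregate_local(points, feats, k=16):
    """Distance-weighted aggregation of local point-cloud features (assumed form)."""
    # points: (N, 3) point cloud, feats: (N, C) per-point features
    dists = torch.cdist(points, points)            # (N, N) pairwise distances
    knn_d, idx = dists.topk(k, largest=False)      # k nearest neighbours per point
    weights = torch.softmax(-knn_d, dim=-1)        # closer points -> larger weight
    neigh = feats[idx]                             # (N, k, C) gathered neighbour features
    return (weights.unsqueeze(-1) * neigh).sum(1)  # (N, C) aggregated local feature
```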

References

  1. Munaro M, Menegatti E (2014) Fast RGB-D people tracking for service robots. Auton Robot 37(3):227–242

  2. Hinterstoisser S, Cagniart C, Ilic S, Sturm P, Navab N, Fua P, Lepetit V (2011) Gradient response maps for real-time detection of textureless objects. IEEE Trans PAMI 34(5):876–888

  3. Hinterstoisser S, Lepetit V, Ilic S, Holzer S, Bradski G, Konolige K, Navab N (2012) Model based training, detection and pose estimation of texture-less 3d objects in heavily cluttered scenes. In: Asian conference on computer vision, pp 548–562

  4. Besl PJ, McKay ND (1992) Method for registration of 3-D shapes. Sensor Fusion IV: Control Paradigms and Data Structures 1611:586–606

  5. Drost B, Ulrich M, Navab N, Ilic S (2010) Model globally, match locally: efficient and robust 3D object recognition. In: IEEE computer society conference on computer vision and pattern recognition, pp 998–1005

  6. Papazov C, Burschka D (2010) An efficient ransac for 3d object recognition in noisy and occluded scenes. In: Asian conference on computer vision, pp 135–148

  7. Hinterstoisser S, Lepetit V, Rajkumar N, Konolige K (2016) Going further with point pair features. In: European conference on computer vision, pp 834–848

  8. Kiforenko L, Drost B, Tombari F, Kruger N, Buch AG (2018) A performance evaluation of point pair features. Comput Vis Image Underst 166:66–80

  9. Schnabel R, Wahl R, Klein R (2007) Efficient RANSAC for point-cloud shape detection. Comput Graph Forum 26(2):214–226

  10. Aldoma A, Marton ZC, Tombari F, Wohlkinger W, Potthast C, Zeisl B, Vincze M (2012) Tutorial: point cloud library: three-dimensional object recognition and 6 dof pose estimation. IEEE Robot Automation Mag 19(3):80–91

  11. Aldoma A, Tombari F, Stefano LD, Vincze M (2012) A global hypotheses verification method for 3d object recognition. In: European conference on computer vision, pp 511–524

  12. Guo Y, Bennamoun M, Sohel F, Lu M, Wan J, Kwok NM (2016) A comprehensive performance evaluation of 3D local feature descriptors. Int J Comput Vis 116(1):66–89

  13. Doumanoglou A, Kouskouridas R, Malassiotis S, Kim TK (2016) Recovering 6D object pose and predicting next-best-view in the crowd. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 3583–3592

  14. Tejani A, Kouskouridas R, Doumanoglou A, Tang D, Kim TK (2017) Latent-class hough forests for 6 DoF object pose estimation. IEEE Trans PAMI 40(1):119–132

  15. Brachmann E, Krull A, Michel F, Gumhold S, Shotton J, Rother C (2014) Learning 6d object pose estimation using 3d object coordinates. In: European conference on computer vision, pp 536–551

  16. Brachmann E, Michel F, Krull A, Yang MY, Gumhold S (2016) Uncertainty-driven 6d pose estimation of objects and scenes from a single rgb image. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 3364–3372

  17. Rangaprasad AS (2017) Probabilistic approaches for pose estimation, Carnegie Mellon University

  18. Rad M, Lepetit V (2017) BB8: a scalable, accurate, robust to partial occlusion method for predicting the 3D poses of challenging objects without using depth. In: Proceedings of the IEEE international conference on computer vision, pp 3828–3836

  19. Kehl W, Manhardt F, Tombari F, Ilic S, Navab N (2017) SSD-6D: making RGB-based 3D detection and 6D pose estimation great again. In: Proceedings of the IEEE international conference on computer vision, pp 1521–1529

  20. Xiang Y, Schmidt T, Narayanan V, Fox D (2017) Posecnn: a convolutional neural network for 6d object pose estimation in cluttered scenes. Preprint arXiv:1711.00199

  21. Li C, Bai J, Hager GD (2018) A unified framework for multi-view multi-class object pose estimation. In: Proceedings of the european conference on computer vision (ECCV), pp 254–269

  22. Wang C, Xu D, Zhu Y, Martín-Martín R, Lu C, Fei-Fei L (2019) Densefusion: 6d object pose estimation by iterative dense fusion. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 3343–3352

  23. Suwajanakorn S, Snavely N, Tompson JJ, Norouzi M (2018) Discovery of latent 3d keypoints via end-to-end geometric reasoning. In: Advances in neural information processing systems, pp 2059–2070

  24. Tremblay J, To T, Sundaralingam B, Xiang Y, Fox D, Birchfield S (2018) Deep object pose estimation for semantic robotic grasping of household objects. Preprint arXiv:1809.10790

  25. Kendall A, Grimes M, Cipolla R (2015) Posenet: a convolutional network for real-time 6-dof camera relocalization. In: Proceedings of the IEEE international conference on computer vision, pp 2938–2946

  26. Song S, Xiao J (2016) Deep sliding shapes for amodal 3d object detection in rgb-d images. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp 808–816

  27. Li C, Lu B, Zhang Y, Liu H, Qu Y (2018) 3D reconstruction of indoor scenes via image registration. Neural Process Lett 48(3):1281–1304

  28. Qi CR, Liu W, Wu C, Su H, Guibas LJ (2018) Frustum pointnets for 3d object detection from rgb-d data. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 918–927

  29. Zhou Y, Tuzel O (2018) Voxelnet: end-to-end learning for point cloud based 3d object detection. In: Proceedings of the IEEE Conference on computer vision and pattern recognition, pp 4490–4499

  30. Guo D, Li W, Fang X (2017) Capturing temporal structures for video captioning by spatio-temporal contexts and channel attention mechanism. Neural Process Lett 46(1):313–328

  31. Wang F, Jiang M, Qian C, Yang S, Li C, Zhang H, Wang X, Tang X (2017) Residual attention network for image classification. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 3156–3164

  32. Park J, Woo S, Lee JY, Kweon IS (2018) Bam: bottleneck attention module. Preprint arXiv:1807.06514

  33. Woo S, Park J, Lee JY, Kweon IS (2018) Cbam: convolutional block attention module. In: Proceedings of the european conference on computer vision (ECCV), pp 3–19

  34. Wojek C, Walk S, Roth S, Schiele B (2011) Monocular 3D scene understanding with explicit occlusion reasoning. In: CVPR 2011, pp 1993–2000

  35. Xu Y, Zhou X, Liu P, Xu H (2019) Rapid pedestrian detection based on deep omega-shape features with partial occlusion handing. Neural Process Lett 49(3):923–937

  36. Sanyal R, Ahmed SM, Jaiswal M, Chaudhury KN (2017) A scalable ADMM algorithm for rigid registration. IEEE Signal Process Lett 24(10):1453–1457

  37. Eitel A, Springenberg J T, Spinello L, Riedmiller M, Burgard W (2015) Multimodal deep learning for robust RGB-D object recognition. In: IEEE/RSJ international conference on intelligent robots and systems (IROS), pp 681–687

  38. Wang W, Neumann U (2018) Depth-aware cnn for rgb-d segmentation. In: Proceedings of the european conference on computer vision (ECCV), pp 135–150

  39. Hu J, Shen L, Sun G (2018) Squeeze-and-excitation networks. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 7132–7141

  40. Bell S, Lawrence Zitnick C, Bala K, Girshick R (2016) Inside-outside net: detecting objects in context with skip pooling and recurrent neural networks. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 2874–2883

  41. Cheng Y, Cai R, Li Z, Zhao X, Huang K (2017) Locality-sensitive deconvolution networks with gated fusion for rgb-d indoor semantic segmentation. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 3029–3037

  42. Qi CR, Su H, Mo K, Guibas LJ (2017) Pointnet: deep learning on point sets for 3d classification and segmentation. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 652–660

  43. Qi CR, Yi L, Su H, Guibas LJ (2017) Pointnet++: deep hierarchical feature learning on point sets in a metric space. In: Advances in neural information processing systems, pp 5099–5108

  44. Sundermeyer M, Marton ZC, Durner M, Brucker M, Triebel R (2018) Implicit 3d orientation learning for 6d object detection from rgb images. In: Proceedings of the european conference on computer vision (ECCV), pp 699–715

  45. Xu D, Anguelov D, Jain A (2018) Pointfusion: deep sensor fusion for 3d bounding box estimation. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 244–253

  46. Hinterstoisser S, Holzer S, Cagniart C, Ilic S, Konolige K, Navab N, Lepetit V (2011) Multimodal templates for real-time detection of texture-less objects in heavily cluttered scenes. In: International conference on computer vision, pp 858–865

Acknowledgements

This work was supported by the National Natural Science Foundation of China under Grant 61231010.

Author information

Corresponding author

Correspondence to Rongke Liu.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendix

The LineMOD dataset contains more than 18,000 real images in 13 video sequences, each of which contains several low-textured objects. It is a classic benchmark that is widely used by both traditional approaches and recent deep learning methods. Figure 7 shows qualitative results for all remaining objects on the LineMOD dataset.

Fig. 8 Qualitative results of severely occluded objects on the YCB-Video dataset

Table 4 Quantitative evaluation of 6D pose (ADD [20]) on the YCB-Video dataset

The YCB-Video dataset consists of 21 objects in 92 RGB-D video sequences, with a total of 133,827 frames and 3 to 9 objects per frame. Because many of its objects are heavily occluded, it is often used to evaluate pose estimation under severe occlusion. In Fig. 8 we compare pose estimates for the most severely occluded objects in the YCB-Video dataset. As shown in Fig. 8, PoseCNN+ICP and DenseFusion perform poorly on these objects (tomato_soup_can, bowl, cracker_box, wood_block, large_clamp and pudding_box) due to the severe occlusion, whereas our approach remains robust when estimating the poses of heavily occluded objects.

Table 4 lists the results for all 21 objects on the YCB-Video dataset under the ADD metrics. Our SEGate method outperforms DenseFusion by 1.3% on the ADD(< 2 cm) metric and is essentially on par with it on the ADD(AUC) metric. PoseCNN+ICP performs slightly better on the AUC metric, but the ICP refinement makes it extremely time-consuming. Note that the MEAN values of ADD(AUC) and ADD(< 2 cm) are computed over all test frames in the dataset: they are, respectively, the area under the accuracy-vs-threshold curve and the fraction of pose estimates with error below 2 cm, not the average of the 21 per-object values in Table 4.
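To make these two summary numbers concrete, the sketch below (our reading of the metric definitions, not the authors' evaluation code) computes the ADD error for a single pose and then the pooled ADD(AUC) and ADD(< 2 cm) values over a set of per-frame errors. The function names and the 10 cm threshold cap are assumptions following the PoseCNN [20] convention.

```python
import numpy as np

def add_error(model_pts, R_gt, t_gt, R_pred, t_pred):
    """Mean distance between model points under the ground-truth and the
    predicted pose (the ADD metric of [20]). model_pts: (M, 3)."""
    gt = model_pts @ R_gt.T + t_gt
    pred = model_pts @ R_pred.T + t_pred
    return np.linalg.norm(gt - pred, axis=1).mean()

def add_summaries(errors, max_err=0.10, steps=1000):
    """errors: ADD errors (metres) pooled over every test frame of every object.
    Returns the normalised area under the accuracy-vs-threshold curve
    (thresholds from 0 to max_err) and the accuracy at the 2 cm threshold."""
    errors = np.asarray(errors)
    thresholds = np.linspace(0.0, max_err, steps)
    accuracy = np.array([(errors < th).mean() for th in thresholds])
    auc = accuracy.mean()            # equally spaced thresholds, so mean ~ AUC
    acc_2cm = (errors < 0.02).mean()
    return auc, acc_2cm
```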

About this article

Cite this article

Sun, S., Liu, R., Du, Q. et al. Selective Embedding with Gated Fusion for 6D Object Pose Estimation. Neural Process Lett 51, 2417–2436 (2020). https://doi.org/10.1007/s11063-020-10198-8
