
KPAM: KeyPoint Affordances for Category-Level Robotic Manipulation

Conference paper in Robotics Research (ISRR 2019). Part of the book series: Springer Proceedings in Advanced Robotics (SPAR, volume 20).

Abstract

We would like robots to achieve purposeful manipulation by placing any instance from a category of objects into a desired set of goal states. Existing manipulation pipelines typically specify the desired configuration as a target 6-DOF pose and rely on explicitly estimating the pose of the manipulated objects. However, representing an object with a parameterized transformation defined on a fixed template cannot capture large intra-category shape variation, and specifying a target pose at a category level can be physically infeasible or fail to accomplish the task – e.g. knowing the pose and size of a coffee mug relative to some canonical mug is not sufficient to successfully hang it on a rack by its handle. Hence we propose a novel formulation of category-level manipulation that uses semantic 3D keypoints as the object representation. This keypoint representation enables a simple and interpretable specification of the manipulation target as geometric costs and constraints on the keypoints, which flexibly generalizes existing pose-based manipulation methods. Using this formulation, we factor the manipulation policy into instance segmentation, 3D keypoint detection, optimization-based robot action planning and local dense-geometry-based action execution. This factorization allows us to leverage advances in these sub-problems and combine them into a general and effective perception-to-action manipulation pipeline. Our pipeline is robust to large intra-category shape variation and topology changes as the keypoint representation ignores task-irrelevant geometric details. Extensive hardware experiments demonstrate our method can reliably accomplish tasks with never-before-seen objects in a category, such as placing shoes and mugs with significant shape variation into category-level target configurations. The video, supplementary material and source code are available on our project page https://sites.google.com/view/kpam.

L. Manuelli and W. Gao—These authors contributed equally to this work.


References

  1. Andrychowicz, M., et al.: Learning dexterous in-hand manipulation. arXiv preprint arXiv:1808.00177 (2018)

  2. Curless, B., Levoy, M.: A volumetric method for building complex models from range images. In: Proceedings of the 23rd Annual Conference on Computer Graphics and Interactive Techniques, pp. 303–312. ACM (1996)

  3. Florence, P.R., Manuelli, L., Tedrake, R.: Dense object nets: learning dense visual object descriptors by and for robotic manipulation. In: Conference on Robot Learning (CoRL) (2018)

  4. Gao, W., Tedrake, R.: FilterReg: robust and efficient probabilistic point-set registration using Gaussian filter and twist parameterization. arXiv preprint arXiv:1811.10136 (2018)

  5. Gualtieri, M., ten Pas, A., Platt, R.: Pick and place without geometric object models. In: 2018 IEEE International Conference on Robotics and Automation (ICRA), pp. 7433–7440. IEEE (2018)

  6. Gualtieri, M., ten Pas, A., Saenko, K., Platt, R.: High precision grasp pose detection in dense clutter. In: 2016 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 598–605. IEEE (2016)

  7. He, K., Gkioxari, G., Dollár, P., Girshick, R.: Mask R-CNN. In: Proceedings of the IEEE International Conference on Computer Vision (ICCV), pp. 2961–2969 (2017)

  8. Kalashnikov, D., et al.: QT-Opt: scalable deep reinforcement learning for vision-based robotic manipulation. arXiv preprint arXiv:1806.10293 (2018)

  9. Mahler, J., et al.: Learning ambidextrous robot grasping policies. Sci. Robot. 4(26), eaau4984 (2019)

  10. Mahler, J., et al.: Dex-Net 1.0: a cloud-based network of 3D objects for robust grasp planning using a multi-armed bandit model with correlated rewards. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1957–1964. IEEE (2016)

  11. Maitin-Shepard, J., Cusumano-Towner, M., Lei, J., Abbeel, P.: Cloth grasp point detection based on multiple-view geometric cues with application to robotic towel folding. In: 2010 IEEE International Conference on Robotics and Automation (ICRA), pp. 2308–2315. IEEE (2010)

  12. Marion, P., Florence, P.R., Manuelli, L., Tedrake, R.: LabelFusion: a pipeline for generating ground truth labels for real RGBD data of cluttered scenes. arXiv preprint arXiv:1707.04796 (2017)

  13. Miller, S., Fritz, M., Darrell, T., Abbeel, P.: Parametrized shape models for clothing. In: 2011 IEEE International Conference on Robotics and Automation (ICRA), pp. 4861–4868. IEEE (2011)

  14. Miller, S., van den Berg, J., Fritz, M., Darrell, T., Goldberg, K., Abbeel, P.: A geometric approach to robotic laundry folding. Int. J. Robot. Res. 31(2), 249–267 (2012)

  15. Myronenko, A., Song, X.: Point set registration: coherent point drift. IEEE Trans. Pattern Anal. Mach. Intell. 32(12), 2262–2275 (2010)

  16. Rodriguez, D., Cogswell, C., Koo, S., Behnke, S.: Transferring grasping skills to novel instances by latent space non-rigid registration. In: 2018 IEEE International Conference on Robotics and Automation (ICRA), pp. 1–8. IEEE (2018)

  17. Sahin, C., Kim, T.K.: Category-level 6D object pose recovery in depth images. In: Leal-Taixé, L., Roth, S. (eds.) ECCV 2018. LNCS, vol. 11129, pp. 665–681. Springer, Cham (2019). https://doi.org/10.1007/978-3-030-11009-3_41

  18. Schmidt, T., Newcombe, R., Fox, D.: Self-supervised visual descriptor learning for dense correspondence. IEEE Robot. Autom. Lett. 2(2), 420–427 (2017)

  19. Schwarz, M., et al.: Fast object learning and dual-arm coordination for cluttered stowing, picking, and packing. In: 2018 IEEE International Conference on Robotics and Automation (ICRA), pp. 3347–3354. IEEE (2018)

  20. Seita, D., et al.: Robot bed-making: deep transfer learning using depth sensing of deformable fabric. arXiv preprint arXiv:1809.09810 (2018)

  21. Sun, X., Xiao, B., Wei, F., Liang, S., Wei, Y.: Integral human pose regression. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 529–545 (2018)

  22. Tedrake, R., the Drake Development Team: Drake: a planning, control, and analysis toolbox for nonlinear dynamical systems (2016)

  23. Tremblay, J., To, T., Sundaralingam, B., Xiang, Y., Fox, D., Birchfield, S.: Deep object pose estimation for semantic robotic grasping of household objects. In: Conference on Robot Learning (CoRL) (2018)

  24. van den Berg, J., Miller, S., Goldberg, K., Abbeel, P.: Gravity-based robotic cloth folding. In: Hsu, D., Isler, V., Latombe, J.C., Lin, M.C. (eds.) Algorithmic Foundations of Robotics IX. Springer Tracts in Advanced Robotics, vol. 68, pp. 409–424. Springer, Heidelberg (2010). https://doi.org/10.1007/978-3-642-17452-0_24

  25. Wang, H., Sridhar, S., Huang, J., Valentin, J., Song, S., Guibas, L.J.: Normalized object coordinate space for category-level 6D object pose and size estimation. arXiv preprint arXiv:1901.02970 (2019)

  26. Xiang, Y., Mottaghi, R., Savarese, S.: Beyond PASCAL: a benchmark for 3D object detection in the wild. In: IEEE Winter Conference on Applications of Computer Vision (WACV), pp. 75–82. IEEE (2014)

  27. Zeng, A., et al.: Robotic pick-and-place of novel objects in clutter with multi-affordance grasping and cross-domain image matching. In: 2018 IEEE International Conference on Robotics and Automation (ICRA), pp. 1–8. IEEE (2018)

  28. Zeng, A., et al.: Multi-view self-supervised deep learning for 6D pose estimation in the Amazon Picking Challenge. In: 2017 IEEE International Conference on Robotics and Automation (ICRA), pp. 1383–1386. IEEE (2017)


Acknowledgements

The authors thank Ethan Weber (instance segmentation training data generation) and Pat Marion (visualization) for their help. This work was supported by: National Science Foundation, Award No. IIS-1427050; Draper Laboratory Incorporated, Award No. SC001-0000001002; Lockheed Martin Corporation, Award No. RPP2016-002; Amazon Research Award.

Author information

Corresponding author: Lucas Manuelli.

Electronic Supplementary Material

Below is the link to the electronic supplementary material.

Supplementary material 1 (zip 8205 KB)

Appendices

A Robot Hardware

Our experimental setup consists of a robot arm, an end-effector-mounted RGBD camera and a parallel-jaw gripper. Our robot is a 7-DOF Kuka IIWA LBR. Mounted on the end-effector is a Schunk WSG 50 parallel-jaw gripper. Additionally, we mount a Primesense Carmine 1.09 RGBD sensor to the gripper body.

B Dataset Generation and Annotation

In order to reduce the human annotation time required for neural network training, we use a data collection pipeline similar to that used in [3]. The main idea is to collect many RGBD images of a static scene and perform a dense 3D reconstruction. Then, similarly to [12], we can label the 3D reconstruction and propagate these labels back to the individual RGBD frames. This 3D-to-2D labelling approach allows us to generate over 100,000 labelled images with only a few hours of human annotation time.

B.1 3D Reconstruction and Masking

Here we give a brief overview of the approach used to generate the 3D reconstruction; more details can be found in [3]. Our data is made up of 3D reconstructions of a static scene containing a single object of interest. Using the wrist-mounted camera on the robot, we move the robot's end-effector to capture a variety of RGBD images of the static scene. From the robot's forward kinematics we know the camera pose corresponding to each image, which allows us to use TSDF fusion [2] to obtain a dense 3D reconstruction. After discarding images that were taken from very similar poses, we are left with approximately 400 RGBD images per scene.
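As a concrete illustration of this fusion step, the minimal sketch below integrates posed RGBD frames into a TSDF volume using Open3D. The paper does not specify which TSDF implementation was used; the intrinsics, voxel size, truncation distance and the `frames` iterable are illustrative assumptions.

```python
import numpy as np
import open3d as o3d

# Assumed intrinsics for a Primesense-class RGBD sensor (illustration only).
intrinsic = o3d.camera.PinholeCameraIntrinsic(640, 480, 525.0, 525.0, 319.5, 239.5)

# Voxel size and truncation distance are illustrative choices, not the paper's values.
volume = o3d.pipelines.integration.ScalableTSDFVolume(
    voxel_length=0.004,
    sdf_trunc=0.02,
    color_type=o3d.pipelines.integration.TSDFVolumeColorType.RGB8)

def fuse_scene(frames):
    """frames: iterable of (rgb_path, depth_path, T_world_camera) tuples, where
    T_world_camera is the 4x4 camera pose obtained from the robot's forward kinematics."""
    for rgb_path, depth_path, T_world_camera in frames:
        color = o3d.io.read_image(rgb_path)
        depth = o3d.io.read_image(depth_path)
        rgbd = o3d.geometry.RGBDImage.create_from_color_and_depth(
            color, depth, depth_scale=1000.0, depth_trunc=1.5,
            convert_rgb_to_intensity=False)
        # integrate() expects a world-to-camera extrinsic, so invert the FK pose.
        volume.integrate(rgbd, intrinsic, np.linalg.inv(T_world_camera))
    return volume.extract_triangle_mesh()
```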

The next step is to detect which parts of the 3D reconstruction correspond to the object of interest. This is done using the change detection method described in [2]. In our particular setup all the reconstructions were of a tabletop scene in front of the robot. Since our reconstructions are globally aligned (because we use the robot's forward kinematics to compute camera poses), we can simply crop the 3D reconstruction to the area above the table. At this point we have the portion of the 3D reconstruction that corresponds to the object of interest. This, together with the known camera poses, allows us to easily render binary masks (which segment the object from the background) for each RGBD image.
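A minimal sketch of this cropping and mask-rendering step, assuming the fused reconstruction is available as a point cloud in the robot frame: points above the table are kept, transformed into each camera frame with the FK-derived pose, and splatted into the image plane. The point-splat approximation, the table-height threshold and the splat radius are assumptions, not the paper's exact rendering pipeline.

```python
import numpy as np

def crop_to_object(points_world, table_height, margin=0.01):
    """Keep only the reconstructed points above the tabletop (the object of interest).
    table_height and margin are in meters and are illustrative values."""
    return points_world[points_world[:, 2] > table_height + margin]

def render_object_mask(object_points_world, T_world_camera, K, image_shape, splat_px=2):
    """Render a binary foreground mask for one RGBD frame by projecting the cropped
    object points into the image; a simple point-splat stand-in for true mesh rendering."""
    H, W = image_shape
    T_camera_world = np.linalg.inv(T_world_camera)
    pts = (T_camera_world[:3, :3] @ object_points_world.T + T_camera_world[:3, 3:4]).T
    pts = pts[pts[:, 2] > 0]                       # keep points in front of the camera
    uvw = (K @ pts.T).T
    u = np.round(uvw[:, 0] / uvw[:, 2]).astype(int)
    v = np.round(uvw[:, 1] / uvw[:, 2]).astype(int)
    valid = (u >= 0) & (u < W) & (v >= 0) & (v < H)
    mask = np.zeros((H, W), dtype=bool)
    for du in range(-splat_px, splat_px + 1):      # dilate each projected point slightly
        for dv in range(-splat_px, splat_px + 1):
            mask[np.clip(v[valid] + dv, 0, H - 1), np.clip(u[valid] + du, 0, W - 1)] = True
    return mask
```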

B.2 Instance Segmentation

The instance segmentation network requires training images with pixelwise semantic labels. Using the background subtraction technique detailed in Sect. B.1, we have pixelwise labels for all the images in our 3D reconstructions. However, these images contain only a single object, while we need the instance segmentation network to handle multiple instances at test time. Thus, we augment the training data by creating multi-object composite images from our single-object annotated images using a method similar to [19]. We crop the object from one image (using the binary mask described in Sect. B.1) and paste this cropped section on top of an existing background. This process can be repeated to generate composite images with arbitrary numbers of objects, as sketched below. Examples of such images are shown in Fig. 8.
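The sketch below illustrates this compositing idea under stated assumptions: a single masked object crop is pasted onto a background at a random location while an instance label map is updated. The function name, the label encoding and the random placement scheme are assumptions for illustration, not the exact procedure of [19].

```python
import numpy as np

def paste_object(background, labels, object_rgb, object_mask, instance_id, rng):
    """Paste one masked object crop onto the background at a random location and record
    its instance id in the label image. Call repeatedly to build multi-object composites."""
    ys, xs = np.where(object_mask)
    y0, y1, x0, x1 = ys.min(), ys.max() + 1, xs.min(), xs.max() + 1
    crop_rgb = object_rgb[y0:y1, x0:x1]
    crop_mask = object_mask[y0:y1, x0:x1]
    H, W = background.shape[:2]
    h, w = crop_mask.shape
    top = rng.integers(0, H - h)
    left = rng.integers(0, W - w)
    background[top:top + h, left:left + w][crop_mask] = crop_rgb[crop_mask]
    labels[top:top + h, left:left + w][crop_mask] = instance_id
    return background, labels

# Usage sketch: start from a background image and an all-zero label map, then paste
# several single-object crops to obtain one composite training image.
# rng = np.random.default_rng(0)
# for i, (rgb, mask) in enumerate(single_object_crops, start=1):
#     composite, labels = paste_object(composite, labels, rgb, mask, i, rng)
```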

Fig. 8. Multi-object composite images used in instance segmentation training.

Fig. 9. A screenshot from our custom keypoint annotation tool.

B.3 Keypoint Detection

The keypoint detection network requires training images annotated with pixel coordinates and depth for each keypoint. As mentioned in Sect. 3.2, we annotate 3D keypoints on the reconstructed mesh, transform the keypoints into the camera frame and project the keypoints into each image. To annotate the 3D keypoints on the reconstructed mesh, we developed a custom labelling tool based on the Director [2] user interface, shown in Fig. 9. We labelled a total of 117 scenes, 43 of which were shoes and 74 of which were mugs. Annotating these scenes took only a few hours and resulted in over 100,000 labelled images for keypoint network training.
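A minimal sketch of the projection step described above, assuming the annotated keypoints are expressed in the same frame as the reconstruction and a standard pinhole camera model; the function name and array shapes are assumptions.

```python
import numpy as np

def project_keypoints(keypoints_world, T_world_camera, K):
    """Transform 3D keypoints annotated on the reconstructed mesh (world frame) into a
    camera frame and project them to pixel coordinates, keeping each keypoint's depth.
    keypoints_world: (N, 3), T_world_camera: 4x4 FK camera pose, K: 3x3 intrinsics."""
    T_camera_world = np.linalg.inv(T_world_camera)
    kp_cam = (T_camera_world[:3, :3] @ keypoints_world.T + T_camera_world[:3, 3:4]).T
    uvw = (K @ kp_cam.T).T
    pixels = uvw[:, :2] / uvw[:, 2:3]   # (u, v) pixel coordinates per keypoint
    depths = kp_cam[:, 2]               # keypoint depth in the camera frame
    return pixels, depths
```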

C Neural Network Architecture and Training

C.1 Instance Segmentation

For the instance segmentation, we used an open-source Mask R-CNN implementation [2]. We used an R-101-FPN backbone that was pretrained on the COCO dataset [2]. We then fine-tuned on a dataset of 10,000 images generated using the procedure outlined in Sect. B.2. The network was trained for 40,000 iterations using the default training schedule of [2].
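Purely to illustrate the fine-tuning step, the sketch below sets up torchvision's COCO-pretrained Mask R-CNN with heads resized for a small custom label set. Note this uses a ResNet-50-FPN backbone rather than the R-101-FPN used in the paper, and the class count, optimizer settings and training loop are assumptions, not the paper's schedule.

```python
import torch
import torchvision
from torchvision.models.detection.faster_rcnn import FastRCNNPredictor
from torchvision.models.detection.mask_rcnn import MaskRCNNPredictor

num_classes = 3  # background + shoe + mug (assumed label set)

# COCO-pretrained Mask R-CNN; a ResNet-50-FPN substitute for the paper's R-101-FPN.
model = torchvision.models.detection.maskrcnn_resnet50_fpn(weights="DEFAULT")

# Replace the box and mask heads so they predict the new classes.
in_features = model.roi_heads.box_predictor.cls_score.in_features
model.roi_heads.box_predictor = FastRCNNPredictor(in_features, num_classes)
in_channels_mask = model.roi_heads.mask_predictor.conv5_mask.in_channels
model.roi_heads.mask_predictor = MaskRCNNPredictor(in_channels_mask, 256, num_classes)

optimizer = torch.optim.SGD(model.parameters(), lr=0.005, momentum=0.9, weight_decay=5e-4)

def train_step(images, targets):
    """One fine-tuning step on composite images; targets follow torchvision's
    detection format (boxes, labels, masks)."""
    model.train()
    loss_dict = model(images, targets)
    loss = sum(loss_dict.values())
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return float(loss)
```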

Fig. 10. 3D visualization of the point cloud and keypoint detections for the image from (a). The keypoints are colored as in Fig. 6: the top center keypoint is green, the bottom center keypoint is red, and the handle center keypoint is purple.

C.2 Keypoint Detection

We modify the integral network [21] for 3D keypoint detection. The network takes as input images cropped by the bounding box from Mask R-CNN. The network produces a probability distribution map \(g_i(u, v)\) that represents how likely keypoint i is to occur at pixel \((u, v)\), with \(\sum_{u,v} g_i(u,v) = 1\). We then compute the expected value of this spatial distribution to recover the pixel coordinates of keypoint i (Fig. 10):

$$[u_i, v_i]^T = \sum_{u, v} \left[\, u \cdot g_i(u,v),\; v \cdot g_i(u,v) \,\right]^T \qquad (8)$$

For the z coordinate (depth) of each keypoint, we also predict a per-pixel depth value, denoted \(d_i(u,v)\). The depth of keypoint i is then computed as

$$z_i = \sum_{u, v} d_i(u,v) \cdot g_i(u,v) \qquad (9)$$

Given training images with annotated pixel coordinates and depth for each keypoint, we use the integral loss and heatmap regression loss (see Sect. 2 of [21] for details) to train the network. We use a network with a 34-layer ResNet backbone. The network is trained on a dataset generated using the procedure described in Sect. B.3.
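A minimal sketch of the expectation computations in Eqs. (8) and (9), assuming the network outputs one heatmap and one per-pixel depth map per keypoint; the tensor shapes and function name are assumptions.

```python
import torch
import torch.nn.functional as F

def integral_keypoints(heatmap_logits, depth_maps):
    """Soft-argmax keypoint regression following Eqs. (8) and (9).
    heatmap_logits: (K, H, W) unnormalized scores for each keypoint.
    depth_maps:     (K, H, W) per-pixel depth predictions d_i(u, v)."""
    num_kp, H, W = heatmap_logits.shape
    # Normalize each heatmap so that sum_{u,v} g_i(u, v) = 1.
    g = F.softmax(heatmap_logits.view(num_kp, -1), dim=1).view(num_kp, H, W)
    us = torch.arange(W, dtype=g.dtype, device=g.device).view(1, 1, W)
    vs = torch.arange(H, dtype=g.dtype, device=g.device).view(1, H, 1)
    u_i = (g * us).sum(dim=(1, 2))          # Eq. (8): expected u coordinate
    v_i = (g * vs).sum(dim=(1, 2))          # Eq. (8): expected v coordinate
    z_i = (g * depth_maps).sum(dim=(1, 2))  # Eq. (9): expected depth
    return u_i, v_i, z_i
```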

D Experiments

Figures 11, 12 and 13 illustrate the results of the experiments. These figures contain tiled images showing the initial RGB image used for keypoint detection, along with an image of the object after running the kPAM pipeline. In the following sections we discuss more details of the mug-on-shelf and mug-on-rack experiments.

Fig. 11. Before and after images of the shoe on rack experiment for all 100 trials.

Fig. 12. Before and after images of the mug on rack experiments for all 120 trials.

Fig. 13. Before and after images of the mug on shelf experiments for all 118 trials.

D.1 Mugs Upright on Shelf

Results for the mug on shelf experiment are detailed in Fig. 7. A trial was classified as a success if the mug ended up upright on the shelf with its bottom center keypoint within 5 cm of the target location. Out of 118 trials we experienced 2 failures. One failure was due to a combination of inaccurate keypoint detections together with the mug being torqued as it was grasped. Since we only have a wrist-mounted camera, we cannot re-perceive the object to compensate for the fact that the object moves during the grasping process. As discussed in Sect. 6, this could be alleviated by adding an externally mounted camera.
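As an illustration of this success criterion, a hypothetical check might look like the following. The 5 cm distance threshold is taken from the text; the function name, inputs and the angular tolerance used to decide uprightness from the bottom-to-top keypoint axis are assumptions.

```python
import numpy as np

def mug_on_shelf_success(bottom_kp, top_kp, target_xyz, dist_tol=0.05, axis_tol_deg=30.0):
    """Hypothetical success check: the bottom center keypoint must land within 5 cm of the
    target location, and the mug axis (bottom -> top keypoint) must point roughly upward.
    The 30-degree axis tolerance is an assumption for illustration."""
    close_enough = np.linalg.norm(np.asarray(bottom_kp) - np.asarray(target_xyz)) < dist_tol
    axis = np.asarray(top_kp) - np.asarray(bottom_kp)
    axis = axis / np.linalg.norm(axis)
    upright = np.degrees(np.arccos(np.clip(axis[2], -1.0, 1.0))) < axis_tol_deg
    return close_enough and upright
```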

The other failure resulted from the mug being placed upside down. Figure 14 shows the RGB image used for keypoint detection, along with the final position of the mug. As discussed in Sect. 5.2, this failure occurred because the keypoint detection confused the top and bottom of the mug. Given that the image was taken from a side view where the handle is occluded and it is difficult to distinguish the top from the bottom, it is understandable that the keypoint detection failed in this case. There are several ways to deal with this type of issue in the future. One approach would be to additionally predict a confidence value for each keypoint detection. This would allow us to detect that we were uncertain about the keypoint detections in Fig. 14(a). We could then move the robot and collect another image that would allow us to unambiguously detect the keypoints.

Fig. 14. (a) The RGB image for the single failure trial of the mug on shelf task that led to the mug being put in an incorrect orientation. In this case the keypoint detection confused the top and bottom of the mug and it was placed upside down. (b) The resulting upside-down placement of the mug.

Fig. 15. The five mugs on the left are the test mugs used in the experiments that were characterized as small. For comparison, the four mugs on the right are part of the regular category.

D.2 Hang Mug on Rack by Its Handle

As discussed in Sect. 5.3, mugs were divided into two groups, regular and small, based on their size. A mug was characterized as small if its handle had a minimum dimension (either height or width) of less than 2 cm. Examples of mugs from each category are shown in Fig. 15. Mugs with such small handles presented a challenge for our manipulation pipeline, since hanging them on the rack requires increased precision.


Copyright information

© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper


Cite this paper

Manuelli, L., Gao, W., Florence, P., Tedrake, R. (2022). KPAM: KeyPoint Affordances for Category-Level Robotic Manipulation. In: Asfour, T., Yoshida, E., Park, J., Christensen, H., Khatib, O. (eds) Robotics Research. ISRR 2019. Springer Proceedings in Advanced Robotics, vol 20. Springer, Cham. https://doi.org/10.1007/978-3-030-95459-8_9
