
KPAM: KeyPoint Affordances for Category-Level Robotic Manipulation

Conference paper in Robotics Research (ISRR 2019). Part of the book series: Springer Proceedings in Advanced Robotics (SPAR, volume 20).

Abstract

We would like robots to achieve purposeful manipulation by placing any instance from a category of objects into a desired set of goal states. Existing manipulation pipelines typically specify the desired configuration as a target 6-DOF pose and rely on explicitly estimating the pose of the manipulated objects. However, representing an object with a parameterized transformation defined on a fixed template cannot capture large intra-category shape variation, and specifying a target pose at a category level can be physically infeasible or fail to accomplish the task – e.g. knowing the pose and size of a coffee mug relative to some canonical mug is not sufficient to successfully hang it on a rack by its handle. Hence we propose a novel formulation of category-level manipulation that uses semantic 3D keypoints as the object representation. This keypoint representation enables a simple and interpretable specification of the manipulation target as geometric costs and constraints on the keypoints, which flexibly generalizes existing pose-based manipulation methods. Using this formulation, we factor the manipulation policy into instance segmentation, 3D keypoint detection, optimization-based robot action planning and local dense-geometry-based action execution. This factorization allows us to leverage advances in these sub-problems and combine them into a general and effective perception-to-action manipulation pipeline. Our pipeline is robust to large intra-category shape variation and topology changes as the keypoint representation ignores task-irrelevant geometric details. Extensive hardware experiments demonstrate our method can reliably accomplish tasks with never-before-seen objects in a category, such as placing shoes and mugs with significant shape variation into category-level target configurations. The video, supplementary material and source code are available on our project page https://sites.google.com/view/kpam.

L. Manuelli and W. Gao—These authors contributed equally to this work.


References

  1. Andrychowicz, M., et al.: Learning dexterous in-hand manipulation. arXiv preprint arXiv:1808.00177 (2018)

  2. Curless, B., Levoy, M.: A volumetric method for building complex models from range images. In: Proceedings of the 23rd Annual Conference on Computer Graphics and Interactive Techniques, pp. 303–312. ACM (1996)

  3. Florence, P.R., Manuelli, L., Tedrake, R.: Dense object nets: learning dense visual object descriptors by and for robotic manipulation. In: Conference on Robot Learning (CoRL) (2018)

  4. Gao, W., Tedrake, R.: FilterReg: robust and efficient probabilistic point-set registration using Gaussian filter and twist parameterization. arXiv preprint arXiv:1811.10136 (2018)

  5. Gualtieri, M., ten Pas, A., Platt, R.: Pick and place without geometric object models. In: 2018 IEEE International Conference on Robotics and Automation (ICRA), pp. 7433–7440. IEEE (2018)

  6. Gualtieri, M., ten Pas, A., Saenko, K., Platt, R.: High precision grasp pose detection in dense clutter. In: 2016 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 598–605. IEEE (2016)

  7. He, K., Gkioxari, G., Dollár, P., Girshick, R.: Mask R-CNN. In: Proceedings of the IEEE International Conference on Computer Vision (ICCV), pp. 2961–2969 (2017)

  8. Kalashnikov, D., et al.: QT-Opt: scalable deep reinforcement learning for vision-based robotic manipulation. arXiv preprint arXiv:1806.10293 (2018)

  9. Mahler, J., et al.: Learning ambidextrous robot grasping policies. Sci. Robot. 4(26), eaau4984 (2019)

  10. Mahler, J., et al.: Dex-Net 1.0: a cloud-based network of 3D objects for robust grasp planning using a multi-armed bandit model with correlated rewards. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1957–1964. IEEE (2016)

  11. Maitin-Shepard, J., Cusumano-Towner, M., Lei, J., Abbeel, P.: Cloth grasp point detection based on multiple-view geometric cues with application to robotic towel folding. In: 2010 IEEE International Conference on Robotics and Automation (ICRA), pp. 2308–2315. IEEE (2010)

  12. Marion, P., Florence, P.R., Manuelli, L., Tedrake, R.: LabelFusion: a pipeline for generating ground truth labels for real RGBD data of cluttered scenes. arXiv preprint arXiv:1707.04796 (2017)

  13. Miller, S., Fritz, M., Darrell, T., Abbeel, P.: Parametrized shape models for clothing. In: 2011 IEEE International Conference on Robotics and Automation (ICRA), pp. 4861–4868. IEEE (2011)

  14. Miller, S., van den Berg, J., Fritz, M., Darrell, T., Goldberg, K., Abbeel, P.: A geometric approach to robotic laundry folding. Int. J. Robot. Res. 31(2), 249–267 (2012)

  15. Myronenko, A., Song, X.: Point set registration: coherent point drift. IEEE Trans. Pattern Anal. Mach. Intell. 32(12), 2262–2275 (2010)

  16. Rodriguez, D., Cogswell, C., Koo, S., Behnke, S.: Transferring grasping skills to novel instances by latent space non-rigid registration. In: 2018 IEEE International Conference on Robotics and Automation (ICRA), pp. 1–8. IEEE (2018)

  17. Sahin, C., Kim, T.K.: Category-level 6D object pose recovery in depth images. In: Leal-Taixé, L., Roth, S. (eds.) ECCV 2018. LNCS, vol. 11129, pp. 665–681. Springer, Cham (2019). https://doi.org/10.1007/978-3-030-11009-3_41

  18. Schmidt, T., Newcombe, R., Fox, D.: Self-supervised visual descriptor learning for dense correspondence. IEEE Robot. Autom. Lett. 2(2), 420–427 (2017)

  19. Schwarz, M., et al.: Fast object learning and dual-arm coordination for cluttered stowing, picking, and packing. In: 2018 IEEE International Conference on Robotics and Automation (ICRA), pp. 3347–3354. IEEE (2018)

  20. Seita, D., et al.: Robot bed-making: deep transfer learning using depth sensing of deformable fabric. arXiv preprint arXiv:1809.09810 (2018)

  21. Sun, X., Xiao, B., Wei, F., Liang, S., Wei, Y.: Integral human pose regression. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 529–545 (2018)

  22. Tedrake, R., the Drake Development Team: Drake: a planning, control, and analysis toolbox for nonlinear dynamical systems (2016)

  23. Tremblay, J., To, T., Sundaralingam, B., Xiang, Y., Fox, D., Birchfield, S.: Deep object pose estimation for semantic robotic grasping of household objects. In: Conference on Robot Learning (CoRL) (2018)

  24. van den Berg, J., Miller, S., Goldberg, K., Abbeel, P.: Gravity-based robotic cloth folding. In: Hsu, D., Isler, V., Latombe, J.C., Lin, M.C. (eds.) Algorithmic Foundations of Robotics IX. Springer Tracts in Advanced Robotics, vol. 68, pp. 409–424. Springer, Heidelberg (2010). https://doi.org/10.1007/978-3-642-17452-0_24

  25. Wang, H., Sridhar, S., Huang, J., Valentin, J., Song, S., Guibas, L.J.: Normalized object coordinate space for category-level 6D object pose and size estimation. arXiv preprint arXiv:1901.02970 (2019)

  26. Xiang, Y., Mottaghi, R., Savarese, S.: Beyond PASCAL: a benchmark for 3D object detection in the wild. In: IEEE Winter Conference on Applications of Computer Vision (WACV), pp. 75–82. IEEE (2014)

  27. Zeng, A., et al.: Robotic pick-and-place of novel objects in clutter with multi-affordance grasping and cross-domain image matching. In: 2018 IEEE International Conference on Robotics and Automation (ICRA), pp. 1–8. IEEE (2018)

  28. Zeng, A., et al.: Multi-view self-supervised deep learning for 6D pose estimation in the Amazon Picking Challenge. In: 2017 IEEE International Conference on Robotics and Automation (ICRA), pp. 1383–1386. IEEE (2017)


Acknowledgements

The authors thank Ethan Weber (instance segmentation training data generation) and Pat Marion (visualization) for their help. This work was supported by: National Science Foundation, Award No. IIS-1427050; Draper Laboratory Incorporated, Award No. SC001-0000001002; Lockheed Martin Corporation, Award No. RPP2016-002; Amazon Research Award.

Author information

Corresponding author: Lucas Manuelli.

Electronic Supplementary Material

Below is the link to the electronic supplementary material.

Supplementary material 1 (zip 8205 KB)

Appendices

A Robot Hardware

Our experimental setup consists of a robot arm, an end-effector-mounted RGBD camera and a parallel-jaw gripper. Our robot is a 7-DOF Kuka IIWA LBR. Mounted on the end-effector is a Schunk WSG 50 parallel-jaw gripper. Additionally, we mount a Primesense Carmine 1.09 RGBD sensor to the gripper body.

B Dataset Generation and Annotation

In order to reduce the human annotation time required for neural network training, we use a data collection pipeline similar to that used in [3]. The main idea is to collect many RGBD images of a static scene and perform a dense 3D reconstruction. Then, similarly to [12], we can label the 3D reconstruction and propagate these labels back to the individual RGBD frames. This 3D-to-2D labelling approach allows us to generate over 100,000 labelled images with only a few hours of human annotation time.

B.1 3D Reconstruction and Masking

Here we give a brief overview of the approach used to generate the 3D reconstruction; more details can be found in [3]. Our data is made up of 3D reconstructions of a static scene containing a single object of interest. Using the wrist-mounted camera on the robot, we move the robot's end-effector to capture a variety of RGBD images of the static scene. From the robot's forward kinematics we know the camera pose corresponding to each image, which allows us to use TSDF fusion [2] to obtain a dense 3D reconstruction. After discarding images that were taken from very similar poses, we are left with approximately 400 RGBD images per scene.
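As a concrete illustration of this fusion step, the minimal sketch below integrates posed RGBD frames into a TSDF volume using Open3D. The paper does not specify which TSDF implementation was used; the intrinsics, voxel size, truncation distance and the `frames` iterable are illustrative assumptions.

```python
import numpy as np
import open3d as o3d

# Assumed intrinsics for a Primesense-class RGBD sensor (illustration only).
intrinsic = o3d.camera.PinholeCameraIntrinsic(640, 480, 525.0, 525.0, 319.5, 239.5)

# Voxel size and truncation distance are illustrative choices, not the paper's values.
volume = o3d.pipelines.integration.ScalableTSDFVolume(
    voxel_length=0.004,
    sdf_trunc=0.02,
    color_type=o3d.pipelines.integration.TSDFVolumeColorType.RGB8)

def fuse_scene(frames):
    """frames: iterable of (rgb_path, depth_path, T_world_camera) tuples, where
    T_world_camera is the 4x4 camera pose obtained from the robot's forward kinematics."""
    for rgb_path, depth_path, T_world_camera in frames:
        color = o3d.io.read_image(rgb_path)
        depth = o3d.io.read_image(depth_path)
        rgbd = o3d.geometry.RGBDImage.create_from_color_and_depth(
            color, depth, depth_scale=1000.0, depth_trunc=1.5,
            convert_rgb_to_intensity=False)
        # integrate() expects a world-to-camera extrinsic, so invert the FK pose.
        volume.integrate(rgbd, intrinsic, np.linalg.inv(T_world_camera))
    return volume.extract_triangle_mesh()
```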

The next step is to detect which parts of the 3D reconstruction correspond to the object of interest. This is done using the change detection method described in [2]. In our particular setup all the reconstructions were of a tabletop scene in front of the robot. Since our reconstructions are globally aligned (because we use the robot's forward kinematics to compute camera poses), we can simply crop the 3D reconstruction to the area above the table. At this point we have the portion of the 3D reconstruction that corresponds to the object of interest. This, together with the known camera poses, allows us to easily render binary masks (which segment the object from the background) for each RGBD image.
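A minimal sketch of this cropping and mask-rendering step, assuming the fused reconstruction is available as a point cloud in the robot frame: points above the table are kept, transformed into each camera frame with the FK-derived pose, and splatted into the image plane. The point-splat approximation, the table-height threshold and the splat radius are assumptions, not the paper's exact rendering pipeline.

```python
import numpy as np

def crop_to_object(points_world, table_height, margin=0.01):
    """Keep only the reconstructed points above the tabletop (the object of interest).
    table_height and margin are in meters and are illustrative values."""
    return points_world[points_world[:, 2] > table_height + margin]

def render_object_mask(object_points_world, T_world_camera, K, image_shape, splat_px=2):
    """Render a binary foreground mask for one RGBD frame by projecting the cropped
    object points into the image; a simple point-splat stand-in for true mesh rendering."""
    H, W = image_shape
    T_camera_world = np.linalg.inv(T_world_camera)
    pts = (T_camera_world[:3, :3] @ object_points_world.T + T_camera_world[:3, 3:4]).T
    pts = pts[pts[:, 2] > 0]                       # keep points in front of the camera
    uvw = (K @ pts.T).T
    u = np.round(uvw[:, 0] / uvw[:, 2]).astype(int)
    v = np.round(uvw[:, 1] / uvw[:, 2]).astype(int)
    valid = (u >= 0) & (u < W) & (v >= 0) & (v < H)
    mask = np.zeros((H, W), dtype=bool)
    for du in range(-splat_px, splat_px + 1):      # dilate each projected point slightly
        for dv in range(-splat_px, splat_px + 1):
            mask[np.clip(v[valid] + dv, 0, H - 1), np.clip(u[valid] + du, 0, W - 1)] = True
    return mask
```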

B.2 Instance Segmentation

The instance segmentation network requires training images with pixelwise semantic labels. Using the background subtraction technique detailed in Sect. B.1, we have pixelwise labels for all the images in our 3D reconstructions. However, these images contain only a single object, while we need the instance segmentation network to handle multiple instances at test time. Thus, we augment the training data by creating multi-object composite images from our single-object annotated images using a method similar to [19]. We crop the object from one image (using the binary mask described in Sect. B.1) and paste this cropped section on top of an existing background. This process can be repeated to generate composite images with arbitrary numbers of objects, as sketched below. Examples of such images are shown in Fig. 8.
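The sketch below illustrates this compositing idea under stated assumptions: a single masked object crop is pasted onto a background at a random location while an instance label map is updated. The function name, the label encoding and the random placement scheme are assumptions for illustration, not the exact procedure of [19].

```python
import numpy as np

def paste_object(background, labels, object_rgb, object_mask, instance_id, rng):
    """Paste one masked object crop onto the background at a random location and record
    its instance id in the label image. Call repeatedly to build multi-object composites."""
    ys, xs = np.where(object_mask)
    y0, y1, x0, x1 = ys.min(), ys.max() + 1, xs.min(), xs.max() + 1
    crop_rgb = object_rgb[y0:y1, x0:x1]
    crop_mask = object_mask[y0:y1, x0:x1]
    H, W = background.shape[:2]
    h, w = crop_mask.shape
    top = rng.integers(0, H - h)
    left = rng.integers(0, W - w)
    background[top:top + h, left:left + w][crop_mask] = crop_rgb[crop_mask]
    labels[top:top + h, left:left + w][crop_mask] = instance_id
    return background, labels

# Usage sketch: start from a background image and an all-zero label map, then paste
# several single-object crops to obtain one composite training image.
# rng = np.random.default_rng(0)
# for i, (rgb, mask) in enumerate(single_object_crops, start=1):
#     composite, labels = paste_object(composite, labels, rgb, mask, i, rng)
```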

Fig. 8. Multi-object composite images used in instance segmentation training.

Fig. 9. A screenshot from our custom keypoint annotation tool.

B.3 Keypoint Detection

The keypoint detection network requires training images annotated with pixel coordinates and depth for each keypoint. As mentioned in Sect. 3.2, we annotate 3D keypoints on the reconstructed mesh, transform the keypoints into the camera frame and project the keypoints into each image. To annotate the 3D keypoints on the reconstructed mesh, we developed a custom labelling tool based on the Director [2] user interface, shown in Fig. 9. We labelled a total of 117 scenes, 43 of which were shoes and 74 of which were mugs. Annotating these scenes took only a few hours and resulted in over 100,000 labelled images for keypoint network training.
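A minimal sketch of the projection step described above, assuming the annotated keypoints are expressed in the same frame as the reconstruction and a standard pinhole camera model; the function name and array shapes are assumptions.

```python
import numpy as np

def project_keypoints(keypoints_world, T_world_camera, K):
    """Transform 3D keypoints annotated on the reconstructed mesh (world frame) into a
    camera frame and project them to pixel coordinates, keeping each keypoint's depth.
    keypoints_world: (N, 3), T_world_camera: 4x4 FK camera pose, K: 3x3 intrinsics."""
    T_camera_world = np.linalg.inv(T_world_camera)
    kp_cam = (T_camera_world[:3, :3] @ keypoints_world.T + T_camera_world[:3, 3:4]).T
    uvw = (K @ kp_cam.T).T
    pixels = uvw[:, :2] / uvw[:, 2:3]   # (u, v) pixel coordinates per keypoint
    depths = kp_cam[:, 2]               # keypoint depth in the camera frame
    return pixels, depths
```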

C Neural Network Architecture and Training

C.1 Instance Segmentation

For the instance segmentation, we used an open-source Mask R-CNN implementation [2]. We used an R-101-FPN backbone that was pretrained on the COCO dataset [2]. We then fine-tuned on a dataset of 10,000 images generated using the procedure outlined in Sect. B.2. The network was trained for 40,000 iterations using the default training schedule of [2].
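Purely to illustrate the fine-tuning step, the sketch below sets up torchvision's COCO-pretrained Mask R-CNN with heads resized for a small custom label set. Note this uses a ResNet-50-FPN backbone rather than the R-101-FPN used in the paper, and the class count, optimizer settings and training loop are assumptions, not the paper's schedule.

```python
import torch
import torchvision
from torchvision.models.detection.faster_rcnn import FastRCNNPredictor
from torchvision.models.detection.mask_rcnn import MaskRCNNPredictor

num_classes = 3  # background + shoe + mug (assumed label set)

# COCO-pretrained Mask R-CNN; a ResNet-50-FPN substitute for the paper's R-101-FPN.
model = torchvision.models.detection.maskrcnn_resnet50_fpn(weights="DEFAULT")

# Replace the box and mask heads so they predict the new classes.
in_features = model.roi_heads.box_predictor.cls_score.in_features
model.roi_heads.box_predictor = FastRCNNPredictor(in_features, num_classes)
in_channels_mask = model.roi_heads.mask_predictor.conv5_mask.in_channels
model.roi_heads.mask_predictor = MaskRCNNPredictor(in_channels_mask, 256, num_classes)

optimizer = torch.optim.SGD(model.parameters(), lr=0.005, momentum=0.9, weight_decay=5e-4)

def train_step(images, targets):
    """One fine-tuning step on composite images; targets follow torchvision's
    detection format (boxes, labels, masks)."""
    model.train()
    loss_dict = model(images, targets)
    loss = sum(loss_dict.values())
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return float(loss)
```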

Fig. 10. 3D visualization of the point cloud and keypoint detections for the image from (a). The keypoints are colored as in Fig. 6: the top center keypoint is green, the bottom center keypoint is red, and the handle center keypoint is purple.

C.2 Keypoint Detection

We modify the integral network [21] for 3D keypoint detection. The network takes as input images cropped by the bounding box from Mask R-CNN. The network produces a probability distribution map \(g_i(u, v)\) that represents how likely keypoint i is to occur at pixel \((u, v)\), with \(\sum_{u,v} g_i(u,v) = 1\). We then compute the expected value of this spatial distribution to recover the pixel coordinates of keypoint i (Fig. 10):

$$[u_i, v_i]^T = \sum_{u, v} \left[\, u \cdot g_i(u,v),\; v \cdot g_i(u,v) \,\right]^T \qquad (8)$$

For the z coordinate (depth) of each keypoint, we also predict a per-pixel depth value, denoted \(d_i(u,v)\). The depth of keypoint i is then computed as

$$z_i = \sum_{u, v} d_i(u,v) \cdot g_i(u,v) \qquad (9)$$

Given training images with annotated pixel coordinates and depth for each keypoint, we use the integral loss and heatmap regression loss (see Sect. 2 of [21] for details) to train the network. We use a network with a 34-layer ResNet backbone. The network is trained on a dataset generated using the procedure described in Sect. B.3.
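A minimal sketch of the expectation computations in Eqs. (8) and (9), assuming the network outputs one heatmap and one per-pixel depth map per keypoint; the tensor shapes and function name are assumptions.

```python
import torch
import torch.nn.functional as F

def integral_keypoints(heatmap_logits, depth_maps):
    """Soft-argmax keypoint regression following Eqs. (8) and (9).
    heatmap_logits: (K, H, W) unnormalized scores for each keypoint.
    depth_maps:     (K, H, W) per-pixel depth predictions d_i(u, v)."""
    num_kp, H, W = heatmap_logits.shape
    # Normalize each heatmap so that sum_{u,v} g_i(u, v) = 1.
    g = F.softmax(heatmap_logits.view(num_kp, -1), dim=1).view(num_kp, H, W)
    us = torch.arange(W, dtype=g.dtype, device=g.device).view(1, 1, W)
    vs = torch.arange(H, dtype=g.dtype, device=g.device).view(1, H, 1)
    u_i = (g * us).sum(dim=(1, 2))          # Eq. (8): expected u coordinate
    v_i = (g * vs).sum(dim=(1, 2))          # Eq. (8): expected v coordinate
    z_i = (g * depth_maps).sum(dim=(1, 2))  # Eq. (9): expected depth
    return u_i, v_i, z_i
```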

D Experiments

Figures 11, 12 and 13 illustrate the results of the experiments. These figures contain tiled images showing the initial RGB image used for keypoint detection, along with an image of the object after running the kPAM pipeline. In the following sections we discuss more details of the mug-on-shelf and mug-on-rack experiments.

Fig. 11. Before and after images of the shoe on rack experiment for all 100 trials.

Fig. 12. Before and after images of the mug on rack experiments for all 120 trials.

Fig. 13. Before and after images of the mug on shelf experiments for all 118 trials.

D.1 Mugs Upright on Shelf

Results for the mug on shelf experiment are detailed in Fig. 7. A trial was classified as a success if the mug ended up upright on the shelf with its bottom center keypoint within 5 cm of the target location. Out of 118 trials we experienced 2 failures. One failure was due to a combination of inaccurate keypoint detections together with the mug being torqued as it was grasped. Since we only have a wrist-mounted camera, we cannot re-perceive the object to compensate for the fact that the object moves during the grasping process. As discussed in Sect. 6, this could be alleviated by adding an externally mounted camera.
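As an illustration of this success criterion, a hypothetical check might look like the following. The 5 cm distance threshold is taken from the text; the function name, inputs and the angular tolerance used to decide uprightness from the bottom-to-top keypoint axis are assumptions.

```python
import numpy as np

def mug_on_shelf_success(bottom_kp, top_kp, target_xyz, dist_tol=0.05, axis_tol_deg=30.0):
    """Hypothetical success check: the bottom center keypoint must land within 5 cm of the
    target location, and the mug axis (bottom -> top keypoint) must point roughly upward.
    The 30-degree axis tolerance is an assumption for illustration."""
    close_enough = np.linalg.norm(np.asarray(bottom_kp) - np.asarray(target_xyz)) < dist_tol
    axis = np.asarray(top_kp) - np.asarray(bottom_kp)
    axis = axis / np.linalg.norm(axis)
    upright = np.degrees(np.arccos(np.clip(axis[2], -1.0, 1.0))) < axis_tol_deg
    return close_enough and upright
```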

The other failure resulted from the mug being placed upside down. Figure 14 shows the RGB image used for keypoint detection, along with the final position of the mug. As discussed in Sect. 5.2, this failure occurred because the keypoint detection confused the top and bottom of the mug. Given that the image was taken from a side view where the handle is occluded and it is difficult to distinguish the top from the bottom, it is understandable that the keypoint detection failed in this case. There are several ways to deal with this type of issue in the future. One approach would be to additionally predict a confidence value for each keypoint detection. This would allow us to detect that we were uncertain about the keypoint detections in Fig. 14(a). We could then move the robot and collect another image that would allow us to unambiguously detect the keypoints.

Fig. 14. (a) The RGB image for the single failure trial of the mug on shelf task that led to the mug being put in an incorrect orientation. In this case the keypoint detection confused the top and bottom of the mug and it was placed upside down. (b) The resulting upside-down placement of the mug.

Fig. 15. The five mugs on the left are the test mugs used in the experiments that were characterized as small. For comparison, the four mugs on the right are part of the regular category.

D.2 Hang Mug on Rack by Its Handle

As discussed in Sect. 5.3, mugs were divided into two groups, regular and small, based on their size. A mug was characterized as small if its handle had a minimum dimension (either height or width) of less than 2 cm. Examples of mugs from each category are shown in Fig. 15. Mugs with such small handles presented a challenge for our manipulation pipeline, since hanging them on the rack requires increased precision.


Copyright information

© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper


Cite this paper

Manuelli, L., Gao, W., Florence, P., Tedrake, R. (2022). KPAM: KeyPoint Affordances for Category-Level Robotic Manipulation. In: Asfour, T., Yoshida, E., Park, J., Christensen, H., Khatib, O. (eds) Robotics Research. ISRR 2019. Springer Proceedings in Advanced Robotics, vol 20. Springer, Cham. https://doi.org/10.1007/978-3-030-95459-8_9
