A Multiview Approach to Learning Articulated Motion Models

  • Conference paper

Part of the book series: Springer Proceedings in Advanced Robotics ((SPAR,volume 10))

Abstract

In order for robots to operate effectively in homes and workplaces, they must be able to manipulate the articulated objects common within environments built for and by humans. Kinematic models provide a concise representation of these objects that enables deliberate, generalizable manipulation policies. However, existing approaches to learning these models rely upon visual observations of an object’s motion, and are subject to the effects of occlusions and feature sparsity. Natural language descriptions provide a flexible and efficient means by which humans can provide complementary information in a weakly supervised manner suitable for a variety of different interactions (e.g., demonstrations and remote manipulation). In this paper, we present a multimodal learning framework that incorporates both vision and language information acquired in situ to estimate the structure and parameters that define kinematic models of articulated objects. The visual signal takes the form of an RGB-D image stream that opportunistically captures object motion in an unprepared scene. Accompanying natural language descriptions of the motion constitute the linguistic signal. We model linguistic information using a probabilistic graphical model that grounds natural language descriptions to their referent kinematic motion. By exploiting the complementary nature of the vision and language observations, our method infers correct kinematic models for various multiple-part objects on which the previous state-of-the-art, vision-only system fails. We evaluate our multimodal learning framework on a dataset comprising a variety of household objects, and demonstrate a 23% improvement in model accuracy over the vision-only baseline.
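
As a concrete illustration of the complementarity the abstract describes, the sketch below fuses per-modality evidence over candidate joint types. It is a hypothetical simplification, not the paper's graphical model: it assumes each modality supplies a log-likelihood for every candidate type and combines them under a conditional-independence assumption, and every name and score in it is invented for illustration.

    import numpy as np

    # Candidate joint types for a pair of object parts (illustrative set).
    MODEL_TYPES = ["rigid", "prismatic", "revolute"]

    def fuse_model_posterior(vision_loglik, language_loglik):
        """Posterior over joint types, fusing two modalities.

        Each argument maps a joint type to the log-likelihood that the
        corresponding observation (RGB-D motion / language description)
        arose from that type; fusion assumes conditional independence.
        """
        scores = np.array([vision_loglik[m] + language_loglik[m] for m in MODEL_TYPES])
        post = np.exp(scores - scores.max())  # subtract max for numerical stability
        return dict(zip(MODEL_TYPES, post / post.sum()))

    # Occlusion leaves vision ambiguous between prismatic and revolute,
    # but a description like "the drawer slides out" grounds strongly
    # to a prismatic joint and resolves the ambiguity.
    vision = {"rigid": -12.0, "prismatic": -3.1, "revolute": -3.0}
    language = {"rigid": -9.0, "prismatic": -0.5, "revolute": -4.0}
    print(fuse_model_posterior(vision, language))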


Notes

  1. We employ RANSAC [11] to improve robustness to erroneous correspondences.
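
For context, the sketch below shows one standard way RANSAC can reject erroneous 3D feature correspondences while estimating the rigid motion of an object part between two frames. It is a minimal illustration assuming NumPy and metric point coordinates; the function names, sample size, iteration count, and the 1 cm inlier threshold are illustrative choices, not the authors' implementation.

    import numpy as np

    def fit_rigid(src, dst):
        # Kabsch: least-squares R, t such that dst ≈ src @ R.T + t.
        cs, cd = src.mean(axis=0), dst.mean(axis=0)
        H = (src - cs).T @ (dst - cd)
        U, _, Vt = np.linalg.svd(H)
        d = np.sign(np.linalg.det(Vt.T @ U.T))  # guard against reflections
        R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
        return R, cd - R @ cs

    def ransac_rigid(src, dst, iters=500, thresh=0.01, seed=0):
        # src, dst: (N, 3) matched feature positions in two frames.
        rng = np.random.default_rng(seed)
        best = np.zeros(len(src), dtype=bool)
        for _ in range(iters):
            idx = rng.choice(len(src), size=3, replace=False)  # minimal sample
            R, t = fit_rigid(src[idx], dst[idx])
            inliers = np.linalg.norm(src @ R.T + t - dst, axis=1) < thresh
            if inliers.sum() > best.sum():
                best = inliers
        R, t = fit_rigid(src[best], dst[best])  # refit on the consensus set
        return R, t, best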

References

  1. Alayrac, J.B., Sivic, J., Laptev, I., Lacoste-Julien, S.: Joint discovery of object states and manipulation actions (2017). arXiv:1702.02738

  2. Antol, S., Agrawal, A., Lu, J., Mitchell, M., Batra, D., Zitnick, C.L., Parikh, D.: VQA: visual question answering (2015). arXiv:1505.00468

  3. Argall, B.D., Chernova, S., Veloso, M., Browning, B.: A survey of robot learning from demonstration. Robot. Auton. Syst. (2009)

  4. Artzi, Y., Zettlemoyer, L.: Weakly supervised learning of semantic parsers for mapping instructions to actions. Trans. Assoc. Comput. Linguist. (2013)

  5. Bouguet, J.Y.: Pyramidal implementation of the affine Lucas-Kanade feature tracker: description of the algorithm. Intel Corp. (2001)

  6. Byravan, A., Fox, D.: SE3-Nets: learning rigid body motion using deep neural networks. In: Proceedings of ICRA (2017)

  7. Chen, D.L., Mooney, R.J.: Learning to interpret natural language navigation instructions from observations. In: Proceedings of AAAI (2011)

  8. Donahue, J., Hendricks, L.A., Guadarrama, S., Rohrbach, M., Venugopalan, S., Saenko, K., Darrell, T.: Long-term recurrent convolutional networks for visual recognition and description (2014). arXiv:1411.4389

  9. Duvallet, F., Walter, M.R., Howard, T., Hemachandra, S., Oh, J., Teller, S., Roy, N., Stentz, A.: Inferring maps and behaviors from natural language instructions. In: Proceedings of ISER (2014)

  10. Ester, M., Kriegel, H.P., Sander, J., Xu, X.: A density-based algorithm for discovering clusters in large spatial databases with noise. In: Proceedings of KDD (1996)

  11. Fischler, M.A., Bolles, R.C.: Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography. Commun. ACM (1981)

  12. Guadarrama, S., Riano, L., Golland, D., Gohring, D., Jia, Y., Klein, D., Abbeel, P., Darrell, T.: Grounding spatial relations for human-robot interaction. In: Proceedings of IROS (2013)

  13. Guadarrama, S., Rodner, E., Saenko, K., Zhang, N., Farrell, R., Donahue, J., Darrell, T.: Open-vocabulary object retrieval. In: Proceedings of RSS (2014)

  14. Hausman, K., Niekum, S., Ostenoski, S., Sukhatme, G.S.: Active articulation model estimation through interactive perception. In: Proceedings of ICRA (2015)

  15. Hemachandra, S., Walter, M.R., Tellex, S., Teller, S.: Learning spatially-semantic representations from natural language descriptions and scene classifications. In: Proceedings of ICRA (2014)

  16. Hemachandra, S., Duvallet, F., Howard, T.M., Roy, N., Stentz, A., Walter, M.R.: Learning models for following natural language directions in unknown environments. In: Proceedings of ICRA (2015)

  17. Howard, T.M., Tellex, S., Roy, N.: A natural language planner interface for mobile manipulators. In: Proceedings of ICRA (2014)

  18. Huang, X., Walker, I., Birchfield, S.: Occlusion-aware reconstruction and manipulation of 3D articulated objects. In: Proceedings of ICRA (2012)

  19. Jain, A., Kemp, C.C.: Pulling open doors and drawers: coordinating an omni-directional base and a compliant arm with equilibrium point control. In: Proceedings of ICRA (2010)

  20. Kaess, M., Ranganathan, A., Dellaert, F.: iSAM: incremental smoothing and mapping. Trans. Robot. (2008)

  21. Karpathy, A., Fei-Fei, L.: Deep visual-semantic alignments for generating image descriptions. In: Proceedings of CVPR (2015)

  22. Katz, D., Orthey, A., Brock, O.: Interactive perception of articulated objects. In: Proceedings of ISER (2010)

  23. Katz, D., Kazemi, M., Bagnell, J.A., Stentz, A.: Interactive segmentation, tracking, and kinematic modeling of unknown 3D articulated objects. In: Proceedings of ICRA (2013)

  24. Kollar, T., Tellex, S., Roy, D., Roy, N.: Toward understanding natural language directions. In: Proceedings of HRI (2010)

  25. Kollar, T., Krishnamurthy, J., Strimel, G.: Toward interactive grounded language acquisition. In: Proceedings of RSS (2013)

  26. Kong, C., Lin, D., Bansal, M., Urtasun, R., Fidler, S.: What are you talking about? Text-to-image coreference. In: Proceedings of CVPR (2014)

  27. Krishnamurthy, J., Kollar, T.: Jointly learning to parse and perceive: connecting natural language to the physical world. Trans. Assoc. Comput. Linguist. (2013)

  28. Lowe, D.G.: Distinctive image features from scale-invariant keypoints. Int. J. Comput. Vis. (2004)

  29. Malmaud, J., Huang, J., Rathod, V., Johnston, N., Rabinovich, A., Murphy, K.: What’s cookin’? Interpreting cooking videos using text, speech and vision (2015). arXiv:1503.01558

  30. Martín-Martín, R., Brock, O.: Online interactive perception of articulated objects with multi-level recursive estimation based on task-specific priors. In: Proceedings of IROS (2014)

  31. Martín-Martín, R., Höfer, S., Brock, O.: An integrated approach to visual perception of articulated objects. In: Proceedings of ICRA (2016)

  32. Matuszek, C., Fox, D., Koscher, K.: Following directions using statistical machine translation. In: Proceedings of HRI (2010)

  33. Matuszek, C., Herbst, E., Zettlemoyer, L., Fox, D.: Learning to parse natural language commands to a robot control system. In: Proceedings of ISER (2012)

  34. Mei, H., Bansal, M., Walter, M.R.: Listen, attend, and walk: neural mapping of navigational instructions to action sequences. In: Proceedings of AAAI (2016)

  35. Misra, D.K., Sung, J., Lee, K., Saxena, A.: Tell me Dave: context-sensitive grounding of natural language to manipulation instructions. Int. J. Robot. Res. (2016)

  36. Olson, E.: AprilTag: a robust and flexible visual fiducial system. In: Proceedings of ICRA (2011)

  37. Ordonez, V., Kulkarni, G., Berg, T.L.: Im2Text: describing images using 1 million captioned photographs. In: Proceedings of NIPS (2011)

  38. Paul, R., Arkin, J., Roy, N., Howard, T.M.: Efficient grounding of abstract spatial concepts for natural language interaction with robot manipulators. In: Proceedings of RSS (2016)

  39. Pillai, S., Walter, M.R., Teller, S.: Learning articulated motions from visual demonstration. In: Proceedings of RSS (2014)

  40. Pronobis, A., Jensfelt, P.: Large-scale semantic mapping and reasoning with heterogeneous modalities. In: Proceedings of ICRA (2012)

  41. Ramanathan, V., Joulin, A., Liang, P., Fei-Fei, L.: Linking people in videos with their names using coreference resolution. In: Proceedings of ECCV (2014)

  42. Schmidt, T., Newcombe, R., Fox, D.: DART: dense articulated real-time tracking. In: Proceedings of RSS (2014)

  43. Sener, O., Zamir, A.R., Savarese, S., Saxena, A.: Unsupervised semantic parsing of video collections. In: Proceedings of ICCV (2015)

  44. Srivastava, N., Mansimov, E., Salakhutdinov, R.: Unsupervised learning of video representations using LSTMs. In: Proceedings of ICML (2015)

  45. Sturm, J., Stachniss, C., Burgard, W.: A probabilistic framework for learning kinematic models of articulated objects. J. Artif. Intell. Res. (2011)

  46. Sung, J., Jin, S.H., Saxena, A.: Robobarista: object part-based transfer of manipulation trajectories from crowd-sourcing in 3D pointclouds. In: Proceedings of ISRR (2015)

  47. Tellex, S., Kollar, T., Dickerson, S., Walter, M.R., Banerjee, A.G., Teller, S., Roy, N.: Understanding natural language commands for robotic navigation and mobile manipulation. In: Proceedings of AAAI (2011)

  48. Vinyals, O., Toshev, A., Bengio, S., Erhan, D.: Show and tell: a neural image caption generator. In: Proceedings of CVPR (2015)

  49. Walter, M.R., Hemachandra, S., Homberg, B., Tellex, S., Teller, S.: Learning semantic maps from natural language descriptions. In: Proceedings of RSS (2013)

  50. Wang, H., Kläser, A., Schmid, C., Liu, C.L.: Action recognition by dense trajectories. In: Proceedings of CVPR (2011)

  51. Winograd, T.: Understanding natural language. Cogn. Psychol. (1972)

  52. Xu, K., Ba, J., Kiros, R., Cho, K., Courville, A., Salakhutdinov, R., Zemel, R., Bengio, Y.: Show, attend and tell: neural image caption generation with visual attention. In: Proceedings of ICML (2015)

  53. Yan, J., Pollefeys, M.: A general framework for motion segmentation: independent, articulated, rigid, non-rigid, degenerate and non-degenerate. In: Proceedings of ECCV (2006)

  54. Yang, Y., Li, Y., Fermüller, C., Aloimonos, Y.: Robot learning manipulation action plans by “watching” unconstrained videos from the world wide web. In: Proceedings of AAAI (2015)

  55. Yu, S.I., Jiang, L., Hauptmann, A.: Instructional videos for unsupervised harvesting and learning of action examples. In: Proceedings of International Conference on Multimedia (2014)

  56. Zender, H., Martínez Mozos, O., Jensfelt, P., Kruijff, G., Burgard, W.: Conceptual spatial representations for indoor mobile robots. Robot. Auton. Syst. (2008)

Acknowledgements

This work was supported in part by the National Science Foundation under grants IIS-1638072 and IIS-1637813, and by the Robotics Consortium of the U.S. Army Research Laboratory under the Collaborative Technology Alliance Program Cooperative Agreement W911NF-10-2-0016.

Author information

Corresponding author

Correspondence to Andrea F. Daniele.


Copyright information

© 2020 Springer Nature Switzerland AG

About this paper

Cite this paper

Daniele, A.F., Howard, T.M., Walter, M.R. (2020). A Multiview Approach to Learning Articulated Motion Models. In: Amato, N., Hager, G., Thomas, S., Torres-Torriti, M. (eds) Robotics Research. Springer Proceedings in Advanced Robotics, vol 10. Springer, Cham. https://doi.org/10.1007/978-3-030-28619-4_30
