Abstract
For robots to operate effectively in homes and workplaces, they must be able to manipulate the articulated objects common in environments built for and by humans. Kinematic models provide a concise representation of these objects that enables deliberate, generalizable manipulation policies. However, existing approaches to learning these models rely on visual observations of an object's motion and are therefore subject to the effects of occlusions and feature sparsity. Natural language descriptions provide a flexible and efficient means by which humans can provide complementary information in a weakly supervised manner suitable for a variety of different interactions (e.g., demonstrations and remote manipulation). In this paper, we present a multimodal learning framework that incorporates both vision and language information acquired in situ to estimate the structure and parameters that define kinematic models of articulated objects. The visual signal takes the form of an RGB-D image stream that opportunistically captures object motion in an unprepared scene. Accompanying natural language descriptions of the motion constitute the linguistic signal. We model the linguistic information using a probabilistic graphical model that grounds natural language descriptions to their referent kinematic motion. By exploiting the complementary nature of the vision and language observations, our method infers correct kinematic models for various multiple-part objects on which the previous state-of-the-art, vision-only system fails. We evaluate our multimodal learning framework on a dataset comprising a variety of household objects, and demonstrate a \(23\%\) improvement in model accuracy over the vision-only baseline.
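To make the kinematic-model estimation described above concrete, the sketch below fits competing prismatic (drawer-like) and revolute (door-like) joint models to the 3D trajectory of a single tracked feature and selects the model with the lower residual, in the spirit of the probabilistic framework of Sturm et al. [45]. This is a simplified, hypothetical illustration rather than the authors' implementation: the function names, the single-point trajectory, and raw-residual model selection (in place of a full probabilistic model comparison) are assumptions for exposition.

```python
# Illustrative sketch (not the paper's code): choosing between a prismatic
# and a revolute kinematic model for one tracked 3D point trajectory.
import numpy as np

def fit_prismatic(points):
    """Fit a 3D line (prismatic joint axis) by PCA; return the mean residual."""
    centroid = points.mean(axis=0)
    _, _, vt = np.linalg.svd(points - centroid)
    axis = vt[0]                      # dominant direction of travel
    proj = centroid + np.outer((points - centroid) @ axis, axis)
    return np.linalg.norm(points - proj, axis=1).mean()

def fit_revolute(points):
    """Fit a circle (revolute joint) in the best-fit plane; return the mean residual."""
    centroid = points.mean(axis=0)
    _, _, vt = np.linalg.svd(points - centroid)
    u, v = vt[0], vt[1]               # orthonormal basis of the best-fit plane
    p2d = np.stack([(points - centroid) @ u, (points - centroid) @ v], axis=1)
    # Algebraic circle fit: x^2 + y^2 = 2*a*x + 2*b*y + c
    A = np.column_stack([2 * p2d, np.ones(len(p2d))])
    (a, b, c), *_ = np.linalg.lstsq(A, (p2d ** 2).sum(axis=1), rcond=None)
    radius = np.sqrt(c + a ** 2 + b ** 2)
    return np.abs(np.linalg.norm(p2d - [a, b], axis=1) - radius).mean()

def select_model(points):
    """Return the joint type whose fit leaves the smaller trajectory residual."""
    residuals = {"prismatic": fit_prismatic(points),
                 "revolute": fit_revolute(points)}
    return min(residuals, key=residuals.get), residuals

# Example: a door handle swinging through 60 degrees at radius 0.8 m.
theta = np.linspace(0.0, np.pi / 3, 50)
traj = np.stack([0.8 * np.cos(theta), 0.8 * np.sin(theta),
                 np.zeros_like(theta)], axis=1)
traj += np.random.default_rng(0).normal(scale=0.002, size=traj.shape)  # noise
print(select_model(traj))            # -> ('revolute', {...})
```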
Notes
1. We employ RANSAC [11] to improve robustness to erroneous correspondences.
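As a minimal, hypothetical sketch of the technique this note refers to, the code below uses RANSAC [11] to reject erroneous 3D feature correspondences while estimating a rigid transform between frames. The function names, minimal sample size, and inlier threshold are illustrative assumptions, not the paper's implementation.

```python
# Minimal RANSAC sketch: robustly estimate a rigid transform from putative
# 3D feature correspondences by discarding outliers (Fischler & Bolles [11]).
import numpy as np

def rigid_transform(src, dst):
    """Least-squares R, t with dst ~ src @ R.T + t (Kabsch algorithm)."""
    sc, dc = src.mean(axis=0), dst.mean(axis=0)
    U, _, Vt = np.linalg.svd((src - sc).T @ (dst - dc))
    D = np.diag([1.0, 1.0, np.sign(np.linalg.det(Vt.T @ U.T))])  # no reflections
    R = Vt.T @ D @ U.T
    return R, dc - R @ sc

def ransac_rigid(src, dst, iters=200, tol=0.01, rng=np.random.default_rng(0)):
    """Fit a rigid transform to correspondences src[i] -> dst[i] with RANSAC."""
    best_inliers = np.zeros(len(src), dtype=bool)
    for _ in range(iters):
        sample = rng.choice(len(src), size=3, replace=False)  # minimal set
        R, t = rigid_transform(src[sample], dst[sample])
        errors = np.linalg.norm(src @ R.T + t - dst, axis=1)
        inliers = errors < tol
        if inliers.sum() > best_inliers.sum():
            best_inliers = inliers
    # Refit on the consensus set for the final estimate.
    return rigid_transform(src[best_inliers], dst[best_inliers]), best_inliers
```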
References
Alayrac, J.B., Sivic, J., Laptev, I., Lacoste-Julien, S.: Joint discovery of object states and manipulation actions (2017). arXiv:1702.02738
Antol, S., Agrawal, A., Lu, J., Mitchell, M., Batra, D., Zitnick, C.L., Parikh, D.: VQA: visual question answering (2015). arXiv:1505.00468
Argall, B.D., Chernova, S., Veloso, M., Browning, B.: A survey of robot learning from demonstration. Robot. Auton. Syst. (2009)
Artzi, Y., Zettlemoyer, L.: Weakly supervised learning of semantic parsers for mapping instructions to actions. Trans. Assoc. Comput. Linguist. (2013)
Bouguet, J.Y.: Pyramidal implementation of the affine Lucas-Kanade feature tracker description of the algorithm. Intel Corp. (2001)
Byravan, A., Fox, D.: SE3-Nets: learning rigid body motion using deep neural networks. In: Proceedings of ICRA (2017)
Chen, D.L., Mooney, R.J.: Learning to interpret natural language navigation instructions from observations. In: Proceedings of AAAI (2011)
Donahue, J., Hendricks, L.A., Guadarrama, S., Rohrbach, M., Venugopalan, S., Saenko, K., Darrell, T.: Long-term recurrent convolutional networks for visual recognition and description (2014). arXiv:1411.4389
Duvallet, F., Walter, M.R., Howard, T., Hemachandra, S., Oh, J., Teller, S., Roy, N., Stentz, A.: Inferring maps and behaviors from natural language instructions. In: Proceedings of ISER (2014)
Ester, M., Kriegel, H.P., Sander, J., Xu, X.: A density-based algorithm for discovering clusters in large spatial databases with noise. In: Proceedings of KDD (1996)
Fischler, M.A., Bolles, R.C.: Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography. Commun. ACM (1981)
Guadarrama, S., Riano, L., Golland, D., Gohring, D., Jia, Y., Klein, D., Abbeel, P., Darrell, T.: Grounding spatial relations for human-robot interaction. In: Proceedings of IROS (2013)
Guadarrama, S., Rodner, E., Saenko, K., Zhang, N., Farrell, R., Donahue, J., Darrell, T.: Open-vocabulary object retrieval. In: Proceedings of RSS (2014)
Hausman, K., Niekum, S., Ostenoski, S., Sukhatme, G.S.: Active articulation model estimation through interactive perception. In: Proceedings of ICRA (2015)
Hemachandra, S., Walter, M.R., Tellex, S., Teller, S.: Learning spatially-semantic representations from natural language descriptions and scene classifications. In: Proceedings of ICRA (2014)
Hemachandra, S., Duvallet, F., Howard, T.M., Roy, N., Stentz, A., Walter, M.R.: Learning models for following natural language directions in unknown environments. In: Proceedings of ICRA (2015)
Howard, T.M., Tellex, S., Roy, N.: A natural language planner interface for mobile manipulators. In: Proceedings of ICRA (2014)
Huang, X., Walker, I., Birchfield, S.: Occlusion-aware reconstruction and manipulation of 3D articulated objects. In: Proceedings of ICRA (2012)
Jain, A., Kemp, C.C.: Pulling open doors and drawers: coordinating an omni-directional base and a compliant arm with equilibrium point control. In: Proceedings of ICRA (2010)
Kaess, M., Ranganathan, A., Dellaert, F.: iSAM: incremental smoothing and mapping. Trans. Robot. (2008)
Karpathy, A., Fei-Fei, L.: Deep visual-semantic alignments for generating image descriptions. In: Proceedings of CVPR (2015)
Katz, D., Orthey, A., Brock, O.: Interactive perception of articulated objects. In: Proceedings of ISER (2010)
Katz, D., Kazemi, M., Bagnell, J.A., Stentz, A.: Interactive segmentation, tracking, and kinematic modeling of unknown 3D articulated objects. In: Proceedings of ICRA (2013)
Kollar, T., Tellex, S., Roy, D., Roy, N.: Toward understanding natural language directions. In: Proceedings of HRI (2010)
Kollar, T., Krishnamurthy, J., Strimel, G.: Toward interactive grounded language acquisition. In: Proceedings of RSS (2013)
Kong, C., Lin, D., Bansal, M., Urtasun, R., Fidler, S.: What are you talking about? Text-to-image coreference. In: Proceedings of CVPR (2014)
Krishnamurthy, J., Kollar, T.: Jointly learning to parse and perceive: connecting natural language to the physical world. Trans. Assoc. Comput. Linguist. (2013)
Lowe, D.G.: Distinctive image features from scale-invariant keypoints. Int. J. Comput. Vis. (2004)
Malmaud, J., Huang, J., Rathod, V., Johnston, N., Rabinovich, A., Murphy, K.: What’s cookin’? Interpreting cooking videos using text, speech and vision (2015). arXiv:1503.01558
Martín-Martín, R., Brock, O.: Online interactive perception of articulated objects with multi-level recursive estimation based on task-specific priors. In: Proceedings of IROS (2014)
Martín-Martín, R., Höfer, S., Brock, O.: An integrated approach to visual perception of articulated objects. In: Proceedings of ICRA (2016)
Matuszek, C., Fox, D., Koscher, K.: Following directions using statistical machine translation. In: Proceedings of HRI (2010)
Matuszek, C., Herbst, E., Zettlemoyer, L., Fox, D.: Learning to parse natural language commands to a robot control system. In: Proceedings of ISER (2012)
Mei, H., Bansal, M., Walter, M.R.: Listen, attend, and walk: neural mapping of navigational instructions to action sequences. In: Proceedings of AAAI (2016)
Misra, D.K., Sung, J., Lee, K., Saxena, A.: Tell me Dave: context-sensitive grounding of natural language to manipulation instructions. Int. J. Robot. Res. (2016)
Olson, E.: AprilTag: a robust and flexible visual fiducial system. In: Proceedings of ICRA (2011)
Ordonez, V., Kulkarni, G., Berg, T.L.: Im2Text: describing images using 1 million captioned photographs. In: Proceedings of NIPS (2011)
Paul, R., Arkin, J., Roy, N., Howard, T.M.: Efficient grounding of abstract spatial concepts for natural language interaction with robot manipulators. In: Proceedings of RSS (2016)
Pillai, S., Walter, M.R., Teller, S.: Learning articulated motions from visual demonstration. In: Proceedings of RSS (2014)
Pronobis, A., Jensfelt, P.: Large-scale semantic mapping and reasoning with heterogeneous modalities. In: Proceedings of ICRA (2012)
Ramanathan, V., Joulin, A., Liang, P., Fei-Fei, L.: Linking people in videos with their names using coreference resolution. In: Proceedings of ECCV (2014)
Schmidt, T., Newcombe, R., Fox, D.: DART: dense articulated real-time tracking. In: Proceedings of RSS (2014)
Sener, O., Zamir, A.R., Savarese, S., Saxena, A.: Unsupervised semantic parsing of video collections. In: Proceedings of ICCV (2015)
Srivastava, N., Mansimov, E., Salakhutdinov, R.: Unsupervised learning of video representations using LSTMs. In: Proceedings of ICML (2015)
Sturm, J., Stachniss, C., Burgard, W.: A probabilistic framework for learning kinematic models of articulated objects. J. Artif. Intell. Res. (2011)
Sung, J., Jin, S.H., Saxena, A.: Robobarista: object part-based transfer of manipulation trajectories from crowd-sourcing in 3D pointclouds. In: Proceedings of ISRR (2015)
Tellex, S., Kollar, T., Dickerson, S., Walter, M.R., Banerjee, A.G., Teller, S., Roy, N.: Understanding natural language commands for robotic navigation and mobile manipulation. In: Proceedings of AAAI (2011)
Vinyals, O., Toshev, A., Bengio, S., Erhan, D.: Show and tell: a neural image caption generator. In: Proceedings of CVPR (2015)
Walter, M.R., Hemachandra, S., Homberg, B., Tellex, S., Teller, S.: Learning semantic maps from natural language descriptions. In: Proceedings of RSS (2013)
Wang, H., Kläser, A., Schmid, C., Liu, C.L.: Action recognition by dense trajectories. In: Proceedings of CVPR (2011)
Winograd, T.: Understanding natural language. Cogn. Psychol. (1972)
Xu, K., Ba, J., Kiros, R., Cho, K., Courville, A., Salakhutdinov, R., Zemel, R., Bengio, Y.: Show, attend and tell: neural image caption generation with visual attention. In: Proceedings of ICML (2015)
Yan, J., Pollefeys, M.: A general framework for motion segmentation: independent, articulated, rigid, non-rigid, degenerate and non-degenerate. In: Proceedings of ECCV (2006)
Yang, Y., Li, Y., Fermüller, C., Aloimonos, Y.: Robot learning manipulation action plans by “watching” unconstrained videos from the world wide web. In: Proceedings of AAAI (2015)
Yu, S.I., Jiang, L., Hauptmann, A.: Instructional videos for unsupervised harvesting and learning of action examples. In: Proceedings of International Conference on Multimedia (2014)
Zender, H., Martínez Mozos, O., Jensfelt, P., Kruijff, G., Burgard, W.: Conceptual spatial representations for indoor mobile robots. Robot. Auton. Syst. (2008)
Acknowledgements
This work was supported in part by the National Science Foundation under grants IIS-1638072 and IIS-1637813, and by the Robotics Consortium of the U.S. Army Research Laboratory under the Collaborative Technology Alliance Program Cooperative Agreement W911NF-10-2-0016.
© 2020 Springer Nature Switzerland AG
Cite this paper
Daniele, A.F., Howard, T.M., Walter, M.R. (2020). A Multiview Approach to Learning Articulated Motion Models. In: Amato, N., Hager, G., Thomas, S., Torres-Torriti, M. (eds) Robotics Research. Springer Proceedings in Advanced Robotics, vol 10. Springer, Cham. https://doi.org/10.1007/978-3-030-28619-4_30
DOI: https://doi.org/10.1007/978-3-030-28619-4_30
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-28618-7
Online ISBN: 978-3-030-28619-4
eBook Packages: Intelligent Technologies and Robotics