Abstract
For robots to operate effectively in homes and workplaces, they must be able to manipulate the articulated objects common in environments built for and by humans. Kinematic models provide a concise representation of these objects that enables deliberate, generalizable manipulation policies. However, existing approaches to learning these models rely on visual observations of an object's motion and are therefore subject to the effects of occlusions and feature sparsity. Natural language descriptions provide a flexible and efficient means by which humans can provide complementary information in a weakly supervised manner suitable for a variety of different interactions (e.g., demonstrations and remote manipulation). In this paper, we present a multimodal learning framework that incorporates both vision and language information acquired in situ to estimate the structure and parameters that define kinematic models of articulated objects. The visual signal takes the form of an RGB-D image stream that opportunistically captures object motion in an unprepared scene. Accompanying natural language descriptions of the motion constitute the linguistic signal. We model the linguistic information using a probabilistic graphical model that grounds natural language descriptions to their referent kinematic motion. By exploiting the complementary nature of the vision and language observations, our method infers correct kinematic models for various multiple-part objects on which the previous state-of-the-art, vision-only system fails. We evaluate our multimodal learning framework on a dataset comprising a variety of household objects, and demonstrate a \(23\%\) improvement in model accuracy over the vision-only baseline.
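To make the kinematic-model estimation described above concrete, the sketch below fits competing prismatic (drawer-like) and revolute (door-like) joint models to the 3D trajectory of a single tracked feature and selects the model with the lower residual, in the spirit of the probabilistic framework of Sturm et al. [45]. This is a simplified, hypothetical illustration rather than the authors' implementation: the function names, the single-point trajectory, and raw-residual model selection (in place of a full probabilistic model comparison) are assumptions for exposition.

```python
# Illustrative sketch (not the paper's code): choosing between a prismatic
# and a revolute kinematic model for one tracked 3D point trajectory.
import numpy as np

def fit_prismatic(points):
    """Fit a 3D line (prismatic joint axis) by PCA; return the mean residual."""
    centroid = points.mean(axis=0)
    _, _, vt = np.linalg.svd(points - centroid)
    axis = vt[0]                      # dominant direction of travel
    proj = centroid + np.outer((points - centroid) @ axis, axis)
    return np.linalg.norm(points - proj, axis=1).mean()

def fit_revolute(points):
    """Fit a circle (revolute joint) in the best-fit plane; return the mean residual."""
    centroid = points.mean(axis=0)
    _, _, vt = np.linalg.svd(points - centroid)
    u, v = vt[0], vt[1]               # orthonormal basis of the best-fit plane
    p2d = np.stack([(points - centroid) @ u, (points - centroid) @ v], axis=1)
    # Algebraic circle fit: x^2 + y^2 = 2*a*x + 2*b*y + c
    A = np.column_stack([2 * p2d, np.ones(len(p2d))])
    (a, b, c), *_ = np.linalg.lstsq(A, (p2d ** 2).sum(axis=1), rcond=None)
    radius = np.sqrt(c + a ** 2 + b ** 2)
    return np.abs(np.linalg.norm(p2d - [a, b], axis=1) - radius).mean()

def select_model(points):
    """Return the joint type whose fit leaves the smaller trajectory residual."""
    residuals = {"prismatic": fit_prismatic(points),
                 "revolute": fit_revolute(points)}
    return min(residuals, key=residuals.get), residuals

# Example: a door handle swinging through 60 degrees at radius 0.8 m.
theta = np.linspace(0.0, np.pi / 3, 50)
traj = np.stack([0.8 * np.cos(theta), 0.8 * np.sin(theta),
                 np.zeros_like(theta)], axis=1)
traj += np.random.default_rng(0).normal(scale=0.002, size=traj.shape)  # noise
print(select_model(traj))            # -> ('revolute', {...})
```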
Notes
1. We employ RANSAC [11] to improve robustness to erroneous correspondences.
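As a minimal, hypothetical sketch of the technique this note refers to, the code below uses RANSAC [11] to reject erroneous 3D feature correspondences while estimating a rigid transform between frames. The function names, minimal sample size, and inlier threshold are illustrative assumptions, not the paper's implementation.

```python
# Minimal RANSAC sketch: robustly estimate a rigid transform from putative
# 3D feature correspondences by discarding outliers (Fischler & Bolles [11]).
import numpy as np

def rigid_transform(src, dst):
    """Least-squares R, t with dst ~ src @ R.T + t (Kabsch algorithm)."""
    sc, dc = src.mean(axis=0), dst.mean(axis=0)
    U, _, Vt = np.linalg.svd((src - sc).T @ (dst - dc))
    D = np.diag([1.0, 1.0, np.sign(np.linalg.det(Vt.T @ U.T))])  # no reflections
    R = Vt.T @ D @ U.T
    return R, dc - R @ sc

def ransac_rigid(src, dst, iters=200, tol=0.01, rng=np.random.default_rng(0)):
    """Fit a rigid transform to correspondences src[i] -> dst[i] with RANSAC."""
    best_inliers = np.zeros(len(src), dtype=bool)
    for _ in range(iters):
        sample = rng.choice(len(src), size=3, replace=False)  # minimal set
        R, t = rigid_transform(src[sample], dst[sample])
        errors = np.linalg.norm(src @ R.T + t - dst, axis=1)
        inliers = errors < tol
        if inliers.sum() > best_inliers.sum():
            best_inliers = inliers
    # Refit on the consensus set for the final estimate.
    return rigid_transform(src[best_inliers], dst[best_inliers]), best_inliers
```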
References
Alayrac, J.B., Sivic, J., Laptev, I., Lacoste-Julien, S.: Joint discovery of object states and manipulation actions (2017). arXiv:1702.02738
Antol, S., Agrawal, A., Lu, J., Mitchell, M., Batra, D., Zitnick, C.L., Parikh, D.: VQA: visual question answering (2015). arXiv:1505.00468
Argall, B.D., Chernova, S., Veloso, M., Browning, B.: A survey of robot learning from demonstration. Robot. Auton. Syst. (2009)
Artzi, Y., Zettlemoyer, L.: Weakly supervised learning of semantic parsers for mapping instructions to actions. Trans. Assoc. Comput. Linguist. (2013)
Bouguet, J.Y.: Pyramidal implementation of the affine Lucas-Kanade feature tracker description of the algorithm. Intel Corp. (2001)
Byravan, A., Fox, D.: SE3-Nets: learning rigid body motion using deep neural networks. In: Proceedings of ICRA (2017)
Chen, D.L., Mooney, R.J.: Learning to interpret natural language navigation instructions from observations. In: Proceedings of AAAI (2011)
Donahue, J., Hendricks, L.A., Guadarrama, S., Rohrbach, M., Venugopalan, S., Saenko, K., Darrell, T.: Long-term recurrent convolutional networks for visual recognition and description (2014). arXiv:1411.4389
Duvallet, F., Walter, M.R., Howard, T., Hemachandra, S., Oh, J., Teller, S., Roy, N., Stentz, A.: Inferring maps and behaviors from natural language instructions. In: Proceedings of ISER (2014)
Ester, M., Kriegel, H.P., Sander, J., Xu, X.: A density-based algorithm for discovering clusters in large spatial databases with noise. In: Proceedings of KDD (1996)
Fischler, M.A., Bolles, R.C.: Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography. Commun. ACM (1981)
Guadarrama, S., Riano, L., Golland, D., Gohring, D., Jia, Y., Klein, D., Abbeel, P., Darrell, T.: Grounding spatial relations for human-robot interaction. In: Proceedings of IROS (2013)
Guadarrama, S., Rodner, E., Saenko, K., Zhang, N., Farrell, R., Donahue, J., Darrell, T.: Open-vocabulary object retrieval. In: Proceedings of RSS (2014)
Hausman, K., Niekum, S., Ostenoski, S., Sukhatme, G.S.: Active articulation model estimation through interactive perception. In: Proceedings of ICRA (2015)
Hemachandra, S., Walter, M.R., Tellex, S., Teller, S.: Learning spatially-semantic representations from natural language descriptions and scene classifications. In: Proceedings of ICRA (2014)
Hemachandra, S., Duvallet, F., Howard, T.M., Roy, N., Stentz, A., Walter, M.R.: Learning models for following natural language directions in unknown environments. In: Proceedings of ICRA (2015)
Howard, T.M., Tellex, S., Roy, N.: A natural language planner interface for mobile manipulators. In: Proceedings of ICRA (2014)
Huang, X., Walker, I., Birchfield, S.: Occlusion-aware reconstruction and manipulation of 3D articulated objects. In: Proceedings of ICRA (2012)
Jain, A., Kemp, C.C.: Pulling open doors and drawers: coordinating an omni-directional base and a compliant arm with equilibrium point control. In: Proceedings of ICRA (2010)
Kaess, M., Ranganathan, A., Dellaert, F.: iSAM: incremental smoothing and mapping. Trans. Robot. (2008)
Karpathy, A., Fei-Fei, L.: Deep visual-semantic alignments for generating image descriptions. In: Proceedings of CVPR (2015)
Katz, D., Orthey, A., Brock, O.: Interactive perception of articulated objects. In: Proceedings of ISER (2010)
Katz, D., Kazemi, M., Bagnell, J.A., Stentz, A.: Interactive segmentation, tracking, and kinematic modeling of unknown 3D articulated objects. In: Proceedings of ICRA (2013)
Kollar, T., Tellex, S., Roy, D., Roy, N.: Toward understanding natural language directions. In: Proceedings of HRI (2010)
Kollar, T., Krishnamurthy, J., Strimel, G.: Toward interactive grounded language acquisition. In: Proceedings of RSS (2013)
Kong, C., Lin, D., Bansal, M., Urtasun, R., Fidler, S.: What are you talking about? Text-to-image coreference. In: Proceedings of CVPR (2014)
Krishnamurthy, J., Kollar, T.: Jointly learning to parse and perceive: connecting natural language to the physical world. Trans. Assoc. Comput. Linguist. (2013)
Lowe, D.G.: Distinctive image features from scale-invariant keypoints. Int. J. Comput. Vis. (2004)
Malmaud, J., Huang, J., Rathod, V., Johnston, N., Rabinovich, A., Murphy, K.: What’s cookin’? Interpreting cooking videos using text, speech and vision (2015). arXiv:1503.01558
Martín-Martín, R., Brock, O.: Online interactive perception of articulated objects with multi-level recursive estimation based on task-specific priors. In: Proceedings of IROS (2014)
Martín-Martín, R., Höfer, S., Brock, O.: An integrated approach to visual perception of articulated objects. In: Proceedings of ICRA (2016)
Matuszek, C., Fox, D., Koscher, K.: Following directions using statistical machine translation. In: Proceedings of HRI (2010)
Matuszek, C., Herbst, E., Zettlemoyer, L., Fox, D.: Learning to parse natural language commands to a robot control system. In: Proceedings of ISER (2012)
Mei, H., Bansal, M., Walter, M.R.: Listen, attend, and walk: neural mapping of navigational instructions to action sequences. In: Proceedings of AAAI (2016)
Misra, D.K., Sung, J., Lee, K., Saxena, A.: Tell me Dave: context-sensitive grounding of natural language to manipulation instructions. Int. J. Robot. Res. (2016)
Olson, E.: AprilTag: a robust and flexible visual fiducial system. In: Proceedings of ICRA (2011)
Ordonez, V., Kulkarni, G., Berg, T.L.: Im2Text: describing images using 1 million captioned photographs. In: Proceedings of NIPS (2011)
Paul, R., Arkin, J., Roy, N., Howard, T.M.: Efficient grounding of abstract spatial concepts for natural language interaction with robot manipulators. In: Proceedings of RSS (2016)
Pillai, S., Walter, M.R., Teller, S.: Learning articulated motions from visual demonstration. In: Proceedings of RSS (2014)
Pronobis, A., Jensfelt, P.: Large-scale semantic mapping and reasoning with heterogeneous modalities. In: Proceedings of ICRA (2012)
Ramanathan, V., Joulin, A., Liang, P., Fei-Fei, L.: Linking people in videos with their names using coreference resolution. In: Proceedings of ECCV (2014)
Schmidt, T., Newcombe, R., Fox, D.: DART: dense articulated real-time tracking. In: Proceedings of RSS (2014)
Sener, O., Zamir, A.R., Savarese, S., Saxena, A.: Unsupervised semantic parsing of video collections. In: Proceedings of ICCV (2015)
Srivastava, N., Mansimov, E., Salakhutdinov, R.: Unsupervised learning of video representations using LSTMs. In: Proceedings of ICML (2015)
Sturm, J., Stachniss, C., Burgard, W.: A probabilistic framework for learning kinematic models of articulated objects. J. Artif. Intell. Res. (2011)
Sung, J., Jin, S.H., Saxena, A.: Robobarista: object part-based transfer of manipulation trajectories from crowd-sourcing in 3D pointclouds. In: Proceedings of ISRR (2015)
Tellex, S., Kollar, T., Dickerson, S., Walter, M.R., Banerjee, A.G., Teller, S., Roy, N.: Understanding natural language commands for robotic navigation and mobile manipulation. In: Proceedings of AAAI (2011)
Vinyals, O., Toshev, A., Bengio, S., Erhan, D.: Show and tell: a neural image caption generator. In: Proceedings of CVPR (2015)
Walter, M.R., Hemachandra, S., Homberg, B., Tellex, S., Teller, S.: Learning semantic maps from natural language descriptions. In: Proceedings of RSS (2013)
Wang, H., Kläser, A., Schmid, C., Liu, C.L.: Action recognition by dense trajectories. In: Proceedings of CVPR (2011)
Winograd, T.: Understanding natural language. Cogn. Psychol. (1972)
Xu, K., Ba, J., Kiros, R., Cho, K., Courville, A., Salakhutdinov, R., Zemel, R., Bengio, Y.: Show, attend and tell: neural image caption generation with visual attention. In: Proceedings of ICML (2015)
Yan, J., Pollefeys, M.: A general framework for motion segmentation: independent, articulated, rigid, non-rigid, degenerate and non-degenerate. In: Proceedings of ECCV (2006)
Yang, Y., Li, Y., Fermüller, C., Aloimonos, Y.: Robot learning manipulation action plans by “watching” unconstrained videos from the world wide web. In: Proceedings of AAAI (2015)
Yu, S.I., Jiang, L., Hauptmann, A.: Instructional videos for unsupervised harvesting and learning of action examples. In: Proceedings of International Conference on Multimedia (2014)
Zender, H., Martínez Mozos, O., Jensfelt, P., Kruijff, G., Burgard, W.: Conceptual spatial representations for indoor mobile robots. Robot. Auton. Syst. (2008)
Acknowledgements
This work was supported in part by the National Science Foundation under grants IIS-1638072 and IIS-1637813, and by the Robotics Consortium of the U.S. Army Research Laboratory under the Collaborative Technology Alliance Program Cooperative Agreement W911NF-10-2-0016.
© 2020 Springer Nature Switzerland AG
Cite this paper
Daniele, A.F., Howard, T.M., Walter, M.R. (2020). A Multiview Approach to Learning Articulated Motion Models. In: Amato, N., Hager, G., Thomas, S., Torres-Torriti, M. (eds) Robotics Research. Springer Proceedings in Advanced Robotics, vol 10. Springer, Cham. https://doi.org/10.1007/978-3-030-28619-4_30
DOI: https://doi.org/10.1007/978-3-030-28619-4_30
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-28618-7
Online ISBN: 978-3-030-28619-4
eBook Packages: Intelligent Technologies and Robotics