Learning Collaborative Action Plans from YouTube Videos

  • Conference paper

Part of the book series: Springer Proceedings in Advanced Robotics ((SPAR,volume 20))

Abstract

Videos from the World Wide Web provide a rich source of information that robots could use to acquire knowledge about manipulation tasks. Previous work has focused on generating action sequences from unconstrained videos for a single robot performing manipulation tasks by itself. However, robots operating in the same physical space as people need to not only perform actions autonomously, but also coordinate seamlessly with their human counterparts. This often requires representing and executing collaborative manipulation actions, such as handing over a tool or holding an object for the other agent. We present a system that acquires collaborative manipulation action plans and outputs commands to the robot in the form of visual sentences. We show the performance of the system on 12 unlabeled action clips taken from collaborative cooking videos on YouTube. We view this as the first step towards extracting collaborative manipulation action sequences from unconstrained, unlabeled online videos.
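
The "visual sentence" command format is not spelled out on this page. As a rough sketch only, assuming each command is essentially a structured (agent, action, object, recipient) tuple, one step of a collaborative plan might be represented as below; the class and field names are illustrative assumptions, not the authors' actual schema.

    from dataclasses import dataclass
    from typing import Optional

    @dataclass
    class VisualSentence:
        # Illustrative fields only; the paper's actual representation may differ.
        agent: str                       # acting agent, e.g. "human" or "robot"
        action: str                      # manipulation verb detected in the clip
        obj: str                         # object being manipulated
        recipient: Optional[str] = None  # partner agent for collaborative actions

    # Example: a hand-over step that a robot could be asked to execute
    step = VisualSentence(agent="human", action="hand_over", obj="bowl", recipient="robot")
    print(step)

Under this assumption, a plan for a full clip could then be a list of such records, one per detected action, handed to the robot's execution layer.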

Notes

  1. https://www.youtube.com/playlist?list=PL1204B2E3981AF56E.

Author information

Corresponding author

Correspondence to Hejia Zhang.

Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (ppt 429 KB)

Copyright information

© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Zhang, H., Lai, PJ., Paul, S., Kothawade, S., Nikolaidis, S. (2022). Learning Collaborative Action Plans from YouTube Videos. In: Asfour, T., Yoshida, E., Park, J., Christensen, H., Khatib, O. (eds) Robotics Research. ISRR 2019. Springer Proceedings in Advanced Robotics, vol 20. Springer, Cham. https://doi.org/10.1007/978-3-030-95459-8_13
