Learning Collaborative Action Plans from YouTube Videos

  • Conference paper

Part of the book series: Springer Proceedings in Advanced Robotics ((SPAR,volume 20))

Abstract

Videos from the World Wide Web provide a rich source of information that robots could use to acquire knowledge about manipulation tasks. Previous work has focused on generating action sequences from unconstrained videos for a single robot performing manipulation tasks by itself. However, robots operating in the same physical space as people need to not only perform actions autonomously, but also coordinate seamlessly with their human counterparts. This often requires representing and executing collaborative manipulation actions, such as handing over a tool or holding an object for the other agent. We present a system that acquires collaborative manipulation action plans and outputs commands to the robot in the form of visual sentences. We show the performance of the system on 12 unlabeled action clips taken from collaborative cooking videos on YouTube. We view this as the first step towards extracting collaborative manipulation action sequences from unconstrained, unlabeled online videos.
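
The "visual sentence" command format is not spelled out on this page. As a rough sketch only, assuming each command is essentially a structured (agent, action, object, recipient) tuple, one step of a collaborative plan might be represented as below; the class and field names are illustrative assumptions, not the authors' actual schema.

    from dataclasses import dataclass
    from typing import Optional

    @dataclass
    class VisualSentence:
        # Illustrative fields only; the paper's actual representation may differ.
        agent: str                       # acting agent, e.g. "human" or "robot"
        action: str                      # manipulation verb detected in the clip
        obj: str                         # object being manipulated
        recipient: Optional[str] = None  # partner agent for collaborative actions

    # Example: a hand-over step that a robot could be asked to execute
    step = VisualSentence(agent="human", action="hand_over", obj="bowl", recipient="robot")
    print(step)

Under this assumption, a plan for a full clip could then be a list of such records, one per detected action, handed to the robot's execution layer.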

Notes

  1. https://www.youtube.com/playlist?list=PL1204B2E3981AF56E.

Author information

Corresponding author

Correspondence to Hejia Zhang.

Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (ppt 429 KB)

Copyright information

© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Zhang, H., Lai, PJ., Paul, S., Kothawade, S., Nikolaidis, S. (2022). Learning Collaborative Action Plans from YouTube Videos. In: Asfour, T., Yoshida, E., Park, J., Christensen, H., Khatib, O. (eds) Robotics Research. ISRR 2019. Springer Proceedings in Advanced Robotics, vol 20. Springer, Cham. https://doi.org/10.1007/978-3-030-95459-8_13
