
A Multi-modal Framework for Robots to Learn Manipulation Tasks from Human Demonstrations

  • Regular paper
  • Published in Journal of Intelligent & Robotic Systems

Abstract

Enabling robots to learn manipulation tasks by observing human demonstrations remains a major challenge. Recent advances in video captioning provide an end-to-end way to translate demonstration videos into robotic commands. Compared with general video captioning, the Video-to-Command (V2C) task faces two key challenges: (1) how to extract key frames containing fine-grained manipulation actions from demonstration videos that carry a large amount of redundant information, and (2) how to raise the accuracy of the generated commands enough for the V2C method to be applied to real robot tasks. To address these problems, we propose a multi-modal framework for robots to learn manipulation tasks from human demonstrations. The framework consists of five components: a Text Encoder, a Video Encoder, an Action Classifier, a Keyframe Aligner, and a Command Decoder. Within this framework, our work makes two main contributions: (1) we extract key-frame information from the video and analyze its effect on the translation accuracy of robot commands, and (2) building on the video and caption text, we study how multi-modal information fusion improves the accuracy of the commands generated by the model. Experiments show that our model clearly outperforms existing methods on the standard video captioning metrics BLEU_N, METEOR, ROUGE_L, and CIDEr. In particular, the variant CGM-V, which uses only video information, improves BLEU_4 by 0.8%, while the multi-modal variant CGM-M improves BLEU_4 by 43.7%. Furthermore, when combined with an affordance detection network and a motion planner, our framework enables the robot to reproduce the tasks shown in the demonstration. Our source code and expanded annotations for the IIT-V2C dataset are available at https://github.com/yin0816/CGM-M.
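
As a rough illustration only, the sketch below shows one way the five components named in the abstract (Text Encoder, Video Encoder, Action Classifier, Keyframe Aligner, Command Decoder) could be wired together in PyTorch. All module choices, dimensions, the soft-attention key-frame weighting, and the additive fusion are assumptions for exposition, not the authors' CGM-M implementation; the actual code is available at the repository linked above.

```python
# Hypothetical sketch of a five-component video-to-command pipeline.
# Module types, sizes, and the fusion strategy are illustrative assumptions.
import torch
import torch.nn as nn


class CommandGenerationSketch(nn.Module):
    def __init__(self, vocab_size=1000, feat_dim=2048, hidden=512, n_actions=46):
        super().__init__()
        # Video Encoder: consumes pre-extracted clip features (e.g. from a 3D CNN)
        self.video_encoder = nn.LSTM(feat_dim, hidden, batch_first=True)
        # Text Encoder: embeds caption tokens used as the second modality
        self.text_embed = nn.Embedding(vocab_size, hidden)
        self.text_encoder = nn.LSTM(hidden, hidden, batch_first=True)
        # Action Classifier: predicts the manipulation action of the clip
        self.action_head = nn.Linear(hidden, n_actions)
        # Keyframe Aligner: scores frames so informative ones dominate the summary
        self.keyframe_scorer = nn.Linear(hidden, 1)
        # Command Decoder: generates the robot command token by token
        self.decoder = nn.LSTM(hidden, hidden, batch_first=True)
        self.out = nn.Linear(hidden, vocab_size)

    def forward(self, clip_feats, text_tokens, cmd_tokens):
        v, _ = self.video_encoder(clip_feats)               # (B, T, H)
        # Key-frame alignment modeled here as soft attention over frames (an assumption)
        w = torch.softmax(self.keyframe_scorer(v), dim=1)   # (B, T, 1)
        video_ctx = (w * v).sum(dim=1)                      # (B, H)
        t, _ = self.text_encoder(self.text_embed(text_tokens))
        text_ctx = t[:, -1]                                 # (B, H)
        fused = video_ctx + text_ctx                        # naive multi-modal fusion
        action_logits = self.action_head(fused)
        # Condition the decoder on the fused context via its initial hidden state
        h0 = fused.unsqueeze(0)
        c0 = torch.zeros_like(h0)
        d, _ = self.decoder(self.text_embed(cmd_tokens), (h0, c0))
        return self.out(d), action_logits


# Shape check with random inputs
model = CommandGenerationSketch()
logits, actions = model(torch.randn(2, 30, 2048),
                        torch.randint(0, 1000, (2, 12)),
                        torch.randint(0, 1000, (2, 8)))
print(logits.shape, actions.shape)  # torch.Size([2, 8, 1000]) torch.Size([2, 46])
```

At inference time a decoder like this would be run autoregressively (e.g. with beam search) and the command sequence handed to an affordance detector and motion planner, mirroring the robot-execution setup the abstract describes.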

Data Availability

The public IIT-V2C dataset is used in this project.

Code Availability

Our source code and expanded annotations for the IIT-V2C dataset are at https://github.com/yin0816/CGM-M.

Funding

No funding was received for conducting this study.

Author information

Contributions

Congcong Yin carried out the main research work and wrote the paper.

Qiuju Zhang supervised the project.

Corresponding author

Correspondence to Qiuju Zhang.

Ethics declarations

Ethics approval

Not applicable.

Consent to participate

Not applicable.

Consent for publication

All authors have read this manuscript and would like to have it considered exclusively for publication in Journal of Intelligent & Robotic Systems.

Conflicts of Interest

The authors declare that they have no conflicts of interest related to this work.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Yin, C., Zhang, Q. A Multi-modal Framework for Robots to Learn Manipulation Tasks from Human Demonstrations. J Intell Robot Syst 107, 56 (2023). https://doi.org/10.1007/s10846-023-01856-9
