Imitation Learning of Long-Horizon Manipulation Tasks Through Temporal Sub-action Sequencing

  • Conference paper
Computer Vision and Image Processing (CVIP 2023)

Abstract

This research proposes an approach to long-horizon manipulation that uses video and kinesthetic demonstrations to imitate human actions. Task learning proceeds in two stages. In the first stage, the Task Sequencing Network (TSNet), a hybrid neural network combining a Convolutional Neural Network (CNN), a Recurrent Neural Network (RNN), and a Connectionist Temporal Classification (CTC) loss, learns the sequence of sub-actions in the video demonstration. In the second stage, task-agnostic task primitives are learned from kinesthetic demonstrations as dynamic movement primitive (DMP) models. To encode the semantic relationship between the sub-actions and the objects, a Multi-relational Embedding Network (MRE) that uses YOLOv4 for object detection estimates the affordances associated with the objects in the scene. For tasks such as liquid pouring, table cleaning, and object placement, the proposed imitation learning approach learns task planning and execution in a decoupled manner, resulting in effective sub-action sequencing and quicker, more precise learning of sub-action execution.
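The abstract does not say how TSNet's output is turned into a sub-action sequence. As an illustration only, the sketch below shows greedy (best-path) decoding, a common way to read out a CTC-trained network: take the top label per frame, merge consecutive repeats, and drop the blank label. The sub-action names, the blank index, and the per-frame scores are hypothetical and are not taken from the paper.

```python
from itertools import groupby

BLANK = 0  # assumed index of the CTC blank label

# Hypothetical sub-action vocabulary; index 0 is reserved for the blank.
SUB_ACTIONS = ["<blank>", "reach", "grasp", "pour", "place"]


def ctc_greedy_decode(frame_scores, blank=BLANK):
    """Collapse per-frame label scores into a sequence of label indices.

    frame_scores: list of per-frame score lists, one score per label.
    """
    # 1. Pick the highest-scoring label for every frame.
    best_path = [
        max(range(len(scores)), key=scores.__getitem__)
        for scores in frame_scores
    ]
    # 2. Merge runs of identical consecutive labels.
    collapsed = [label for label, _ in groupby(best_path)]
    # 3. Drop the blank label.
    return [label for label in collapsed if label != blank]


# Made-up scores for an 8-frame clip (rows: frames, columns: labels).
frame_scores = [
    [0.10, 0.80, 0.05, 0.03, 0.02],  # reach
    [0.10, 0.70, 0.10, 0.05, 0.05],  # reach
    [0.60, 0.20, 0.10, 0.05, 0.05],  # blank
    [0.10, 0.10, 0.70, 0.05, 0.05],  # grasp
    [0.10, 0.05, 0.75, 0.05, 0.05],  # grasp
    [0.20, 0.05, 0.05, 0.65, 0.05],  # pour
    [0.70, 0.05, 0.05, 0.10, 0.10],  # blank
    [0.10, 0.05, 0.05, 0.10, 0.70],  # place
]

decoded = ctc_greedy_decode(frame_scores)
sequence = [SUB_ACTIONS[i] for i in decoded]
print(sequence)  # ['reach', 'grasp', 'pour', 'place']
```

Because the blank separates repeated labels, the same sub-action can appear twice in a row in the decoded sequence when a blank frame falls between the two occurrences. In a decoupled pipeline of the kind the abstract describes, each decoded label would then select a separately learned motion primitive for execution.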

Notes

  1. Tasks such as table cleaning, water pouring, and table arrangement that involve multiple, precise object manipulations over a long time span.

  2. Tasks are combinations of multiple task primitives.

Author information

Corresponding author

Correspondence to Tushar Sandhan.

Copyright information

© 2024 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Singh, N. et al. (2024). Imitation Learning of Long-Horizon Manipulation Tasks Through Temporal Sub-action Sequencing. In: Kaur, H., Jakhetiya, V., Goyal, P., Khanna, P., Raman, B., Kumar, S. (eds) Computer Vision and Image Processing. CVIP 2023. Communications in Computer and Information Science, vol 2010. Springer, Cham. https://doi.org/10.1007/978-3-031-58174-8_30

  • DOI: https://doi.org/10.1007/978-3-031-58174-8_30

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-58173-1

  • Online ISBN: 978-3-031-58174-8

  • eBook Packages: Computer Science, Computer Science (R0)
