Abstract
This research proposes an approach to long-horizon manipulation that imitates human actions from video and kinesthetic demonstrations. Task learning proceeds in two stages. In the first stage, the Task Sequencing Network (TSNet), a hybrid neural network combining a Convolutional Neural Network (CNN), a Recurrent Neural Network (RNN), and a Connectionist Temporal Classification (CTC) loss, learns the sequence of sub-actions from the video demonstration. In the second stage, task-agnostic task primitives are learned from kinesthetic demonstrations via dynamic movement primitive (DMP) models. To encode the semantic relationship between sub-actions and objects, a Multi-relational Embedding Network (MRE), with YOLOv4 for object detection, estimates the affordances of the objects in the scene. For tasks such as liquid pouring, table cleaning, and object placement, the proposed imitation learning approach decouples task planning from task execution, resulting in effective sub-action sequencing and faster, more precise learning of sub-action execution.
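To make the first stage concrete, below is a minimal PyTorch sketch of a TSNet-style model: a per-frame CNN encoder feeds a recurrent layer, and the frame-level sub-action logits are trained with the CTC loss so the network can label unsegmented video demonstrations. The backbone, layer sizes, and label set here are illustrative assumptions, not the authors' exact architecture.

```python
import torch
import torch.nn as nn

class TSNet(nn.Module):
    """Hybrid CNN-RNN sub-action sequencer trained with CTC loss (sketch)."""
    def __init__(self, num_subactions, feat_dim=256, hidden_dim=128):
        super().__init__()
        # Per-frame CNN encoder (toy backbone for illustration).
        self.cnn = nn.Sequential(
            nn.Conv2d(3, 32, 5, stride=2, padding=2), nn.ReLU(),
            nn.Conv2d(32, 64, 5, stride=2, padding=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, feat_dim), nn.ReLU(),
        )
        # RNN aggregates per-frame features over the demonstration.
        self.rnn = nn.GRU(feat_dim, hidden_dim,
                          batch_first=True, bidirectional=True)
        # One extra output class for the CTC blank symbol.
        self.head = nn.Linear(2 * hidden_dim, num_subactions + 1)

    def forward(self, frames):  # frames: (B, T, 3, H, W)
        b, t = frames.shape[:2]
        feats = self.cnn(frames.flatten(0, 1)).view(b, t, -1)
        out, _ = self.rnn(feats)
        # nn.CTCLoss expects (T, B, C) log-probabilities.
        return self.head(out).log_softmax(-1).transpose(0, 1)

# Training step: align unsegmented frame streams to sub-action sequences.
model = TSNet(num_subactions=10)
ctc = nn.CTCLoss(blank=10)                   # blank index = num_subactions
frames = torch.randn(2, 40, 3, 64, 64)       # two demos, 40 frames each
labels = torch.tensor([1, 4, 2, 3, 0, 5])    # concatenated target sequences
loss = ctc(model(frames), labels,
           input_lengths=torch.full((2,), 40),
           target_lengths=torch.tensor([3, 3]))
loss.backward()
```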
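The second stage's motion encoding can be summarized by the standard discrete DMP formulation, shown below in one common variant; the paper may use a different parameterization (e.g., Cartesian-orientation DMPs). A demonstrated trajectory is reproduced by a spring-damper transformation system modulated by a learned forcing term:

```latex
% One-DoF discrete DMP: transformation system, canonical system, forcing term.
\begin{aligned}
\tau \dot{v} &= \alpha_z \bigl( \beta_z (g - x) - v \bigr) + f(s), &
\tau \dot{x} &= v, &
\tau \dot{s} &= -\alpha_s s, \\
f(s) &= \frac{\sum_i \psi_i(s)\, w_i}{\sum_i \psi_i(s)}\, s \,(g - x_0), &
\psi_i(s) &= \exp\!\bigl(-h_i (s - c_i)^2\bigr). &&
\end{aligned}
```

Here x is the controlled coordinate, g the goal, x_0 the start, and the weights w_i are fit to the kinesthetic demonstration (e.g., by locally weighted regression), which is what makes each learned primitive reusable across new start and goal poses.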
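For the affordance model, one way to realize a multi-relational embedding over detected objects is a bilinear scoring function on (object, hasAffordance, action) triples. The sketch below is a hypothetical illustration, with made-up entity names and dimensions, of how object labels (e.g., from a YOLOv4 detector) could be grounded to sub-actions; it is not the authors' MRE implementation.

```python
import torch

dim = 32
objects = {"cup": 0, "sponge": 1, "bottle": 2}   # detector labels (illustrative)
actions = {"pour": 0, "wipe": 1, "place": 2}     # sub-actions (illustrative)

ent = torch.nn.Embedding(len(objects), dim)      # object entity embeddings
act = torch.nn.Embedding(len(actions), dim)      # action entity embeddings
rel = torch.nn.Parameter(torch.randn(dim, dim))  # "hasAffordance" relation

def affordance_score(obj_name, action_name):
    """Bilinear plausibility score for one (object, action) pair."""
    h = ent(torch.tensor(objects[obj_name]))
    t = act(torch.tensor(actions[action_name]))
    return h @ rel @ t

# A detected "cup" would be matched to its highest-scoring sub-action.
print(affordance_score("cup", "pour"))
```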
Notes
- 1. Tasks such as table cleaning, water pouring, and table arrangement that involve multiple, precise object manipulations over a long time span.
- 2. Tasks that are combinations of multiple task primitives.
Copyright information
© 2024 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Singh, N. et al. (2024). Imitation Learning of Long-Horizon Manipulation Tasks Through Temporal Sub-action Sequencing. In: Kaur, H., Jakhetiya, V., Goyal, P., Khanna, P., Raman, B., Kumar, S. (eds) Computer Vision and Image Processing. CVIP 2023. Communications in Computer and Information Science, vol 2010. Springer, Cham. https://doi.org/10.1007/978-3-031-58174-8_30
Print ISBN: 978-3-031-58173-1
Online ISBN: 978-3-031-58174-8