
Learning Macro Actions from Instructional Videos Through Integration of Multiple Modalities

Published in: International Journal of Social Robotics

Abstract

We propose an architecture for a system that “watches and listens to” an instructional video of a human performing a task and translates the audio and video information into a task for a robot to perform. This makes it possible to train robots with readily available instructional videos from the Internet instead of programming them. We implemented an operational prototype of the architecture and showed that it could “watch and listen to” two instructional videos on how to clean golf clubs and translate their audio and video information into tasks for a robot to perform. The key contributions of the architecture are: integration of multiple modalities using trees and pruning with filters; decomposition of tasks into macro-tasks composed of parameterized task-primitives and other macro-tasks, where the parameters of a task-primitive are an action (e.g., dip, clean, dry) taken on an object (e.g., golf club) using a tool (e.g., pail of water, brush, towel); and context for resolving missing and implied task-primitive parameter values, represented as a set of canonical parameter values, each with a confidence score based on how many times the value was detected in the video and audio and how long ago it was detected.
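To make the task decomposition and the context scoring concrete, the following is a minimal Python sketch of the structures the abstract describes. It is an illustration, not the authors' implementation: the class names, the exponential form of the recency discount, and the 0.9 decay rate are all assumptions; the paper only specifies that confidence depends on how often and how recently a value was detected.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Union


@dataclass
class TaskPrimitive:
    """One parameterized step: an action taken on an object using a tool."""
    action: str  # e.g., "dip", "clean", "dry"
    obj: str     # e.g., "golf club"
    tool: str    # e.g., "pail of water", "brush", "towel"


@dataclass
class MacroTask:
    """A macro-task: a named sequence of task-primitives and other macro-tasks."""
    name: str
    steps: List[Union[TaskPrimitive, "MacroTask"]] = field(default_factory=list)


class Context:
    """Candidate values for one task-primitive parameter slot (e.g., the tool).

    Each detection in the video/audio adds weight; weight decays with the time
    elapsed since the detection, so the best candidate reflects both how often
    and how recently a value was seen.
    """

    def __init__(self, decay: float = 0.9):  # decay rate is an assumption
        self.decay = decay
        self.detections: Dict[str, List[float]] = {}  # value -> detection times

    def detect(self, value: str, t: float) -> None:
        self.detections.setdefault(value, []).append(t)

    def confidence(self, value: str, now: float) -> float:
        # One term per detection, discounted by how long ago it occurred.
        return sum(self.decay ** (now - t) for t in self.detections.get(value, []))

    def best(self, now: float) -> str:
        return max(self.detections, key=lambda v: self.confidence(v, now))


# Usage: fill an implied "tool" parameter from context, then build a macro-task.
tools = Context()
tools.detect("towel", t=3.0)
tools.detect("brush", t=10.0)
tools.detect("brush", t=12.0)
clean_clubs = MacroTask(name="clean golf clubs", steps=[
    TaskPrimitive("dip", "golf club", "pail of water"),
    TaskPrimitive("clean", "golf club", tools.best(now=13.0)),  # -> "brush"
    TaskPrimitive("dry", "golf club", "towel"),
])
```

A macro-task built this way is a tree whose leaves are fully parameterized task-primitives, so a missing parameter (here, the tool for “clean”) is filled from whichever canonical value currently scores highest in context.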



Abbreviations

HMM: Hidden Markov Model
IBL: Instruction Based Learning
PbD: Programming by Demonstration
RbD: Robot Programming by Demonstration
SIFT: Scale Invariant Feature Transform
XML: Extensible Markup Language


Author information


Correspondence to David O. Johnson.


Cite this article

Johnson, D.O., Agah, A. Learning Macro Actions from Instructional Videos Through Integration of Multiple Modalities. Int J of Soc Robotics 5, 53–73 (2013). https://doi.org/10.1007/s12369-012-0167-6

