
Learning Macro Actions from Instructional Videos Through Integration of Multiple Modalities

Published in: International Journal of Social Robotics

Abstract

We propose an architecture for a system that “watches and listens to” an instructional video of a human performing a task and translates the audio and video information into a task for a robot to perform. This makes it possible to train robots with readily available instructional videos from the Internet instead of programming them. We implemented an operational prototype of the architecture and showed that it could “watch and listen to” two instructional videos on how to clean golf clubs and translate their audio and video information into tasks for a robot to perform. The key contributions of the architecture are: integration of multiple modalities using trees and pruning with filters; decomposition of tasks into macro-tasks composed of parameterized task-primitives and other macro-tasks, where the parameters of a task-primitive are an action (e.g., dip, clean, dry) taken on an object (e.g., golf club) using a tool (e.g., pail of water, brush, towel); and context for resolving missing and implied task-primitive parameter values, represented as a set of canonical parameter values, each with a confidence score based on how many times the value was detected in the video and audio and how long ago it was detected.
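To make the task decomposition and the context scoring concrete, the following is a minimal Python sketch of the structures the abstract describes. It is an illustration, not the authors' implementation: the class names, the exponential form of the recency discount, and the 0.9 decay rate are all assumptions; the paper only specifies that confidence depends on how often and how recently a value was detected.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Union


@dataclass
class TaskPrimitive:
    """One parameterized step: an action taken on an object using a tool."""
    action: str  # e.g., "dip", "clean", "dry"
    obj: str     # e.g., "golf club"
    tool: str    # e.g., "pail of water", "brush", "towel"


@dataclass
class MacroTask:
    """A macro-task: a named sequence of task-primitives and other macro-tasks."""
    name: str
    steps: List[Union[TaskPrimitive, "MacroTask"]] = field(default_factory=list)


class Context:
    """Candidate values for one task-primitive parameter slot (e.g., the tool).

    Each detection in the video/audio adds weight; weight decays with the time
    elapsed since the detection, so the best candidate reflects both how often
    and how recently a value was seen.
    """

    def __init__(self, decay: float = 0.9):  # decay rate is an assumption
        self.decay = decay
        self.detections: Dict[str, List[float]] = {}  # value -> detection times

    def detect(self, value: str, t: float) -> None:
        self.detections.setdefault(value, []).append(t)

    def confidence(self, value: str, now: float) -> float:
        # One term per detection, discounted by how long ago it occurred.
        return sum(self.decay ** (now - t) for t in self.detections.get(value, []))

    def best(self, now: float) -> str:
        return max(self.detections, key=lambda v: self.confidence(v, now))


# Usage: fill an implied "tool" parameter from context, then build a macro-task.
tools = Context()
tools.detect("towel", t=3.0)
tools.detect("brush", t=10.0)
tools.detect("brush", t=12.0)
clean_clubs = MacroTask(name="clean golf clubs", steps=[
    TaskPrimitive("dip", "golf club", "pail of water"),
    TaskPrimitive("clean", "golf club", tools.best(now=13.0)),  # -> "brush"
    TaskPrimitive("dry", "golf club", "towel"),
])
```

A macro-task built this way is a tree whose leaves are fully parameterized task-primitives, so a missing parameter (here, the tool for “clean”) is filled from whichever canonical value currently scores highest in context.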



Abbreviations

HMM: Hidden Markov Model
IBL: Instruction Based Learning
PbD: Programming by Demonstration
RbD: Robot Programming by Demonstration
SIFT: Scale Invariant Feature Transform
XML: Extensible Markup Language


Author information


Correspondence to David O. Johnson.


Cite this article

Johnson, D.O., Agah, A. Learning Macro Actions from Instructional Videos Through Integration of Multiple Modalities. Int J of Soc Robotics 5, 53–73 (2013). https://doi.org/10.1007/s12369-012-0167-6

