Abstract
We propose an architecture for a system that “watches and listens to” an instructional video of a human performing a task and translates the audio and video information into a task for a robot to perform. This enables robots to be trained from readily available instructional videos on the Internet instead of being explicitly programmed. We implemented an operational prototype based on the architecture and showed that it could “watch and listen to” two instructional videos on how to clean golf clubs and translate the audio and video information into tasks for a robot to perform. The key contributions of this architecture are: (1) integration of multiple modalities using trees and pruning with filters; (2) task decomposition into macro-tasks composed of parameterized task-primitives and other macro-tasks, where a task-primitive’s parameters are an action (e.g., dip, clean, dry) taken on an object (e.g., golf club) using a tool (e.g., pail of water, brush, towel); and (3) context, used to determine missing and implied task-primitive parameter values, represented as a set of canonical parameter values, each with a confidence score based on how many times the value was detected in the audio and video information and how long ago it was detected.
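To make the two structural contributions concrete, the following is a minimal sketch, not taken from the paper: a recursive macro-task built from action/object/tool task-primitives, and a context store that scores canonical parameter values by detection count and recency. All class names and the exponential-decay scoring rule are assumptions for illustration only.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Union

@dataclass
class TaskPrimitive:
    """An action taken on an object using a tool."""
    action: str  # e.g. "dip", "clean", "dry"
    obj: str     # e.g. "golf club"
    tool: str    # e.g. "pail of water", "brush", "towel"

@dataclass
class MacroTask:
    """A macro-task composed of task-primitives and other macro-tasks."""
    name: str
    steps: List[Union["MacroTask", TaskPrimitive]] = field(default_factory=list)

class Context:
    """Canonical parameter values with confidence scores.

    A score grows each time a value is detected in the audio/video
    streams and decays as detections age (the decay rule here is a
    hypothetical stand-in for the paper's scoring scheme).
    """
    def __init__(self, decay: float = 0.9):
        self.decay = decay
        self.scores: Dict[str, float] = {}

    def observe(self, value: str) -> None:
        # Age every existing score, then credit the new detection.
        for v in self.scores:
            self.scores[v] *= self.decay
        self.scores[value] = self.scores.get(value, 0.0) + 1.0

    def best(self) -> str:
        # Fill a missing/implied parameter with the top-scoring value.
        return max(self.scores, key=self.scores.get)

# A missing tool parameter resolves to the most often and most
# recently detected candidate:
ctx = Context()
for detected in ["brush", "towel", "brush", "brush"]:
    ctx.observe(detected)
print(ctx.best())  # -> brush
```

Under this sketch, a macro-task such as “clean golf clubs” would nest primitives like `TaskPrimitive("dip", "golf club", "pail of water")`, and any primitive whose tool the video never states explicitly would query the context store.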
Abbreviations
- HMM: Hidden Markov Model
- IBL: Instruction Based Learning
- PbD: Programming by Demonstration
- RbD: Robot Programming by Demonstration
- SIFT: Scale Invariant Feature Transform
- XML: Extensible Mark-up Language
Johnson, D.O., Agah, A. Learning Macro Actions from Instructional Videos Through Integration of Multiple Modalities. Int J of Soc Robotics 5, 53–73 (2013). https://doi.org/10.1007/s12369-012-0167-6