
Added Value of Gaze-Exploiting Semantic Representation to Allow Robots Inferring Human Behaviors

Published: 23 March 2017

Abstract

Neuroscience studies have shown that combining the first-person gaze view with a third-person perspective strongly influences the correct inference of human behaviors. Given the importance of both first- and third-person observations for recognizing human behaviors, we propose a method that incorporates these observations into a technical system, improving on third-person observations alone and yielding a more robust human activity recognition system. First, we present an extension of our semantic reasoning method that takes gaze data and external observations as inputs to segment and infer human behaviors in complex real-world scenarios. The results demonstrate that combining gaze and external input sources substantially enhances the recognition of human behaviors. We applied our findings to a humanoid robot that segments and recognizes the observed human activities online, achieving higher accuracy when both input sources are used; for example, activity recognition on our proposed pancake-making dataset increases from 77% to 82%. To assess the completeness of our system, we also evaluated our approach on the CMU-MMAC dataset, which has a setup similar to the one proposed in this work. There, combining the external views with the gaze information improved the recognition of activities in the egg-scrambling scenario from 54% to 86%, showing the benefit of incorporating gaze information to infer human behaviors across different datasets.
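The core idea of fusing per-frame third-person features (hand motion, object in hand) with the first-person gaze target can be illustrated with a small rule-based sketch. This is a hypothetical, simplified illustration, not the authors' implementation: the feature names, rules, and activity labels below are assumptions chosen for clarity, and in practice such rules would be learned from labeled data rather than hand-written.

    from dataclasses import dataclass
    from typing import Optional


    @dataclass
    class Observation:
        hand_moving: bool               # from the external (third-person) camera
        object_in_hand: Optional[str]   # from the external camera
        gazed_object: Optional[str]     # from the first-person gaze tracker


    def infer_activity(obs: Observation) -> str:
        # Toy decision rules: the gaze target disambiguates situations that the
        # external view alone cannot resolve (e.g., reaching vs. aimless motion).
        if not obs.hand_moving:
            return "idle" if obs.object_in_hand is None else "hold"
        if obs.object_in_hand is None:
            return "reach" if obs.gazed_object is not None else "move"
        if obs.gazed_object is not None and obs.gazed_object != obs.object_in_hand:
            return "put"        # moving an object while already fixating its target
        return "transport"      # moving an object, no distinct gaze target yet


    if __name__ == "__main__":
        stream = [
            Observation(hand_moving=True,  object_in_hand=None,     gazed_object="bottle"),
            Observation(hand_moving=True,  object_in_hand="bottle", gazed_object="pan"),
            Observation(hand_moving=False, object_in_hand="bottle", gazed_object="pan"),
        ]
        for t, obs in enumerate(stream):
            print(t, infer_activity(obs))

The sketch only illustrates how the gaze cue adds a discriminative input on top of the external view; applied per frame, the predicted labels also induce a segmentation of the observed sequence.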

Supplementary Material

ramirez-amaro (ramirez-amaro.zip)
Supplemental movie, appendix, image, and software files for "Added Value of Gaze-Exploiting Semantic Representation to Allow Robots Inferring Human Behaviors".





Published In

ACM Transactions on Interactive Intelligent Systems, Volume 7, Issue 1
March 2017
175 pages
ISSN:2160-6455
EISSN:2160-6463
DOI:10.1145/3028254

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 23 March 2017
Accepted: 01 November 2016
Revised: 01 October 2016
Received: 01 February 2016
Published in TIIS Volume 7, Issue 1


Author Tags

  1. Robot learning by observation
  2. egocentric analysis
  3. human activity recognition
  4. semantic reasoning

Qualifiers

  • Research-article
  • Research
  • Refereed

Funding Sources

  • European Community's Seventh Framework Programme (FP7/2007-2013)
  • ICT Call 7 ROBOHOW.COG (FP7-ICT)
  • DFG cluster of excellence Cognition for Technical Systems CoTeSys
  • CONACYT-DAAD scholarship


