Abstract
Pointing gestures are an intuitive and ubiquitous means of human communication and thus constitute a crucial aspect of human-robot interaction. However, recognizing pointing in isolation is not sufficient, as humans usually accompany their gestures with natural language commands. Since ambiguities can occur both visually and textually, an interactive dialog is required to resolve the user's intentions. In this work, we tackle this problem and present a system for interactive, multimodal, task-oriented robot dialog using pointing gesture recognition. Specifically, we propose a pipeline built from state-of-the-art computer vision components that recognize objects, hands, hand orientation, and human pose, and combine this information to detect not only the presence of a pointing gesture but also the object being pointed at. Furthermore, we provide a natural language understanding module that uses the pointing information to distinguish unambiguous from ambiguous commands and responds accordingly. Both components are integrated into the proposed interactive, multimodal dialog system. For evaluation purposes, we introduce a challenging benchmark set for pointing recognition from human demonstration videos in unconstrained real-world scenes. Finally, we present experimental results for both the individual components and the overall dialog system.
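One common way to resolve the pointed-at object from pose keypoints, sketched below for illustration, is to cast a ray along the forearm and select the detected object with the smallest angular deviation from that ray. This is a minimal sketch of that general idea, not the paper's exact method; the function name, the elbow-wrist ray approximation, and the 30° acceptance cone are assumptions chosen for the example.

```python
import math


def point_at_target(elbow, wrist, objects, max_angle_deg=30.0):
    """Pick the object closest to the forearm pointing ray.

    elbow, wrist: (x, y, z) body keypoints from a pose estimator.
    objects: mapping of object name -> (x, y, z) center.
    Returns (name, angle_deg) for the best match, or None if no
    object lies within the acceptance cone (an ambiguous pointing
    gesture a dialog system would then ask about).
    """
    # Approximate the pointing direction by the elbow-to-wrist vector.
    d = tuple(w - e for w, e in zip(wrist, elbow))
    dn = math.sqrt(sum(c * c for c in d))
    best = None
    for name, center in objects.items():
        v = tuple(c - w for c, w in zip(center, wrist))
        vn = math.sqrt(sum(c * c for c in v))
        if dn == 0.0 or vn == 0.0:
            continue
        # Angle between the pointing ray and the wrist-to-object vector.
        cos_a = sum(a * b for a, b in zip(d, v)) / (dn * vn)
        angle = math.degrees(math.acos(max(-1.0, min(1.0, cos_a))))
        if angle <= max_angle_deg and (best is None or angle < best[1]):
            best = (name, angle)
    return best
```

A `None` result signals visual ambiguity, which is exactly the case where the dialog component would ask a clarification question instead of acting.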
Acknowledgements
This work has been supported by the German Federal Ministry of Education and Research (BMBF) under the project OML (01IS18040A).
Cite this paper
Constantin, S., Eyiokur, F.I., Yaman, D., Bärmann, L., Waibel, A. (2023). Interactive Multimodal Robot Dialog Using Pointing Gesture Recognition. In: Karlinsky, L., Michaeli, T., Nishino, K. (eds) Computer Vision – ECCV 2022 Workshops. ECCV 2022. Lecture Notes in Computer Science, vol 13806. Springer, Cham. https://doi.org/10.1007/978-3-031-25075-0_43