Abstract
Pointing gestures are an intuitive and ubiquitous means of human communication and thus constitute a crucial aspect of human-robot interaction. However, recognizing pointing in isolation is not sufficient, as humans usually accompany their gestures with natural language commands. Since ambiguities can occur both visually and textually, an interactive dialog is required to resolve the user's intentions. In this work, we tackle this problem and present a system for interactive, multimodal, task-oriented robot dialog using pointing gesture recognition. Specifically, we propose a pipeline built from state-of-the-art computer vision components that recognize objects, hands, hand orientation, and human pose, and combine this information to detect not only the presence of a pointing gesture but also the object being pointed at. Furthermore, we provide a natural language understanding module that uses the pointing information to distinguish unambiguous from ambiguous commands and responds accordingly. Both components are integrated into the proposed interactive, multimodal dialog system. For evaluation purposes, we introduce a challenging benchmark set for pointing recognition from human demonstration videos in unconstrained real-world scenes. Finally, we present experimental results for both the individual components and the overall dialog system.
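One common way to resolve the pointed-at object from pose keypoints, sketched below for illustration, is to cast a ray along the forearm and select the detected object with the smallest angular deviation from that ray. This is a minimal sketch of that general idea, not the paper's exact method; the function name, the elbow-wrist ray approximation, and the 30° acceptance cone are assumptions chosen for the example.

```python
import math


def point_at_target(elbow, wrist, objects, max_angle_deg=30.0):
    """Pick the object closest to the forearm pointing ray.

    elbow, wrist: (x, y, z) body keypoints from a pose estimator.
    objects: mapping of object name -> (x, y, z) center.
    Returns (name, angle_deg) for the best match, or None if no
    object lies within the acceptance cone (an ambiguous pointing
    gesture a dialog system would then ask about).
    """
    # Approximate the pointing direction by the elbow-to-wrist vector.
    d = tuple(w - e for w, e in zip(wrist, elbow))
    dn = math.sqrt(sum(c * c for c in d))
    best = None
    for name, center in objects.items():
        v = tuple(c - w for c, w in zip(center, wrist))
        vn = math.sqrt(sum(c * c for c in v))
        if dn == 0.0 or vn == 0.0:
            continue
        # Angle between the pointing ray and the wrist-to-object vector.
        cos_a = sum(a * b for a, b in zip(d, v)) / (dn * vn)
        angle = math.degrees(math.acos(max(-1.0, min(1.0, cos_a))))
        if angle <= max_angle_deg and (best is None or angle < best[1]):
            best = (name, angle)
    return best
```

A `None` result signals visual ambiguity, which is exactly the case where the dialog component would ask a clarification question instead of acting.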
Acknowledgements
This work has been supported by the German Federal Ministry of Education and Research (BMBF) under the project OML (01IS18040A).
Cite this paper
Constantin, S., Eyiokur, F.I., Yaman, D., Bärmann, L., Waibel, A. (2023). Interactive Multimodal Robot Dialog Using Pointing Gesture Recognition. In: Karlinsky, L., Michaeli, T., Nishino, K. (eds) Computer Vision – ECCV 2022 Workshops. ECCV 2022. Lecture Notes in Computer Science, vol 13806. Springer, Cham. https://doi.org/10.1007/978-3-031-25075-0_43