ABSTRACT
We propose a method for detecting a group’s focus of attention: the visual target toward which the majority of participants in a conversation direct their gaze. This information enables a robot to infer important conversational cues and adjust its behavior to support more natural conversational interactions. Our approach uses a Hidden Markov Model based on mimicry: the robot observes the head orientations of participants, infers their gaze directions, and from these identifies the group’s focus of attention. We demonstrate the method by having the robot replicate the gaze patterns of the group members, showing that it can accurately determine the focal point. We evaluated our algorithm on a combination of datasets and real-world scenarios with a Fetch robot, achieving 81% accuracy compared to a 54% baseline. The proposed method has the potential to significantly improve group-oriented human-robot interaction.
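To make the HMM idea above concrete, the following is a minimal sketch, not the paper’s implementation: it assumes the hidden state is the group’s current attention target, the per-frame observation is the target that most participants’ heads appear to point toward, and it decodes the most likely target sequence with the Viterbi algorithm. The state list, the transition and emission probabilities, and the example observation sequence are all illustrative assumptions, not parameters from the paper.

```python
# Illustrative sketch of an HMM over head-orientation observations.
# Hidden state: the group's focus of attention. Observation: the target
# most heads point toward in a frame. All numbers are assumed, not learned.
import numpy as np

STATES = ["person_A", "person_B", "robot"]  # assumed candidate attention targets
N = len(STATES)

# Assumed transition model: the group's attention tends to persist across frames.
TRANS = np.full((N, N), 0.1)
np.fill_diagonal(TRANS, 0.8)

# Assumed emission model: head orientation usually, but not always, agrees
# with the true gaze target, so the observed majority target matches the
# hidden state with high probability.
EMIT = np.full((N, N), 0.15)
np.fill_diagonal(EMIT, 0.7)

PRIOR = np.full(N, 1.0 / N)  # uniform prior over targets

def viterbi(obs):
    """Decode the most likely sequence of attention targets.

    obs: list of observation indices in range(N), one per frame
    (the target most participants' heads point toward).
    """
    T = len(obs)
    logp = np.log(PRIOR) + np.log(EMIT[:, obs[0]])
    back = np.zeros((T, N), dtype=int)
    for t in range(1, T):
        scores = logp[:, None] + np.log(TRANS)  # scores[i, j]: from i to j
        back[t] = scores.argmax(axis=0)
        logp = scores.max(axis=0) + np.log(EMIT[:, obs[t]])
    path = [int(logp.argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t][path[-1]]))
    return [STATES[s] for s in reversed(path)]

# Heads mostly point at person_A, flicker briefly, then shift to the robot;
# the decoder smooths the one-frame flicker into a stable focus estimate.
print(viterbi([0, 0, 1, 0, 2, 2, 2]))
# -> ['person_A', 'person_A', 'person_A', 'person_A', 'robot', 'robot', 'robot']
```

Because the assumed transition matrix favors self-transitions, momentary head movements that disagree with the group are absorbed rather than flipping the estimated focus, which is the kind of temporal smoothing a noisy head-orientation signal needs.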