DOI: 10.1145/3640794.3665575
Work in Progress

Situated Conversational Agents for Task Guidance: A Preliminary User Study

Published: 08 July 2024

Abstract

Multimodal large language models have enabled a new generation of Conversational Agents (CAs) that leverage the structure of human discourse to encode and decode multimedia formats (e.g., video-to-audio). These next-generation CAs can be useful in task guidance scenarios, where the user's attention is limited and verbal instructions can be overwhelming. In this paper, we explore the role of non-verbal conversational cues in identifying and recovering from errors while performing various assembly tasks. Findings from an exploratory Wizard-of-Oz study (N=8) indicate individual differences and preferences for auditory guidance. Combining these initial findings with our early exploration of a task monitoring system, we discuss implications for the emerging area of situated multimodal CAs for physical task guidance, where conversational interactions take visual task actions as input and generate auditory feedback.
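
To make the interaction loop the abstract describes more concrete (visual task actions in, auditory feedback out), the sketch below mocks up one guidance cycle in Python. It is a minimal illustration, not the authors' system: the task states, function names, and earcon names are all hypothetical assumptions, and the vision step is stubbed out where a multimodal model would actually run.

```python
# Hypothetical sketch of a situated task-guidance loop: watch the user's
# actions, detect errors, and respond with audio rather than on-screen text.
# All names and states here are illustrative, not the paper's implementation.
from dataclasses import dataclass
from enum import Enum, auto
from typing import Optional

class TaskState(Enum):
    ON_TRACK = auto()
    ERROR_DETECTED = auto()
    STEP_COMPLETE = auto()

@dataclass
class Feedback:
    earcon: str                    # short non-speech cue (tone, click)
    speech: Optional[str] = None   # verbal detail, used only when needed

def monitor_step(frame: bytes, expected_step: str) -> TaskState:
    """Stub for the visual task monitor. A real system would send the camera
    frame and the expected assembly step to a multimodal model and parse its
    judgment; here we simply assume the user is on track."""
    return TaskState.ON_TRACK

def select_feedback(state: TaskState) -> Feedback:
    # Prefer non-speech audio so the user's attention stays on the task;
    # reserve verbal instructions for error recovery.
    if state is TaskState.ERROR_DETECTED:
        return Feedback(earcon="error_tone",
                        speech="That part looks reversed; flip it and reattach.")
    if state is TaskState.STEP_COMPLETE:
        return Feedback(earcon="success_tone")
    return Feedback(earcon="progress_tone")

if __name__ == "__main__":
    # One cycle of the loop with a dummy frame.
    state = monitor_step(frame=b"", expected_step="attach side panel")
    fb = select_feedback(state)
    print(f"play earcon: {fb.earcon}; speak: {fb.speech!r}")
```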

Supplemental Material

ZIP file containing MP4 files of the guidance sounds and the study materials (survey and questionnaire).


Published In

CUI '24: Proceedings of the 6th ACM Conference on Conversational User Interfaces
July 2024
616 pages
ISBN: 9798400705113
DOI: 10.1145/3640794
Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the Owner/Author.

Publisher

Association for Computing Machinery

New York, NY, United States


Author Tags

  1. audio augmented reality
  2. conversational agents
  3. multimodal
  4. physical assembly
  5. task guidance

Qualifiers

  • Work in progress
  • Research
  • Refereed limited

Conference

CUI '24: ACM Conversational User Interfaces 2024
July 8-10, 2024
Luxembourg, Luxembourg

Acceptance Rates

Overall Acceptance Rate 34 of 100 submissions, 34%

