Abstract
When analyzing interactions during collaborative problem solving (CPS) tasks, many different communication modalities are likely to be present and interpretable. These modalities may include speech, gesture, action, affect, pose, and object position in physical space, among others. As AI becomes more prominent in day-to-day use and in learning environments such as classrooms, it has the potential to deepen understanding of how small groups work together to complete CPS tasks. Designing interactive AI to support CPS requires a system that handles multiple modalities. In this paper we discuss the importance of multimodal features for modeling CPS, how different modal channels must interact in a multimodal AI agent that supports a wide range of tasks, and the design considerations that must be addressed up front when building a system that effectively interacts with and aids small groups in completing CPS tasks. We also outline tool sets that can be leveraged to support each individual feature and their integration, as well as applications for such a system.
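To make the idea of integrating separate modal channels concrete, the sketch below shows one possible way to align per-modality features (e.g., speech prosody, gesture, affect) into a single time-windowed representation for a downstream CPS-state model. This is a minimal illustrative sketch, not the authors' implementation; the class names, fields, and the simple confidence-weighted late-fusion step are all assumptions introduced here for illustration.

```python
# Hypothetical layout for combining per-modality features into one frame-level
# vector. Names and the fusion scheme are illustrative assumptions, not the
# system described in the paper.
from dataclasses import dataclass, field
from typing import List


@dataclass
class ModalityFeatures:
    """Features extracted from one modality over a fixed time window."""
    name: str                # e.g. "speech_prosody", "gesture", "object_pose"
    vector: List[float]      # modality-specific feature values
    confidence: float = 1.0  # extractor confidence, used here as a weight


@dataclass
class MultimodalFrame:
    """All modality features aligned to one time window of the interaction."""
    timestamp_s: float
    modalities: List[ModalityFeatures] = field(default_factory=list)

    def fused_vector(self) -> List[float]:
        """Concatenate confidence-weighted modality vectors (simple late fusion)."""
        fused: List[float] = []
        for m in self.modalities:
            fused.extend(v * m.confidence for v in m.vector)
        return fused


# Usage with made-up values for one 5-second window: prosody features,
# a pointing-gesture indicator, and a coarse affect score.
frame = MultimodalFrame(
    timestamp_s=125.0,
    modalities=[
        ModalityFeatures("speech_prosody", [0.42, 0.17, 0.88], confidence=0.9),
        ModalityFeatures("gesture_point", [1.0], confidence=0.75),
        ModalityFeatures("affect", [0.3, 0.6], confidence=0.8),
    ],
)
print(frame.fused_vector())  # input to a downstream CPS-state classifier
```

In practice, each modality channel would be populated by its own extractor and synchronized on a common clock before fusion; the simple concatenation above stands in for whatever integration strategy a given system adopts.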
Notes
1. Participants are conventionally indexed 1–3 from left to right in the video frame.
Acknowledgments
This work was partially supported by the National Science Foundation under subcontracts to Colorado State University and Brandeis University on award DRL 2019805. The views expressed are those of the authors and do not reflect the official policy or position of the U.S. Government. All errors and mistakes are, of course, the responsibility of the authors.
Copyright information
© 2024 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
VanderHoeven, H. et al. (2024). Multimodal Design for Interactive Collaborative Problem-Solving Support. In: Mori, H., Asahi, Y. (eds) Human Interface and the Management of Information. HCII 2024. Lecture Notes in Computer Science, vol 14689. Springer, Cham. https://doi.org/10.1007/978-3-031-60107-1_6
DOI: https://doi.org/10.1007/978-3-031-60107-1_6
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-60106-4
Online ISBN: 978-3-031-60107-1