research-article

A Multi-sensor Framework for Personal Presentation Analytics

Published: 05 June 2019

Abstract

Presentations have long been an effective method for delivering information to an audience. Over the past few decades, technological advancements have revolutionized the way humans deliver presentations. Conventionally, the quality of a presentation is evaluated through painstaking manual analysis by experts. Although expert feedback is effective in helping users improve their presentation skills, manual evaluation is costly and often unavailable to most individuals. In this work, we propose a novel multi-sensor self-quantification system for presentations, designed around a newly proposed assessment rubric. We present our analytics model, which combines conventional ambient sensors (i.e., static cameras and a Kinect sensor) with emerging wearable egocentric sensors (i.e., Google Glass). In addition, we perform a cross-correlation analysis of the speaker's vocal behavior and body language. The proposed framework is evaluated on a new presentation dataset, namely the NUS Multi-Sensor Presentation dataset, which consists of 51 presentations covering a diverse range of topics. To validate the efficacy of the proposed system, we conducted a series of user studies with the speakers and an interview with an English communication expert, both of which yielded positive and promising feedback.
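
The cross-correlation analysis between vocal behavior and body language mentioned above can be illustrated with a small sketch. The Python snippet below is a hypothetical, minimal example rather than the authors' implementation: the feature choices (short-time vocal energy, skeleton motion energy from a Kinect-style joint track), the lagged Pearson correlation, and the helper names (frame_energy, motion_energy, cross_correlation) are all illustrative assumptions.

```python
# Minimal sketch (assumptions only, not the paper's pipeline): correlate a
# speaker's vocal energy with body-motion energy over a range of time lags.
import numpy as np


def frame_energy(signal: np.ndarray, frame_len: int) -> np.ndarray:
    """Short-time energy of a 1-D audio signal, one value per frame."""
    n_frames = len(signal) // frame_len
    frames = signal[: n_frames * frame_len].reshape(n_frames, frame_len)
    return (frames ** 2).mean(axis=1)


def motion_energy(joints: np.ndarray) -> np.ndarray:
    """Per-frame body-motion energy from a (T, J, 3) array of joint positions,
    e.g., a skeleton track from a Kinect-style sensor."""
    velocity = np.diff(joints, axis=0)                   # (T-1, J, 3)
    return np.linalg.norm(velocity, axis=2).sum(axis=1)  # (T-1,)


def cross_correlation(vocal: np.ndarray, body: np.ndarray, max_lag: int = 30):
    """Normalized cross-correlation between the two feature series for lags
    in [-max_lag, max_lag]; returns (lags, correlations)."""
    n = min(len(vocal), len(body))
    v = (vocal[:n] - vocal[:n].mean()) / (vocal[:n].std() + 1e-8)
    b = (body[:n] - body[:n].mean()) / (body[:n].std() + 1e-8)
    lags = np.arange(-max_lag, max_lag + 1)
    corr = [np.corrcoef(v[max(0, -k): n - max(0, k)],
                        b[max(0, k): n - max(0, -k)])[0, 1] for k in lags]
    return lags, np.asarray(corr)
```

With real recordings, vocal would come from the audio track and joints from the skeleton stream; peaks in the lagged correlation would then indicate how closely the speaker's gestures follow (or precede) their speech dynamics. The paper's own analysis may use different features or a multivariate method such as canonical correlation analysis.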

Published In

ACM Transactions on Multimedia Computing, Communications, and Applications, Volume 15, Issue 2
May 2019, 375 pages
ISSN: 1551-6857
EISSN: 1551-6865
DOI: 10.1145/3339884

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 05 June 2019
Accepted: 01 December 2018
Revised: 01 May 2018
Received: 01 October 2017
Published in TOMM Volume 15, Issue 2

Author Tags

  1. Quantified self
  2. learning analytics
  3. multi-modal analysis
  4. presentations

Qualifiers

  • Research-article
  • Research
  • Refereed

