Research article
DOI: 10.1145/2971648.2971743

ROC comment: automated descriptive and subjective captioning of behavioral videos

Published: 12 September 2016

ABSTRACT

We present an automated interface, ROC Comment, for generating natural language comments on behavioral videos. We focus on the domain of public speaking, which many people consider their greatest fear. We collect a dataset of 196 public speaking videos from 49 individuals and gather 12,173 comments, generated by more than 500 independent human judges. We then train a k-Nearest-Neighbor (k-NN) based model by extracting prosodic (e.g., volume) and facial (e.g., smiles) features. Given a new video, we extract its features and select the closest comments using the k-NN model. We further filter the comments by clustering them with DBSCAN and eliminating the outliers. An evaluation of our system with 30 participants concludes that while the generated comments are helpful, there is room for improvement in further personalizing them. Our model has been deployed online, allowing individuals to upload their videos and receive open-ended and interpretative comments. Our system is available at http://tinyurl.com/roccomment.
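For readers who want a concrete picture of the retrieve-and-filter pipeline the abstract describes, the sketch below shows one possible implementation with scikit-learn. The function name, the TF-IDF embedding of comments, and all parameter values (k, eps, min_samples) are illustrative assumptions rather than the authors' implementation; only the overall structure (k-NN over prosodic/facial feature vectors, then DBSCAN to drop outlier comments) follows the paper.

    # Minimal sketch of the retrieve-and-filter pipeline from the abstract.
    # Feature names, the TF-IDF comment embedding, and all parameters are
    # illustrative assumptions, not the authors' implementation.
    import numpy as np
    from sklearn.neighbors import NearestNeighbors
    from sklearn.cluster import DBSCAN
    from sklearn.feature_extraction.text import TfidfVectorizer

    def suggest_comments(train_features, train_comments, query_features,
                         k=10, eps=0.7, min_samples=2):
        """Return comments attached to the k nearest training videos,
        keeping only those that fall inside a DBSCAN cluster.

        train_features : (n_videos, n_features) array of prosodic/facial features
        train_comments : list where train_comments[i] is the list of crowd
                         comments for video i
        query_features : (n_features,) array for the new video
        """
        # 1. k-NN over prosodic/facial feature vectors (e.g., volume, smiles).
        knn = NearestNeighbors(n_neighbors=k).fit(train_features)
        _, idx = knn.kneighbors(query_features.reshape(1, -1))
        candidates = [c for i in idx[0] for c in train_comments[i]]

        # 2. Embed the candidate comments and cluster them; DBSCAN labels
        #    noise points as -1, which are treated as outliers and dropped.
        vecs = TfidfVectorizer().fit_transform(candidates).toarray()
        labels = DBSCAN(eps=eps, min_samples=min_samples,
                        metric="cosine").fit_predict(vecs)
        return [c for c, lab in zip(candidates, labels) if lab != -1]

In this sketch, comments whose wording is unlike any cluster of agreeing comments are filtered out, which mirrors the outlier-elimination step described above.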


    • Published in

      UbiComp '16: Proceedings of the 2016 ACM International Joint Conference on Pervasive and Ubiquitous Computing
      September 2016, 1288 pages
      ISBN: 9781450344616
      DOI: 10.1145/2971648
      Copyright © 2016 ACM


      Publisher

      Association for Computing Machinery

      New York, NY, United States



      Acceptance Rates

      UbiComp '16 paper acceptance rate: 101 of 389 submissions (26%). Overall acceptance rate: 764 of 2,912 submissions (26%).
