DOI: 10.1145/3411764.3445131
CHI Conference Proceedings · research-article

RubySlippers: Supporting Content-based Voice Navigation for How-to Videos

Published: 07 May 2021

ABSTRACT

Directly manipulating the timeline, such as scrubbing for thumbnails, is the standard way of controlling how-to videos. However, when how-to videos involve physical activities, people inconveniently alternate between controlling the video and performing the tasks. Adopting a voice user interface (VUI) allows people to control the video with voice while performing the tasks with their hands. However, naively translating timeline manipulation into a VUI results in temporal referencing (e.g., “rewind 20 seconds”), which requires a different mental model for navigation and thereby limits users’ ability to peek into the content. We present RubySlippers, a system that supports efficient content-based voice navigation through keyword-based queries. Our computational pipeline automatically detects referenceable elements in the video and finds the video segmentation that minimizes the number of navigational commands needed. Our evaluation (N=12) shows that participants could perform three representative navigation tasks with fewer commands and less frustration using RubySlippers than with a conventional voice-enabled video interface.
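The paper's actual pipeline is not reproduced here, but the segmentation objective the abstract describes — choosing segment boundaries so that keyword queries resolve with as few follow-up commands as possible — can be illustrated with a hypothetical sketch. Everything below (the `best_segmentation` function, the cost model in which a keyword that spans multiple segments needs one extra disambiguating command per additional segment, and the brute-force search) is an assumption for illustration, not the authors' method:

```python
from itertools import combinations

def segment_cost(boundaries, keyword_shots, n_shots):
    """Cost of one segmentation: boundaries are sorted cut indices,
    so segment i covers shots [cuts[i], cuts[i+1])."""
    cuts = [0] + list(boundaries) + [n_shots]

    def seg_of(shot):
        for i in range(len(cuts) - 1):
            if cuts[i] <= shot < cuts[i + 1]:
                return i

    cost = 0
    for shots in keyword_shots.values():
        # Assumed cost model: a keyword spread over k segments is
        # ambiguous, needing k-1 extra commands (e.g. "next result").
        cost += len({seg_of(s) for s in shots}) - 1
    return cost

def best_segmentation(keyword_shots, n_shots, k):
    """Brute-force the placement of k-1 cuts between n_shots shots,
    minimizing the total disambiguation cost. Feasible only for small
    inputs; a real system would need a smarter search (e.g. DP)."""
    best = min(combinations(range(1, n_shots), k - 1),
               key=lambda b: segment_cost(b, keyword_shots, n_shots))
    return list(best), segment_cost(best, keyword_shots, n_shots)
```

For example, with keyword occurrences `{"butter": [0, 1], "flour": [2], "oven": [4, 5]}` over six shots and three segments, the search finds cuts that keep each keyword inside a single segment, so every keyword query lands unambiguously.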


Supplemental Material: 3411764.3445131_videofigure.mp4 (mp4, 191.1 MB)


Published in:
CHI '21: Proceedings of the 2021 CHI Conference on Human Factors in Computing Systems
May 2021, 10,862 pages
ISBN: 9781450380966
DOI: 10.1145/3411764
Copyright © 2021 ACM


Publisher: Association for Computing Machinery, New York, NY, United States


Qualifiers: research-article · refereed limited

Acceptance Rate: CHI overall, 6,199 of 26,314 submissions (24%)

