ABSTRACT
Directly manipulating the timeline, such as scrubbing for thumbnails, is the standard way of controlling how-to videos. However, when how-to videos involve physical activities, people must inconveniently alternate between controlling the video and performing the task. A voice user interface lets people control the video by voice while keeping their hands on the task. However, naively translating timeline manipulation into a voice user interface (VUI) yields temporal referencing (e.g., “rewind 20 seconds”), which requires a different mental model for navigation and thereby limits users’ ability to peek into the content. We present RubySlippers, a system that supports efficient content-based voice navigation through keyword-based queries. Our computational pipeline automatically detects referenceable elements in the video and finds the video segmentation that minimizes the number of navigational commands needed. Our evaluation (N=12) shows that participants could perform three representative navigation tasks with fewer commands and less frustration using RubySlippers than with a conventional voice-enabled video interface.
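The keyword-query idea can be illustrated with a toy sketch: match a spoken query against timestamped transcript segments and jump to the start time of the best match. The segment data and word-overlap scoring below are hypothetical placeholders; the paper's actual pipeline (referenceable-element detection and command-minimizing segmentation) is not reproduced here.

```python
# Toy sketch of content-based voice navigation via keyword queries.
# SEGMENTS is hypothetical: (start_time_seconds, transcript text) pairs.
SEGMENTS = [
    (0.0, "preheat the oven and butter the cake pans"),
    (95.0, "whisk the flour sugar and baking powder"),
    (210.0, "pour the batter and bake for thirty minutes"),
    (1480.0, "spread the caramel frosting over the cooled layers"),
]

def navigate(query: str) -> float:
    """Return the start time of the segment sharing the most words with the query."""
    q = set(query.lower().split())
    best = max(SEGMENTS, key=lambda seg: len(q & set(seg[1].split())))
    return best[0]
```

A query such as `navigate("caramel frosting")` would seek to the matching segment; a real system would use richer matching (e.g., embeddings, as in the cited word-mover's-distance work) rather than raw word overlap.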