ABSTRACT
Directly manipulating the timeline, such as scrubbing for thumbnails, is the standard way of controlling how-to videos. However, when how-to videos involve physical activities, people must inconveniently alternate between controlling the video and performing the task. A voice user interface lets people control the video by voice while keeping their hands on the task. However, naively translating timeline manipulation into a voice user interface (VUI) yields temporal referencing (e.g., “rewind 20 seconds”), which requires a different mental model for navigation and thereby limits users’ ability to peek into the content. We present RubySlippers, a system that supports efficient content-based voice navigation through keyword-based queries. Our computational pipeline automatically detects referenceable elements in the video and finds the video segmentation that minimizes the number of navigational commands needed. Our evaluation (N=12) shows that participants could perform three representative navigation tasks with fewer commands and less frustration using RubySlippers than with a conventional voice-enabled video interface.
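The keyword-query idea can be illustrated with a toy sketch: match a spoken query against timestamped transcript segments and jump to the start time of the best match. The segment data and word-overlap scoring below are hypothetical placeholders; the paper's actual pipeline (referenceable-element detection and command-minimizing segmentation) is not reproduced here.

```python
# Toy sketch of content-based voice navigation via keyword queries.
# SEGMENTS is hypothetical: (start_time_seconds, transcript text) pairs.
SEGMENTS = [
    (0.0, "preheat the oven and butter the cake pans"),
    (95.0, "whisk the flour sugar and baking powder"),
    (210.0, "pour the batter and bake for thirty minutes"),
    (1480.0, "spread the caramel frosting over the cooled layers"),
]

def navigate(query: str) -> float:
    """Return the start time of the segment sharing the most words with the query."""
    q = set(query.lower().split())
    best = max(SEGMENTS, key=lambda seg: len(q & set(seg[1].split())))
    return best[0]
```

A query such as `navigate("caramel frosting")` would seek to the matching segment; a real system would use richer matching (e.g., embeddings, as in the cited word-mover's-distance work) rather than raw word overlap.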