DOI: 10.1145/3173574.3174141

Identifying Speech Input Errors Through Audio-Only Interaction

Published: 21 April 2018

Abstract

Speech has become an increasingly common means of text input, from smartphones and smartwatches to voice-based intelligent personal assistants. However, reviewing the recognized text to identify and correct errors is challenging when no visual feedback is available. In this paper, we first quantify and describe the speech recognition errors that users are prone to miss, then investigate how to better support this error identification task by manipulating pauses between words, speech rate, and speech repetition. To achieve these goals, we conducted a series of four studies. Study 1, an in-lab study, showed that participants failed to identify over 50% of speech recognition errors when listening to audio output of the recognized text. Building on this result, Studies 2 to 4, conducted on an online crowdsourcing platform, showed that adding a pause between words improves error identification compared to no pause, that the ability to identify errors degrades at higher speech rates (300 WPM), and that repeating the speech output does not improve error identification. We derive implications for the design of audio-only speech dictation.
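The paper's stimulus-generation code is not included here, but the three review manipulations named in the abstract (inter-word pauses, speech rate, and repetition) map naturally onto standard SSML markup, which most text-to-speech engines accept. The following sketch is a minimal illustration under that assumption: the build_review_ssml helper, the 200 ms pause, and the WPM-to-percentage mapping are illustrative choices, not parameters taken from the studies.

# Minimal sketch (not the authors' code): builds SSML for the three
# audio-review manipulations described in the abstract. Assumes an
# SSML-capable TTS engine; the pause length and rate mapping are
# illustrative, not the actual stimulus parameters from the studies.
from xml.sax.saxutils import escape

def build_review_ssml(recognized_text: str,
                      pause_ms: int = 200,
                      rate_wpm: int = 200,
                      repetitions: int = 1) -> str:
    """Wrap recognized text in SSML with inter-word pauses and a speech rate."""
    # SSML <break> inserts a silent pause; <prosody rate> scales speaking
    # rate. Many engines interpret rate as a percentage of their default
    # (assumed ~200 WPM here), so 300 WPM becomes rate="150%".
    rate_pct = round(rate_wpm / 200 * 100)
    words = [escape(w) for w in recognized_text.split()]
    spaced = f'<break time="{pause_ms}ms"/>'.join(words)
    body = f'<prosody rate="{rate_pct}%">{spaced}</prosody>'
    # Repetition condition: play the same rendering again, separated by
    # a longer pause so the listener can tell the repetitions apart.
    repeated = '<break time="750ms"/>'.join([body] * repetitions)
    return f'<speak>{repeated}</speak>'

# Example: the 300 WPM condition with a 200 ms pause after every word.
print(build_review_ssml("the quick brown fox", pause_ms=200, rate_wpm=300))

At an assumed engine default of roughly 200 WPM, the 300 WPM condition corresponds to a prosody rate of about 150% and leaves only about 200 ms per spoken word (60 s / 300 words), which helps explain why single misrecognized words become easy to miss at that rate.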



    Published In

    CHI '18: Proceedings of the 2018 CHI Conference on Human Factors in Computing Systems
    April 2018
    8489 pages
    ISBN: 9781450356206
    DOI: 10.1145/3173574

    Publisher

    Association for Computing Machinery, New York, NY, United States


    Author Tags

    1. audio-only interaction
    2. error correction
    3. eyes-free use
    4. speech dictation
    5. synthesized speech
    6. text entry

    Qualifiers

    • Research-article

    Conference

    CHI '18

    Acceptance Rates

    CHI '18 Paper Acceptance Rate: 666 of 2,590 submissions, 26%
    Overall Acceptance Rate: 6,199 of 26,314 submissions, 24%


    Cited By

    • (2024) The State of Pilot Study Reporting in Crowdsourcing: A Reflection on Best Practices and Guidelines. Proceedings of the ACM on Human-Computer Interaction 8, CSCW1: 1-45. https://doi.org/10.1145/3641023
    • (2024) Uncovering Human Traits in Determining Real and Spoofed Audio: Insights from Blind and Sighted Individuals. Proceedings of the 2024 CHI Conference on Human Factors in Computing Systems, 1-14. https://doi.org/10.1145/3613904.3642817
    • (2021) Towards Augmented Reality Driven Human-City Interaction: Current Research on Mobile Headsets and Future Challenges. ACM Computing Surveys 54, 8: 1-38. https://doi.org/10.1145/3467963
    • (2021) Press-n-Paste: Copy-and-Paste Operations with Pressure-sensitive Caret Navigation for Miniaturized Surface in Mobile Augmented Reality. Proceedings of the ACM on Human-Computer Interaction 5, EICS: 1-29. https://doi.org/10.1145/3457146
    • (2021) An Extensible Cloud Based Avatar: Implementation and Evaluation. Recent Advances in Technologies for Inclusive Well-Being, 503-522. https://doi.org/10.1007/978-3-030-59608-8_27
    • (2020) VectorEntry. ACM Transactions on Accessible Computing 13, 3: 1-29. https://doi.org/10.1145/3406537
    • (2020) Auto-annotation for Voice-enabled Entertainment Systems. Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval, 1557-1560. https://doi.org/10.1145/3397271.3401241
    • (2020) Reviewing Speech Input with Audio. ACM Transactions on Accessible Computing 13, 1: 1-28. https://doi.org/10.1145/3382039
    • (2020) Mobile Voice Query Reformulation by Visually Impaired People. Proceedings of the 2020 Conference on Human Information Interaction and Retrieval, 519-522. https://doi.org/10.1145/3343413.3377950
    • (2020) Voice+Tactile: Augmenting In-vehicle Voice User Interface with Tactile Touchpad Interaction. Proceedings of the 2020 CHI Conference on Human Factors in Computing Systems, 1-12. https://doi.org/10.1145/3313831.3376863
