DOI: 10.1145/2899475.2899478
Research article

The effects of automatic speech recognition quality on human transcription latency

Published: 11 April 2016

Abstract

Transcription makes speech accessible to deaf and hard of hearing people. This conversion of speech to text is still done manually by humans, despite its high cost, because the quality of automatic speech recognition (ASR) is still too low in real-world settings. Manual transcription can take more than 5 times the original audio duration, which also introduces significant latency. Giving transcriptionists ASR output as a starting point seems like a reasonable approach to making humans more efficient and thereby reducing this cost, but the effectiveness of this approach clearly depends on the quality of the speech recognition output. At high error rates, fixing inaccurate speech recognition output may take longer than producing the transcription from scratch, and transcriptionists may not realize when the recognition output is too inaccurate to be useful. In this paper, we empirically explore how the latency of transcriptions created by participants recruited on Amazon Mechanical Turk varies with the accuracy of the speech recognition output. We present results from two studies that indicate starting with the ASR output is worse unless it is sufficiently accurate (a Word Error Rate under 30%).
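For context on the threshold above: Word Error Rate (WER) is the standard measure of ASR accuracy, computed as the word-level edit distance (substitutions, deletions, and insertions) between the recognizer's hypothesis and a reference transcript, divided by the number of reference words. The sketch below illustrates that computation in Python; the function name and example sentences are illustrative and do not come from the paper.

    def word_error_rate(reference: str, hypothesis: str) -> float:
        """WER = word-level Levenshtein distance / number of reference words."""
        ref, hyp = reference.split(), hypothesis.split()
        # dp[i][j] = minimum edits needed to turn ref[:i] into hyp[:j]
        dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
        for i in range(len(ref) + 1):
            dp[i][0] = i                                # i deletions
        for j in range(len(hyp) + 1):
            dp[0][j] = j                                # j insertions
        for i in range(1, len(ref) + 1):
            for j in range(1, len(hyp) + 1):
                sub = 0 if ref[i - 1] == hyp[j - 1] else 1
                dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                               dp[i][j - 1] + 1,        # insertion
                               dp[i - 1][j - 1] + sub)  # match or substitution
        return dp[len(ref)][len(hyp)] / max(len(ref), 1)

    # One substitution ("the" -> "a") over six reference words gives a WER
    # of about 0.17 (17%), under the ~30% threshold the paper reports.
    print(word_error_rate("the cat sat on the mat", "the cat sat on a mat"))

Under this definition, a WER of 30% means roughly one word in three needs correction, which suggests why editing ASR output at higher error rates can be slower than transcribing from scratch.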

Published In

W4A '16: Proceedings of the 13th International Web for All Conference
April 2016, 223 pages
ISBN: 9781450341387
DOI: 10.1145/2899475

Publisher

Association for Computing Machinery, New York, NY, United States

Author Tags

1. automatic speech recognition
2. captioning
3. crowd programming
4. human computation

Conference

W4A '16: International Web for All Conference
April 11-13, 2016, Montreal, Canada
Sponsors: Intuit, Google, Canvas Network, The Paciello Group (TPG), IBM

Acceptance Rates

Overall acceptance rate: 171 of 371 submissions (46%)
