
Real-time captioning by groups of non-experts

Published: 07 October 2012

ABSTRACT

Real-time captioning provides deaf and hard of hearing people immediate access to spoken language and enables participation in dialogue with others. Low latency is critical because it allows speech to be paired with relevant visual cues. Currently, the only reliable source of real-time captions is expensive stenographers, who must be recruited in advance and who are trained to use specialized keyboards. Automatic speech recognition (ASR) is less expensive and available on demand, but its low accuracy, high noise sensitivity, and need for training beforehand render it unusable in real-world situations. In this paper, we introduce a new approach in which groups of non-expert captionists (people who can hear and type) collectively caption speech in real time and on demand. We present Legion:Scribe, an end-to-end system that allows deaf people to request captions at any time. We introduce an algorithm for merging partial captions into a single output stream in real time, and a captioning interface designed to encourage coverage of the entire audio stream. Evaluation with 20 local participants and 18 crowd workers shows that non-experts can provide an effective solution for captioning, accurately covering an average of 93.2% of an audio stream with only 10 workers and an average per-word latency of 2.9 seconds. More generally, our model, in which multiple workers contribute partial inputs that are automatically merged in real time, may be extended to allow dynamic groups to surpass constituent individuals (even experts) on a variety of human performance tasks.
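
To make the idea of merging partial captions concrete, the following is a minimal sketch and not the algorithm used in Legion:Scribe, which this abstract does not specify. It assumes each worker's input arrives as timestamped words, pools all workers' words in time order, and drops a word if the same word was already emitted within a short window. The function name `merge_partial_captions`, the `(seconds, word)` input format, and the `window` parameter are all hypothetical choices made for illustration.

```python
from typing import List, Tuple

# Assumed input format: (seconds from start of audio, word typed by a worker)
Word = Tuple[float, str]


def merge_partial_captions(streams: List[List[Word]], window: float = 1.0) -> List[str]:
    """Merge per-worker partial captions into one word stream (illustrative only)."""
    # Pool every worker's words and order them by timestamp
    pooled = sorted((t, w.lower()) for stream in streams for t, w in stream)

    merged: List[Word] = []
    for t, w in pooled:
        # A word counts as a duplicate if one of the last few emitted words
        # is identical and was emitted at most `window` seconds earlier
        duplicate = any(w == mw and t - mt <= window for mt, mw in merged[-5:])
        if not duplicate:
            merged.append((t, w))
    return [w for _, w in merged]


# Three workers each caption an overlapping part of the same speech
worker_a = [(0.0, "real"), (0.4, "time"), (0.9, "captioning")]
worker_b = [(1.0, "captioning"), (1.5, "helps"), (2.0, "people")]
worker_c = [(0.5, "time"), (1.6, "helps")]

merged = merge_partial_captions([worker_a, worker_b, worker_c])
print(" ".join(merged))
```

No single worker covers the full utterance in this example, yet the merged stream does, which is the property the approach relies on. A timestamp-window heuristic like this one would mishandle typos and genuinely repeated words, so a practical system would need a more robust alignment step.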

Published in

UIST '12: Proceedings of the 25th annual ACM symposium on User interface software and technology
October 2012, 608 pages
ISBN: 9781450315807
DOI: 10.1145/2380116

Copyright © 2012 ACM

Publisher: Association for Computing Machinery, New York, NY, United States

Qualifiers: research-article

Overall Acceptance Rate: 842 of 3,967 submissions, 21%
