research-article

Real-time captioning by groups of non-experts

Authors:
Walter Lasecki, University of Rochester, Rochester, New York, USA
Christopher Miller, University of Rochester, Rochester, New York, USA
Adam Sadilek, University of Rochester, Rochester, New York, USA
Andrew Abumoussa, University of Rochester, Rochester, New York, USA
Donato Borrello, University of Rochester, Rochester, New York, USA
Raja Kushalnagar, Rochester Institute of Technology, Rochester, New York, USA
Jeffrey Bigham, University of Rochester, Rochester, New York, USA

UIST '12: Proceedings of the 25th annual ACM symposium on User interface software and technology, October 2012, Pages 23–34. https://doi.org/10.1145/2380116.2380122

Published: 07 October 2012

ABSTRACT

Real-time captioning provides deaf and hard of hearing people immediate access to spoken language and enables participation in dialogue with others. Low latency is critical because it allows speech to be paired with relevant visual cues. Currently, the only reliable sources of real-time captions are expensive stenographers who must be recruited in advance and who are trained to use specialized keyboards. Automatic speech recognition (ASR) is less expensive and available on demand, but its low accuracy, high noise sensitivity, and need for training beforehand render it unusable in real-world situations. In this paper, we introduce a new approach in which groups of non-expert captionists (people who can hear and type) collectively caption speech in real time and on demand. We present Legion:Scribe, an end-to-end system that allows deaf people to request captions at any time. We introduce an algorithm for merging partial captions into a single output stream in real time, and a captioning interface designed to encourage coverage of the entire audio stream. Evaluation with 20 local participants and 18 crowd workers shows that non-experts can provide an effective solution for captioning, accurately covering an average of 93.2% of an audio stream with only 10 workers and an average per-word latency of 2.9 seconds. More generally, our model, in which multiple workers contribute partial inputs that are automatically merged in real time, may be extended to allow dynamic groups to surpass constituent individuals (even experts) on a variety of human performance tasks.
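
The idea of merging partial captions from several workers can be illustrated with a minimal sketch. This is not the algorithm from the paper, which the abstract does not describe in detail; it is a simple timestamp-based deduplication chosen for illustration. The names TimedWord and merge_partial_captions, the per-word timestamps, and the 2-second window are all assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TimedWord:
    """One word typed by one worker (hypothetical record layout)."""
    worker: str
    time: float  # seconds from the start of the audio when the word was typed
    text: str


def merge_partial_captions(words, window=2.0):
    """Merge workers' partial captions into one ordered word list.

    Words are ordered by timestamp. A word is dropped as a duplicate when the
    same text (case-insensitive) was already emitted within `window` seconds.
    The window size is an illustrative assumption, not a value from the paper.
    """
    merged = []
    last_emitted = {}  # normalized text -> timestamp of its latest emission
    for w in sorted(words, key=lambda item: item.time):
        key = w.text.lower()
        previous = last_emitted.get(key)
        if previous is not None and w.time - previous <= window:
            continue  # another worker already covered this word
        last_emitted[key] = w.time
        merged.append(w.text)
    return merged


# Three workers each type an overlapping fragment of the same sentence.
partials = [
    TimedWord("a", 0.5, "real-time"),
    TimedWord("a", 1.0, "captioning"),
    TimedWord("b", 1.2, "captioning"),
    TimedWord("b", 1.8, "helps"),
    TimedWord("c", 2.1, "helps"),
    TimedWord("c", 2.6, "everyone"),
]
print(" ".join(merge_partial_captions(partials)))
# prints: real-time captioning helps everyone
```

A fixed time window is the simplest possible choice and would mishandle typos, reordered words, and workers with different typing delays; it is meant only to show how overlapping partial inputs can collapse into a single stream.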

Index Terms

Real-time captioning by groups of non-experts
1. Human-centered computing
  1. Human computer interaction (HCI)

Published in
UIST '12: Proceedings of the 25th annual ACM symposium on User interface software and technology
October 2012
608 pages
ISBN: 9781450315807
DOI: 10.1145/2380116
General Chair: Rob Miller, MIT CSAIL, USA
Program Chairs: Hrvoje Benko, Microsoft Research, USA; Celine Latulipe, University of North Carolina at Charlotte, USA
Copyright © 2012 ACM
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
captioning
crowdsourcing
deaf
hard of hearing
real-time
text alignment
transcription
Qualifiers
- research-article
Conference

Acceptance Rates
Overall Acceptance Rate: 842 of 3,967 submissions, 21%