ABSTRACT
Real-time captioning gives deaf and hard of hearing people immediate access to spoken language and enables them to participate in dialogue with others. Low latency is critical because it allows speech to be paired with relevant visual cues. Currently, the only reliable source of real-time captions is expensive stenographers, who must be recruited in advance and who are trained to use specialized keyboards. Automatic speech recognition (ASR) is less expensive and available on demand, but its low accuracy, high sensitivity to noise, and need for prior training render it unusable in real-world situations. In this paper, we introduce a new approach in which groups of non-expert captionists (people who can hear and type) collectively caption speech in real time and on demand. We present Legion:Scribe, an end-to-end system that allows deaf people to request captions at any time. We introduce an algorithm for merging partial captions into a single output stream in real time, and a captioning interface designed to encourage coverage of the entire audio stream. An evaluation with 20 local participants and 18 crowd workers shows that non-experts can provide an effective solution for captioning, accurately covering an average of 93.2% of an audio stream with only 10 workers and an average per-word latency of 2.9 seconds. More generally, our model, in which multiple workers contribute partial inputs that are automatically merged in real time, may be extended to allow dynamic groups to surpass their constituent individuals (even experts) on a variety of human performance tasks.
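To make the idea of merging partial captions concrete, the following is a minimal illustrative sketch, not the algorithm used in Legion:Scribe, whose details the abstract does not give. It assumes each worker's words arrive with a timestamp, orders all words by time, and drops a word when the same word was already emitted within a short window. The names `TimedWord` and `merge_partial_captions` and the 2-second `window` are hypothetical choices made only for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TimedWord:
    """One typed word (hypothetical structure for this sketch)."""
    worker: str   # identifier of the captionist who typed the word
    text: str     # the word as typed
    time: float   # seconds from the start of the audio


def _normalize(text):
    # Compare words case-insensitively and ignore edge punctuation.
    return text.lower().strip(".,!?")


def merge_partial_captions(streams, window=2.0):
    """Merge per-worker partial captions into a single word list.

    Assumption: words are ordered by timestamp, and a word is treated as a
    duplicate if the same normalized word was kept within `window` seconds.
    """
    all_words = sorted(
        (word for stream in streams for word in stream),
        key=lambda w: w.time,
    )
    merged = []
    for word in all_words:
        key = _normalize(word.text)
        duplicate = any(
            _normalize(kept.text) == key and word.time - kept.time <= window
            for kept in merged
        )
        if not duplicate:
            merged.append(word)
    return [w.text for w in merged]


if __name__ == "__main__":
    a = [TimedWord("A", "the", 0.0), TimedWord("A", "quick", 0.5),
         TimedWord("A", "brown", 1.0)]
    b = [TimedWord("B", "brown", 1.2), TimedWord("B", "fox", 1.6),
         TimedWord("B", "jumps", 2.0)]
    print(" ".join(merge_partial_captions([a, b])))
```

A timestamp-only merge like this breaks down when workers type with different delays or make spelling errors; handling those cases requires aligning the word sequences themselves rather than relying on typing times alone.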
Index Terms
- Real-time captioning by groups of non-experts