Abstract
Proper capitalization in text is a useful, often mandatory characteristic. Many text processing techniques rely on proper capitalization, and people can read mixed-case text more easily. Proper capitalization, however, is absent from a number of text sources, including automatic speech recognition output and closed caption text. The value of these text sources can be greatly enhanced by restoring proper capitalization. We describe and evaluate a series of techniques that can recover proper capitalization. Our final system is able to recover more than 88% of the capitalized words with better than 90% accuracy.
© 2002 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Brown, E.W., Coden, A.R. (2002). Capitalization Recovery for Text. In: Coden, A.R., Brown, E.W., Srinivasan, S. (eds) Information Retrieval Techniques for Speech Applications. IRTSA 2001. Lecture Notes in Computer Science, vol 2273. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-45637-6_2
DOI: https://doi.org/10.1007/3-540-45637-6_2
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-43156-5
Online ISBN: 978-3-540-45637-7
eBook Packages: Springer Book Archive