Capitalization Recovery for Text

Brown, Eric W.; Coden, Anni R.

doi:10.1007/3-540-45637-6_2

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 2273)

Included in the following conference series:

Workshop on Information Retrieval Techniques for Speech Applications

Abstract

Proper capitalization in text is a useful, often mandatory characteristic. Many text processing techniques rely on proper capitalization, and people can more easily read mixed case text. Proper capitalization, however, is often absent in a number of text sources, including automatic speech recognition output and closed caption text. The value of these text sources can be greatly enhanced with proper capitalization. We describe and evaluate a series of techniques that can recover proper capitalization. Our final system is able to recover more than 88% of the capitalized words with better than 90% accuracy.

Author information

Authors and Affiliations

IBM T.J. Watson Research Center, Yorktown Heights, NY 10598
Eric W. Brown & Anni R. Coden

Editor information

Editors and Affiliations

IBM T.J. Watson Research Center, P.O. Box 704, Yorktown Heights, NY 10598, USA
Anni R. Coden & Eric W. Brown
IBM Almaden Research Center, 650 Harry Road, San Jose, CA 95120, USA
Savitha Srinivasan

About this paper

Cite this paper

Brown, E.W., Coden, A.R. (2002). Capitalization Recovery for Text. In: Coden, A.R., Brown, E.W., Srinivasan, S. (eds) Information Retrieval Techniques for Speech Applications. IRTSA 2001. Lecture Notes in Computer Science, vol 2273. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-45637-6_2

DOI: https://doi.org/10.1007/3-540-45637-6_2
Published: 22 January 2002
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-43156-5
Online ISBN: 978-3-540-45637-7
eBook Packages: Springer Book Archive
