Capitalization Recovery for Text

  • Conference paper
Information Retrieval Techniques for Speech Applications (IRTSA 2001)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 2273)

Abstract

Proper capitalization in text is a useful, often mandatory characteristic. Many text processing techniques rely on proper capitalization, and people can more easily read mixed-case text. Proper capitalization, however, is often absent from text sources such as automatic speech recognition output and closed caption text. The value of these text sources can be greatly enhanced with proper capitalization. We describe and evaluate a series of techniques that can recover proper capitalization. Our final system is able to recover more than 88% of the capitalized words with better than 90% accuracy.
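
The closing figures of the abstract can be read as a recall-style measure (the share of truly capitalized words that are recovered) and a precision-style measure (the share of words the system capitalizes that are correct). That reading is an assumption, since the abstract does not define its metrics. The sketch below shows one way such scores could be computed on aligned token sequences; the function name `capitalization_scores` and the example sentences are hypothetical and are not taken from the paper.

```python
def capitalization_scores(reference, hypothesis):
    """Return (recall, precision) over capitalized words.

    Assumption: both arguments are aligned token lists that differ
    only in letter case. This is an illustrative scoring scheme,
    not necessarily the one used in the paper.
    """
    if len(reference) != len(hypothesis):
        raise ValueError("token sequences must be aligned")

    def is_capitalized(token):
        # A token counts as capitalized if its first character is upper case.
        return token[:1].isupper()

    ref_caps = sum(is_capitalized(r) for r in reference)
    hyp_caps = sum(is_capitalized(h) for h in hypothesis)
    # A recovery is correct only if the full cased form matches the reference.
    correct = sum(
        is_capitalized(r) and h == r
        for r, h in zip(reference, hypothesis)
    )
    recall = correct / ref_caps if ref_caps else 0.0
    precision = correct / hyp_caps if hyp_caps else 0.0
    return recall, precision


# Hypothetical example: the reference has five capitalized words; the
# system output recovers three of them and wrongly capitalizes "Team".
reference = "The IBM team met in New York on Monday".split()
hypothesis = "The ibm Team met in New York on monday".split()
recall, precision = capitalization_scores(reference, hypothesis)
print(f"recall={recall:.2f} precision={precision:.2f}")
```

Under this scheme a word such as "Ibm" produced for "IBM" counts against precision without contributing to recall, because the full cased form must match.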

© 2002 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Brown, E.W., Coden, A.R. (2002). Capitalization Recovery for Text. In: Coden, A.R., Brown, E.W., Srinivasan, S. (eds) Information Retrieval Techniques for Speech Applications. IRTSA 2001. Lecture Notes in Computer Science, vol 2273. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-45637-6_2

  • DOI: https://doi.org/10.1007/3-540-45637-6_2

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-43156-5

  • Online ISBN: 978-3-540-45637-7

  • eBook Packages: Springer Book Archive
