DOI: 10.1145/345508.345564

Document centered approach to text normalization

Published: 01 July 2000

ABSTRACT

In this paper we present an approach to tackle three important problems of text normalization: sentence boundary disambiguation, disambiguation of capitalized words when they are used in positions where capitalization is expected, and identification of abbreviations. The main feature of our approach is that it uses a minimum of pre-built resources, instead dynamically inferring disambiguation clues from the entire document itself. This makes it domain independent, closely targeted to each individual document and portable to other languages. We thoroughly evaluated this approach on several corpora and it showed high accuracy.
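
As a rough illustration of what inferring disambiguation clues from the document itself could look like, the sketch below handles only the capitalized-word problem. It is not the paper's algorithm, which the abstract does not specify: the function name `classify_ambiguous_capitals`, the majority-vote rule, and the naive assumption that every `.`, `!` or `?` ends a sentence are all hypothetical simplifications. The idea shown is that a sentence-initial capitalized word is judged by how the same word is written in unambiguous, mid-sentence positions elsewhere in the same text, with no pre-built lexicon.

```python
import re
from collections import Counter

# Words and sentence-final punctuation only; a deliberately crude tokenizer.
TOKEN = re.compile(r"[A-Za-z]+|[.!?]")

def classify_ambiguous_capitals(text):
    """Label sentence-initial capitalized words as 'proper', 'common' or
    'unknown', using only evidence found elsewhere in the same document."""
    lower_seen = Counter()  # occurrences written in lowercase
    upper_mid = Counter()   # occurrences capitalized in mid-sentence
    ambiguous = []          # capitalized words in sentence-initial position
    sentence_start = True

    for tok in TOKEN.findall(text):
        if tok in ".!?":
            # Simplifying assumption: every such mark ends a sentence.
            sentence_start = True
            continue
        key = tok.lower()
        if tok[0].isupper():
            if sentence_start:
                ambiguous.append(tok)  # capitalization is expected here
            else:
                upper_mid[key] += 1    # unambiguous evidence for a name
        else:
            lower_seen[key] += 1       # unambiguous evidence for a common word
        sentence_start = False

    labels = {}
    for tok in ambiguous:
        key = tok.lower()
        if upper_mid[key] > lower_seen[key]:
            labels[tok] = "proper"
        elif lower_seen[key] > upper_mid[key]:
            labels[tok] = "common"
        else:
            labels[tok] = "unknown"    # no usable evidence in this document
    return labels

sample = ("Apple rose today. Shares of Apple climbed. "
          "Rain fell and the rain stopped. Analysts cheered.")
result = classify_ambiguous_capitals(sample)
```

In this sample, "Apple" is supported by a capitalized mid-sentence occurrence and "Rain" by a lowercase one, while "Analysts" has no other occurrence and stays undecided. A real system would also need to resolve sentence boundaries and abbreviations, which this sketch takes for granted.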

Published in

SIGIR '00: Proceedings of the 23rd annual international ACM SIGIR conference on Research and development in information retrieval
July 2000
396 pages
ISBN: 1581132263
DOI: 10.1145/345508

Copyright © 2000 ACM


Publisher: Association for Computing Machinery, New York, NY, United States

Acceptance Rates

Overall acceptance rate: 792 of 3,983 submissions, 20%
