ABSTRACT
In this paper we present an approach to three important problems of text normalization: sentence boundary disambiguation, disambiguation of capitalized words in positions where capitalization is expected, and identification of abbreviations. The main feature of our approach is that it uses a minimum of pre-built resources and instead infers disambiguation clues dynamically from the entire document itself. This makes it domain-independent, closely targeted to each individual document, and portable to other languages. We thoroughly evaluated this approach on several corpora, where it showed high accuracy.
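The idea of inferring clues from the document itself can be illustrated with a minimal sketch for the capitalized-word problem. It is not the paper's implementation: the function names, the regular-expression tokenizer, and the majority-vote rule are assumptions made for illustration only. The sketch records how each word is cased in mid-sentence positions, where capitalization is informative, and uses those counts to classify the same word when it appears at the start of a sentence.

```python
import re
from collections import Counter

SENTENCE_END = {".", "!", "?"}

def collect_document_evidence(text):
    """Count how each word is cased in unambiguous (mid-sentence) positions.

    Hypothetical helper: a naive tokenizer that treats every '.', '!' or '?'
    as a sentence break, so it ignores the abbreviation problem entirely.
    """
    lower_seen, capital_seen = Counter(), Counter()
    tokens = re.findall(r"[A-Za-z]+|[.!?]", text)
    prev = "."  # treat the document start as following a sentence break
    for tok in tokens:
        if tok in SENTENCE_END:
            prev = tok
            continue
        if prev not in SENTENCE_END:  # mid-sentence: casing is informative
            if tok[0].isupper():
                capital_seen[tok.lower()] += 1
            else:
                lower_seen[tok.lower()] += 1
        prev = tok
    return lower_seen, capital_seen

def classify_ambiguous(word, lower_seen, capital_seen):
    """Classify a sentence-initial word using evidence from the same document."""
    key = word.lower()
    if capital_seen[key] > lower_seen[key]:
        return "proper"
    if lower_seen[key] > capital_seen[key]:
        return "common"
    return "unknown"  # no mid-sentence evidence, or a tie

text = "Rain fell on Monday. Brown said the rain would stop. Later Brown left."
lower_seen, capital_seen = collect_document_evidence(text)
for word in ("Rain", "Brown", "Later"):
    print(word, classify_ambiguous(word, lower_seen, capital_seen))
```

Because the evidence comes only from the document being processed, the sketch needs no name list or lexicon, which is the property the abstract credits for domain independence. Words with no mid-sentence occurrence stay unresolved here.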