Document centered approach to text normalization

Author:
Andrei Mikheev

LTG, University of Edinburgh, 2 Buccleuch Place, Edinburgh EH8 9LW, UK


SIGIR '00: Proceedings of the 23rd annual international ACM SIGIR conference on Research and development in information retrieval, July 2000, Pages 136–143
https://doi.org/10.1145/345508.345564

Published: 01 July 2000


ABSTRACT

In this paper we present an approach to three important problems of text normalization: sentence boundary disambiguation, disambiguation of capitalized words when they occur in positions where capitalization is expected, and identification of abbreviations. The main feature of our approach is that it uses a minimum of pre-built resources and instead dynamically infers disambiguation clues from the entire document itself. This makes it domain-independent, closely targeted to each individual document, and portable to other languages. We thoroughly evaluated this approach on several corpora, and it showed high accuracy.
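The document-centered idea for capitalized words can be illustrated with a minimal sketch. This is not the paper's algorithm: the function name, the three labels, the regular-expression tokenizer, and the rule that every `.`, `!` or `?` ends a sentence are all assumptions made for illustration. The sketch gathers evidence from unambiguous positions in the same document (mid-sentence capitalized uses and lowercase uses) and applies it to words in ambiguous, sentence-initial positions.

```python
import re
from collections import Counter

# Hypothetical tokenizer: alphabetic words plus sentence-final punctuation.
TOKEN_RE = re.compile(r"[A-Za-z]+|[.!?]")


def classify_sentence_initial_words(text):
    """Label sentence-initial capitalized words as 'proper', 'common' or
    'unknown', using only evidence found elsewhere in the same document."""
    tokens = TOKEN_RE.findall(text)
    capitalized_mid = Counter()  # capitalized in mid-sentence (unambiguous) positions
    lowercased = Counter()       # seen in lowercase anywhere in the document
    ambiguous = []               # capitalized words at the start of a sentence

    at_sentence_start = True
    for tok in tokens:
        if tok in ".!?":
            # Simplifying assumption: every such mark ends a sentence.
            at_sentence_start = True
            continue
        if tok[0].isupper():
            if at_sentence_start:
                ambiguous.append(tok)
            else:
                capitalized_mid[tok.lower()] += 1
        else:
            lowercased[tok.lower()] += 1
        at_sentence_start = False

    labels = {}
    for word in ambiguous:
        key = word.lower()
        if capitalized_mid[key] and not lowercased[key]:
            labels[word] = "proper"   # only ever capitalized where it matters
        elif lowercased[key] and not capitalized_mid[key]:
            labels[word] = "common"   # also occurs in lowercase
        else:
            labels[word] = "unknown"  # no evidence, or conflicting evidence
    return labels


if __name__ == "__main__":
    sample = (
        "Rivers flooded the town. The mayor said Rivers was calm. "
        "Flooded streets were closed. Heavy rain flooded the roads."
    )
    print(classify_sentence_initial_words(sample))
```

In the sample text, "Rivers" is resolved as a proper name because it also appears capitalized mid-sentence, while "Flooded" is resolved as a common word because it also appears in lowercase. A real system would also need to handle abbreviations, since a period after an abbreviation does not necessarily end a sentence; the sketch ignores that interaction.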




Published in
SIGIR '00: Proceedings of the 23rd annual international ACM SIGIR conference on Research and development in information retrieval
July 2000
396 pages
ISBN:1581132263
DOI:10.1145/345508
Chairmen:
Emmanuel Yannakoudakis (Athens Univ. of Economics and Business, Greece),
Nicholas J. Belkin (Rutgers Univ.),
Mun-Kew Leong (Kent Ridge Digital Labs),
Peter Ingwersen (Royal School of Library and Information Science)
Copyright © 2000 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher
Association for Computing Machinery
New York, NY, United States
Acceptance Rates
Overall Acceptance Rate: 792 of 3,983 submissions, 20%