Article

Exploiting dictionaries in named entity extraction: combining semi-Markov extraction processes and data integration methods

Authors:
William W. Cohen (Carnegie Mellon University, Pittsburgh, PA)
Sunita Sarawagi (IIT Bombay, Mumbai, India)

KDD '04: Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining, August 2004, Pages 89–98. https://doi.org/10.1145/1014052.1014065

Published: 22 August 2004

ABSTRACT

We consider the problem of improving named entity recognition (NER) systems by using external dictionaries; more specifically, the problem of extending state-of-the-art NER systems by incorporating information about the similarity of extracted entities to entities in an external dictionary. This is difficult because most high-performance named entity recognition systems operate by sequentially classifying words as to whether or not they participate in an entity name; however, the most useful similarity measures score entire candidate names. To correct this mismatch we formalize a semi-Markov extraction process, which is based on sequentially classifying segments of several adjacent words, rather than single words. In addition to allowing a natural way of coupling high-performance NER methods and high-performance similarity functions, this formalism also allows the direct use of other useful entity-level features, and provides a more natural formulation of the NER problem than sequential word classification. Experiments in multiple domains show that the new model can substantially improve extraction performance over previous methods for using external dictionaries in NER.
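
The segment-level idea in the abstract can be illustrated with a small sketch. It is not the authors' implementation: the weights are hand-set rather than learned, the similarity measure is a plain token-level Jaccard score used as a stand-in, and the names `DICTIONARY`, `MAX_LEN`, `segment_score`, and `segment` are hypothetical. For brevity each segment is scored independently of the previous segment's label, so the label-transition dependence of a full semi-Markov model is omitted. What the sketch does show is the central point: the dictionary feature is computed on a whole candidate segment, and a dynamic program searches over segmentations instead of over per-word labels.

```python
from typing import List, Tuple

# Hypothetical external dictionary of known entity names (assumption).
DICTIONARY = ["William Cohen", "Sunita Sarawagi", "Carnegie Mellon University"]
MAX_LEN = 4      # assumed upper bound on segment length, in words
SIM_WEIGHT = 2.0  # hand-set weight on the similarity feature (not learned)
ENTITY_BIAS = 1.0  # hand-set penalty for proposing an entity segment


def jaccard(a: List[str], b: List[str]) -> float:
    """Token-level Jaccard similarity, a stand-in similarity measure."""
    sa, sb = {w.lower() for w in a}, {w.lower() for w in b}
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0


def dictionary_similarity(words: List[str]) -> float:
    """Best similarity between a whole candidate segment and any entry."""
    return max(jaccard(words, entry.split()) for entry in DICTIONARY)


def segment_score(words: List[str], label: str) -> float:
    """Score one labeled segment; only ENTITY segments use the dictionary."""
    if label == "ENTITY":
        return SIM_WEIGHT * dictionary_similarity(words) - ENTITY_BIAS
    return 0.0  # "O" (non-entity) segments are single words with score 0


def segment(words: List[str]) -> List[Tuple[int, int, str]]:
    """Return the best-scoring segmentation as (start, end, label) triples."""
    n = len(words)
    best = [float("-inf")] * (n + 1)  # best[i]: best score for words[:i]
    back = [None] * (n + 1)           # back[i]: (start, label) of last segment
    best[0] = 0.0
    for end in range(1, n + 1):
        for length in range(1, min(MAX_LEN, end) + 1):
            start = end - length
            for label in ("O", "ENTITY"):
                if label == "O" and length > 1:
                    continue  # non-entity segments are restricted to one word
                score = best[start] + segment_score(words[start:end], label)
                if score > best[end]:
                    best[end] = score
                    back[end] = (start, label)
    # Follow back-pointers from the end of the sentence to recover segments.
    segments = []
    end = n
    while end > 0:
        start, label = back[end]
        segments.append((start, end, label))
        end = start
    segments.reverse()
    return segments
```

With this toy dictionary, a sentence such as "Yesterday William Cohen visited Carnegie Mellon University in Pittsburgh" is segmented so that "William Cohen" and "Carnegie Mellon University" each form a single ENTITY segment. A word-by-word classifier would only see partial matches such as "William" or "Mellon", which is the mismatch the abstract describes.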

Index Terms

1. Computing methodologies
  1. Machine learning
2. Information systems
  1. Information retrieval
    1. Document representation
      1. Dictionaries

Published in
KDD '04: Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining
August 2004
874 pages
ISBN:1581138881
DOI:10.1145/1014052
General Chairs: Won Kim (Cyber Database Solutions), Ronny Kohavi (Amazon.com)
Program Chairs: Johannes Gehrke (Cornell University), William DuMouchel (AT&T Labs Research)
Copyright © 2004 ACM
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
data integration
information extraction
learning
named entity recognition
sequential learning
Qualifiers
- Article
Conference

Acceptance Rates
Overall Acceptance Rate: 1,133 of 8,635 submissions, 13%