DOI: 10.1145/1031171.1031194
Article

Acquisition of categorized named entities for web search

Published: 13 November 2004

Abstract

The recognition of names and their associated categories within unstructured text traditionally relies on semantic lexicons and gazetteers. The amount of effort required to assemble large lexicons confines the recognition to either a limited domain (e.g., medical imaging), or a small set of pre-defined, broader categories of interest (e.g., persons, countries, organizations, products). This constitutes a serious limitation in an information seeking context. In this case, the categories of potential interest to users are more diverse (universities, agencies, retailers, celebrities), often refined (e.g., SLR digital cameras, programming languages, multinational oil companies), and usually overlapping (e.g., the same entity may be concurrently a brand name, a technology company, and an industry leader). We present a lightly supervised method for acquiring named entities in arbitrary categories. The method applies lightweight lexico-syntactic extraction patterns to the unstructured text of Web documents. The method is a departure from traditional approaches to named entity recognition in that: 1) it does not require any start-up seed names or training; 2) it does not encode any domain knowledge in its extraction patterns; 3) it is only lightly supervised, and data-driven; 4) it does not impose any a-priori restriction on the categories of extracted names. We illustrate applications of the method in Web search, and describe experiments on 500 million Web documents and news articles.
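As a rough illustration of what a lightweight lexico-syntactic extraction pattern can look like, the sketch below applies a single Hearst-style pattern ("X and other Ys") to a few sentences and counts the resulting (name, category) pairs. The regular expressions, the function names, and the sample sentences are all assumptions made for this example; they are not the patterns or data used in the paper. In particular, approximating a name as a run of capitalized words and a category as a short phrase ending in a plural-looking word is a deliberate simplification.

```python
import re
from collections import Counter, defaultdict

# Illustrative assumptions, not the paper's patterns:
# a name is a run of capitalized tokens; a category is the shortest
# run of up to three words that ends in a plural-looking word.
NAME = r"\b(?P<name>[A-Z][\w&-]*(?:\s+[A-Z][\w&-]*)*)"
CATEGORY = r"(?P<category>(?:[A-Za-z]+\s+){0,2}?[a-z]+s)\b"
PATTERN = re.compile(NAME + r",?\s+(?:and|or)\s+other\s+" + CATEGORY)


def extract_pairs(sentences):
    """Count (name, category) pairs matched across all sentences."""
    counts = Counter()
    for sentence in sentences:
        for match in PATTERN.finditer(sentence):
            # Lowercase the category so case variants are counted together.
            pair = (match.group("name"), match.group("category").lower())
            counts[pair] += 1
    return counts


def categories_by_name(counts):
    """Group categories per name; one name may fall into several categories."""
    grouped = defaultdict(set)
    for name, category in counts:
        grouped[name].add(category)
    return dict(grouped)


# Hypothetical sample sentences, standing in for Web text.
sentences = [
    "France and other European countries attract many visitors.",
    "Visitors also like France and other European countries.",
    "Canon and other camera makers announced new models.",
    "Analysts follow Canon and other technology companies.",
]
counts = extract_pairs(sentences)
grouped = categories_by_name(counts)
```

Here `grouped["Canon"]` holds two categories, which mirrors the overlapping-category case mentioned in the abstract. The pattern needs no seed names and no fixed category list, but it is only as good as its surface heuristics: a capitalized sentence-initial word directly before "and other" would be picked up as a name, and a category whose modifier already ends in "s" would be cut short.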

References

[1] E. Agichtein and L. Gravano. Snowball: Extracting relations from large plaintext collections. In Proceedings of the 5th ACM International Conference on Digital Libraries (DL-00), San Antonio, Texas, 2000.
[2] T. Brants. TnT - a statistical part of speech tagger. In Proceedings of the 6th Conference on Applied Natural Language Processing (ANLP-00), pages 224-231, Seattle, Washington, 2000.
[3] E. Brill and P. Resnik. A transformation-based approach to prepositional phrase attachment disambiguation. In Proceedings of the 15th International Conference on Computational Linguistics (COLING-94), pages 1198-1204, Kyoto, Japan, 1994.
[4] S. Brin. Extracting patterns and relations from the World Wide Web. In Proceedings of the 6th International Conference on Extending Database Technology (EDBT-98), Workshop on the Web and Databases, pages 172-183, Valencia, Spain, 1998.
[5] S. Caraballo. Automatic construction of a hypernym-labeled noun hierarchy from text. In Proceedings of the 37th Annual Meeting of the Association for Computational Linguistics (ACL-99), pages 120-126, College Park, Maryland, 1999.
[6] N. Chinchor and E. Marsh. MUC-7 information extraction task definition, version 5.1. In Proceedings of the 7th Message Understanding Conference (MUC-7), 1998.
[7] M. Collins and Y. Singer. Unsupervised models for named entity classification. In Proceedings of the 1999 Conference on Empirical Methods in Natural Language Processing and Very Large Corpora (EMNLP/VLC-99), pages 189-196, College Park, Maryland, 1999.
[8] M. Craven, D. DiPasquo, D. Freitag, A. McCallum, T. Mitchell, K. Nigam, and S. Slattery. Learning to construct knowledge bases from the World Wide Web. Artificial Intelligence, 118:69-113, 2000.
[9] S. Cucerzan and D. Yarowsky. Language independent named entity recognition combining morphological and contextual evidence. In Proceedings of the 1999 Conference on Empirical Methods in Natural Language Processing and Very Large Corpora (EMNLP/VLC-99), pages 90-99, College Park, Maryland, 1999.
[10] O. Etzioni, M. Cafarella, D. Downey, S. Kok, A.-M. Popescu, T. Shaked, S. Soderland, D. Weld, and A. Yates. Web-scale information extraction in KnowItAll. In Proceedings of the 13th World Wide Web Conference (WWW-04), New York, 2004.
[11] C. Fellbaum, editor. WordNet: An Electronic Lexical Database and Some of its Applications. MIT Press, 1998.
[12] S. Flank. A layered approach to NLP-based information retrieval. In Proceedings of the 17th International Conference on Computational Linguistics and the 36th Annual Meeting of the Association for Computational Linguistics (COLING-ACL-98), pages 397-403, Montreal, Quebec, 1998.
[13] S. Green. Automatically generating hypertext in newspaper articles by computing semantic relatedness. In Proceedings of the 2nd Conference on Computational Language Learning (CoNLL-98), pages 101-110, Sydney, Australia, 1998.
[14] M. Hearst. Automatic acquisition of hyponyms from large text corpora. In Proceedings of the 14th International Conference on Computational Linguistics (COLING-92), pages 539-545, Nantes, France, 1992.
[15] B. Jansen. The effect of query complexity on Web searching results. Information Research, 6(1), October 2000.
[16] G. Krupka and K. Hausman. IsoQuest, Inc.: Description of the NetOwl extractor system as used for MUC-7. In Proceedings of the 7th Message Understanding Conference (MUC-7), Fairfax, Virginia, 1998.
[17] M. Marcus, B. Santorini, and M. Marcinkiewicz. Building a large annotated corpus of English: The Penn Treebank. Computational Linguistics, 19(2):313-330, June 1993.
[18] K. McCarthy and W. Lehnert. Using decision trees for coreference resolution. In Proceedings of the 14th International Joint Conference on Artificial Intelligence (IJCAI-95), pages 1050-1055, Montreal, Quebec, 1995.
[19] A. Mikheev, M. Moens, and C. Grover. Named entity recognition without gazetteers. In Proceedings of the 10th Conference of the European Chapter of the Association for Computational Linguistics (EACL-99), pages 1-8, Bergen, Norway, 1999.
[20] P. Pantel and D. Ravichandran. Automatically labeling semantic classes. In Proceedings of the 2004 Human Language Technology Conference (HLT-NAACL-04), pages 321-328, Boston, Massachusetts, 2004.
[21] W. Phillips and E. Riloff. Exploiting strong syntactic heuristics and co-training to learn semantic lexicons. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP-02), pages 125-132, Philadelphia, Pennsylvania, 2002.
[22] D. Ravichandran and E. Hovy. Learning surface text patterns for a question answering system. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL-02), Philadelphia, Pennsylvania, 2002.
[23] E. Riloff and R. Jones. Learning dictionaries for information extraction by multi-level bootstrapping. In Proceedings of the 16th National Conference on Artificial Intelligence (AAAI-99), pages 474-479, Orlando, Florida, 1999.
[24] S. Russell and P. Norvig. Artificial Intelligence: A Modern Approach. Prentice Hall, 2nd edition, 2003.
[25] K. Shinzato and K. Torisawa. Acquiring hyponymy relations from web documents. In Proceedings of the 2004 Human Language Technology Conference (HLT-NAACL-04), pages 73-80, Boston, Massachusetts, 2004.
[26] M. Stevenson and R. Gaizauskas. Using corpus-derived name lists for named entity recognition. In Proceedings of the 6th Conference on Applied Natural Language Processing (ANLP-00), Seattle, Washington, 2000.
[27] N. Stokes and J. Carthy. First story detection using a composite document representation. In Proceedings of the 1st International Conference on Human Language Technology Research (HLT-01), San Diego, California, 2001.
[28] E. Voorhees. Using WordNet for text retrieval. In WordNet, An Electronic Lexical Database, pages 285-303. The MIT Press, 1998.

Reviews

Alexander Gelbukh

The first thing expected from a program intelligently dealing with natural language text is the ability to relate some words with others, for example, to know that France is a European country, so that a user searching on the Internet for "tours to Europe" would get information on visiting France. Dictionaries (ontologies) providing such information are the driving force of modern information retrieval, language processing, and electronic commerce research. Since the manual compilation of such ontologies is too costly, many researchers have suggested methods for their automatic creation, by parsing texts. Pasca, of Google, demonstrates how to extract such an ontology, with a surprisingly simple, robust, and extensible unsupervised algorithm, looking for expressions like "France and other European countries" in a huge collection of Web pages; the system automatically learns new instances of such patterns. Though the idea is not novel [1], its evaluation on such a large corpus provides yet another example of exploiting the enormous redundancy of the Web to extract very specific knowledge, on very broad subjects, with very basic, simplistic algorithms. The paper discusses numerous applications of the extracted ontology in information retrieval and computational lexicography (which is also not new, but tutorial), and possible generalizations of the method to extract other types of data. The paper is motivating for information retrieval and language engineering specialists, and, due to its good introduction, clear style, numerous examples, and simple algorithms, it will be encouraging for novices, and for developers dealing with natural language data.

Online Computing Reviews Service


Published In

CIKM '04: Proceedings of the thirteenth ACM international conference on Information and knowledge management
November 2004
678 pages
ISBN:1581138741
DOI:10.1145/1031171
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. information integration
  2. lightweight text processing
  3. named entity extraction
  4. related names and categories
  5. web information retrieval

Qualifiers

  • Article

Conference

CIKM04: Conference on Information and Knowledge Management
November 8 - 13, 2004
Washington, D.C., USA

Acceptance Rates

Overall Acceptance Rate 1,861 of 8,427 submissions, 22%

Cited By

  • (2022) Cardinality estimation of approximate substring queries using deep learning. Proceedings of the VLDB Endowment 15:11 (3145-3157). DOI: 10.14778/3551793.3551859. Online publication date: 1-Jul-2022
  • (2022) Fine-Grained Entity Typing with a Type Taxonomy: a Systematic Review. IEEE Transactions on Knowledge and Data Engineering (1-1). DOI: 10.1109/TKDE.2022.3148980. Online publication date: 2022
  • (2022) A novel end-to-end neural network for simultaneous filtering of task-unrelated named entities and fine-grained typing of task-related named entities. Expert Systems with Applications: An International Journal 204:C. DOI: 10.1016/j.eswa.2022.117498. Online publication date: 15-Oct-2022
  • (2021) Entity Recommendation for Everyday Digital Tasks. ACM Transactions on Computer-Human Interaction 28:5 (1-41). DOI: 10.1145/3458919. Online publication date: 20-Aug-2021
  • (2021) Substring Similarity Search with Synonyms. 2021 IEEE 37th International Conference on Data Engineering (ICDE) (2003-2008). DOI: 10.1109/ICDE51399.2021.00191. Online publication date: Apr-2021
  • (2020) Cluster-based mention typing for named entity disambiguation. Natural Language Engineering 28:1 (1-37). DOI: 10.1017/S1351324920000443. Online publication date: 20-Aug-2020
  • (2019) PAYMA: A Tagged Corpus of Persian Named Entities. Signal and Data Processing 16:1 (91-110). DOI: 10.29252/jsdp.16.1.91. Online publication date: 1-May-2019
  • (2018) Taking account of the actions of others in value-based reasoning. Artificial Intelligence 254:C (1-20). DOI: 10.1016/j.artint.2017.09.002. Online publication date: 1-Jan-2018
  • (2016) EgoSet. Proceedings of the Ninth ACM International Conference on Web Search and Data Mining (645-654). DOI: 10.1145/2835776.2835808. Online publication date: 8-Feb-2016
  • (2015) Who With Whom And How? Proceedings of the 24th ACM International on Conference on Information and Knowledge Management (1491-1500). DOI: 10.1145/2806416.2806582. Online publication date: 17-Oct-2015
