Article

Acquisition of categorized named entities for web search

Author: Marius Pasca

CIKM '04: Proceedings of the thirteenth ACM international conference on Information and knowledge management

Pages 137 - 145

https://doi.org/10.1145/1031171.1031194

Published: 13 November 2004

Abstract

The recognition of names and their associated categories within unstructured text traditionally relies on semantic lexicons and gazetteers. The amount of effort required to assemble large lexicons confines the recognition to either a limited domain (e.g., medical imaging), or a small set of pre-defined, broader categories of interest (e.g., persons, countries, organizations, products). This constitutes a serious limitation in an information seeking context. In this case, the categories of potential interest to users are more diverse (universities, agencies, retailers, celebrities), often refined (e.g., SLR digital cameras, programming languages, multinational oil companies), and usually overlapping (e.g., the same entity may be concurrently a brand name, a technology company, and an industry leader). We present a lightly supervised method for acquiring named entities in arbitrary categories. The method applies lightweight lexico-syntactic extraction patterns to the unstructured text of Web documents. The method is a departure from traditional approaches to named entity recognition in that: 1) it does not require any start-up seed names or training; 2) it does not encode any domain knowledge in its extraction patterns; 3) it is only lightly supervised, and data-driven; 4) it does not impose any a-priori restriction on the categories of extracted names. We illustrate applications of the method in Web search, and describe experiments on 500 million Web documents and news articles.
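As a rough illustration of what a lightweight lexico-syntactic extraction pattern can look like, the sketch below applies a single Hearst-style "X and other Ys" pattern (see [14]) to raw text and collects the names found under each category phrase. It is not the set of patterns used in the paper: the regular expressions, the `extract` function, and the plural-noun heuristic for finding the end of the category phrase are all assumptions made for this example.

```python
import re
from collections import defaultdict

# Hypothetical building blocks, chosen for this sketch only.
# A name is a run of capitalized tokens, e.g. "New Zealand".
NAME = r"[A-Z][\w&-]*(?:\s[A-Z][\w&-]*)*"
# A category is up to three modifier words followed by the first word
# ending in "s" (a crude stand-in for a plural head noun).
CATEGORY = r"(?:\w+\s){0,3}?\w+s\b"

# "<name>, <name> and other <category>" (also accepts "or other").
PATTERN = re.compile(
    rf"(?P<names>{NAME}(?:,\s{NAME})*),?\s(?:and|or)\sother\s(?P<category>{CATEGORY})"
)

def extract(text):
    """Return a dict mapping each lowercased category phrase to a set of names."""
    pairs = defaultdict(set)
    for match in PATTERN.finditer(text):
        category = match.group("category").lower()
        for name in re.split(r",\s", match.group("names")):
            pairs[category].add(name)
    return dict(pairs)

if __name__ == "__main__":
    sample = (
        "We flew to France, Italy and other European countries last year. "
        "Shoppers compared Nikon, Canon and other camera makers."
    )
    for category, names in extract(sample).items():
        print(category, sorted(names))
```

The pattern encodes no domain knowledge and needs no seed names, so the same expression yields both "european countries" and "camera makers" from the sample text. The heuristics are deliberately naive: a capitalized sentence-initial word directly before a name would be absorbed into it, and a modifier that happens to end in "s" would cut the category short. On a large corpus, redundancy across many documents would be needed to filter out such noise.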

References

[1] E. Agichtein and L. Gravano. Snowball: Extracting relations from large plaintext collections. In Proceedings of the 5th ACM International Conference on Digital Libraries (DL-00), San Antonio, Texas, 2000.

[2] T. Brants. TnT - a statistical part of speech tagger. In Proceedings of the 6th Conference on Applied Natural Language Processing (ANLP-00), pages 224-231, Seattle, Washington, 2000.

[3] E. Brill and P. Resnik. A transformation-based approach to prepositional phrase attachment disambiguation. In Proceedings of the 15th International Conference on Computational Linguistics (COLING-94), pages 1198-1204, Kyoto, Japan, 1994.

[4] S. Brin. Extracting patterns and relations from the World Wide Web. In Proceedings of the 6th International Conference on Extending Database Technology (EDBT-98), Workshop on the Web and Databases, pages 172-183, Valencia, Spain, 1998.

[5] S. Caraballo. Automatic construction of a hypernym-labeled noun hierarchy from text. In Proceedings of the 37th International Conference on Computational Linguistics (ACL-99), pages 120-126, College Park, Maryland, 1999.

[6] N. Chinchor and E. Marsh. MUC-7 information extraction task definition, version 5.1. In Proceedings of the 7th Message Understanding Conference (MUC-7), 1998.

[7] M. Collins and Y. Singer. Unsupervised models for named entity classification. In Proceedings of the 1999 Conference on Empirical Methods in Natural Language Processing and Very Large Corpora (EMNLP/VLC-99), pages 189-196, College Park, Maryland, 1999.

[8] M. Craven, D. DiPasquo, D. Freitag, A. McCallum, T. Mitchell, K. Nigam, and S. Slattery. Learning to construct knowledge bases from the World Wide Web. Artificial Intelligence, 118:69-113, 2000.

[9] S. Cucerzan and D. Yarowsky. Language independent named entity recognition combining morphological and contextual evidence. In Proceedings of the 1999 Conference on Empirical Methods in Natural Language Processing and Very Large Corpora (EMNLP/VLC-99), pages 90-99, College Park, Maryland, 1999.

[10] O. Etzioni, M. Cafarella, D. Downey, S. Kok, A.-M. Popescu, T. Shaked, S. Soderland, D. Weld, and A. Yates. Web-scale information extraction in KnowItAll. In Proceedings of the 13th World Wide Web Conference (WWW-04), New York, 2004.

[11] C. Fellbaum, editor. WordNet: An Electronic Lexical Database and Some of its Applications. MIT Press, 1998.

[12] S. Flank. A layered approach to nlp-based information retrieval. In Proceedings of the 17th International Conference on Computational Linguistics and the 36th Annual Meeting of the Association for Computational Linguistics (COLING-ACL-98), pages 397-403, Montreal, Quebec, 1998.

[13] S. Green. Automatically generating hypertext in newspaper articles by computing semantic relatedness. In Proceedings of the 2nd Conference on Computational Language Learning (CoNLL-98), pages 101-110, Sydney, Australia, 1998.

[14] M. Hearst. Automatic acquisition of hyponyms from large text corpora. In Proceedings of the 14th International Conference on Computational Linguistics (COLING-92), pages 539-545, Nantes, France, 1992.

[15] B. Jansen. The effect of query complexity on Web searching results. Information Research, 6(1), October 2000.

[16] G. Krupka and K. Hausman. IsoQuest, Inc.: Description of the NetOwl extractor system as used for MUC-7. In Proceedings of the 7th Message Understanding Conference (MUC-7), Fairfax, Virginia, 1998.

[17] M. Marcus, B. Santorini, and M. Marcinkiewicz. Building a large annotated corpus of English: The Penn Treebank. Computational Linguistics, 19(2):313-330, June 1993.

[18] K. McCarthy and W. Lehnert. Using decision trees for coreference resolution. In Proceedings of the 14th International Joint Conference on Artificial Intelligence (IJCAI-95), pages 1050-1055, Montreal, Quebec, 1995.

[19] A. Mikheev, M. Moens, and C. Grover. Named entity recognition without gazetteers. In Proceedings of the 10th Conference of the European Chapter of the Association for Computational Linguistics (EACL-99), pages 1-8, Bergen, Norway, 1999.

[20] P. Pantel and D. Ravichandran. Automatically labeling semantic classes. In Proceedings of the 2004 Human Language Technology Conference (HLT-NAACL-04), pages 321-328, Boston, Massachusetts, 2004.

[21] W. Phillips and E. Riloff. Exploiting strong syntactic heuristics and co-training to learn semantic lexicons. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP-02), pages 125-132, Philadelphia, Pennsylvania, 2002.

[22] D. Ravichandran and E. Hovy. Learning surface text patterns for a question answering system. In Proceedings of the 40th Annual Meeting of the Association of Computational Linguistics (ACL-02), Philadelphia, Pennsylvania, 2002.

[23] E. Riloff and R. Jones. Learning dictionaries for information extraction by multi-level bootstrapping. In Proceedings of the 16th National Conference on Artificial Intelligence (AAAI-99), pages 474-479, Orlando, Florida, 1999.

[24] S. Russell and P. Norvig. Artificial Intelligence: a Modern Approach. Prentice Hall, 2nd edition, 2003.

[25] K. Shinzato and K. Torisawa. Acquiring hyponymy relations from web documents. In Proceedings of the 2004 Human Language Technology Conference (HLT-NAACL-04), pages 73-80, Boston, Massachusetts, 2004.

[26] M. Stevenson and R. Gaizauskas. Using corpus-derived name lists for named entity recognition. In Proceedings of the 6th Conference on Applied Natural Language Processing (ANLP-00), Seattle, Washington, 2000.

[27] N. Stokes and J. Carthy. First story detection using a composite document representation. In Proceedings of the 1st International Conference on Human Language Technology Research (HLT-01), San Diego, California, 2001.

[28] E. Voorhees. Using WordNet for text retrieval. In WordNet, An Electronic Lexical Database, pages 285-303. The MIT Press, 1998.

Index Terms

Acquisition of categorized named entities for web search
1. Computing methodologies
  1. Artificial intelligence
    1. Natural language processing
      1. Language resources
  2. Machine learning
    1. Learning settings
2. Information systems
  1. Information retrieval
    1. Information retrieval query processing
  2. World Wide Web
    1. Web applications
    2. Web services

Reviews

Reviewer: Alexander Gelbukh

The first thing expected from a program intelligently dealing with natural language text is the ability to relate some words with others, for example, to know that France is a European country, so that a user searching on the Internet for "tours to Europe" would get information on visiting France. Dictionaries (ontologies) providing such information are the driving force of modern information retrieval, language processing, and electronic commerce research. Since the manual compilation of such ontologies is too costly, many researchers have suggested methods for their automatic creation, by parsing texts. Pasca, of Google, demonstrates how to extract such an ontology, with a surprisingly simple, robust, and extensible unsupervised algorithm, looking for expressions like "France and other European countries" in a huge collection of Web pages; the system automatically learns new instances of such patterns. Though the idea is not novel [1], its evaluation on such a large corpus provides yet another example of exploiting the enormous redundancy of the Web to extract very specific knowledge, on very broad subjects, with very basic, simplistic algorithms. The paper discusses numerous applications of the extracted ontology in information retrieval and computational lexicography (which is also not new, but tutorial), and possible generalizations of the method to extract other types of data. The paper is motivating for information retrieval and language engineering specialists, and, due to its good introduction, clear style, numerous examples, and simple algorithms, it will be encouraging for novices, and for developers dealing with natural language data.

Online Computing Reviews Service

Information

Published In

CIKM '04: Proceedings of the thirteenth ACM international conference on Information and knowledge management

November 2004

678 pages

ISBN:1581138741

DOI:10.1145/1031171

General Chair: David Grossman (Illinois Institute of Technology)

Program Chairs: Luis Gravano (Columbia University), ChengXiang Zhai (University of Illinois at Urbana-Champaign), Otthein Herzog (University of Bremen, Germany), David A. Evans (Clairvoyance Corporation)

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Article

Conference

CIKM04: Conference on Information and Knowledge Management

November 8 - 13, 2004

Washington, D.C., USA

Acceptance Rates

Overall Acceptance Rate 1,861 of 8,427 submissions, 22%

Bibliometrics & Citations

Article Metrics

Total Citations: 74

Total Downloads: 1,923

Downloads (Last 12 months): 12

Downloads (Last 6 weeks): 0

Reflects downloads up to 20 Jan 2025

Cited By

Kwon S, Jung W, Shim K (2022). Cardinality estimation of approximate substring queries using deep learning. Proceedings of the VLDB Endowment 15:11 (3145-3157). Online publication date: 1-Jul-2022. https://dl.acm.org/doi/10.14778/3551793.3551859

Wang R, Hou F, Cahan S, Chen L, Jia X, Ji W (2022). Fine-Grained Entity Typing with a Type Taxonomy: a Systematic Review. IEEE Transactions on Knowledge and Data Engineering (1-1). Online publication date: 2022. https://doi.org/10.1109/TKDE.2022.3148980

Li Q, Mao K, Li P, Xu Y, Lo E (2022). A novel end-to-end neural network for simultaneous filtering of task-unrelated named entities and fine-grained typing of task-related named entities. Expert Systems with Applications: An International Journal 204:C. Online publication date: 15-Oct-2022. https://dl.acm.org/doi/10.1016/j.eswa.2022.117498

Jacucci G, Daee P, Vuong T, Andolina S, Klouche K, Sjöberg M, Ruotsalo T, Kaski S (2021). Entity Recommendation for Everyday Digital Tasks. ACM Transactions on Computer-Human Interaction 28:5 (1-41). Online publication date: 20-Aug-2021. https://dl.acm.org/doi/10.1145/3458919

Song G, Shim K, Lee H (2021). Substring Similarity Search with Synonyms. 2021 IEEE 37th International Conference on Data Engineering (ICDE) (2003-2008). Online publication date: Apr-2021. https://doi.org/10.1109/ICDE51399.2021.00191

Çelebi A, Özgür A (2020). Cluster-based mention typing for named entity disambiguation. Natural Language Engineering 28:1 (1-37). Online publication date: 20-Aug-2020. https://doi.org/10.1017/S1351324920000443

Shahshahani M, Mohseni M, Shakery A, Faili H (2019). PAYMA: A Tagged Corpus of Persian Named Entities. Signal and Data Processing 16:1 (91-110). Online publication date: 1-May-2019. https://doi.org/10.29252/jsdp.16.1.91

Atkinson K, Bench-Capon T (2018). Taking account of the actions of others in value-based reasoning. Artificial Intelligence 254:C (1-20). Online publication date: 1-Jan-2018. https://dl.acm.org/doi/10.1016/j.artint.2017.09.002

Rong X, Chen Z, Mei Q, Adar E, Bennett P, Josifovski V, Neville J, Radlinski F (2016). EgoSet. Proceedings of the Ninth ACM International Conference on Web Search and Data Mining (645-654). Online publication date: 8-Feb-2016. https://dl.acm.org/doi/10.1145/2835776.2835808

Siersdorfer S, Kemkes P, Ackermann H, Zerr S, Bailey J, Moffat A, Aggarwal C, de Rijke M, Kumar R, Murdock V, Sellis T, Yu J (2015). Who With Whom And How? Proceedings of the 24th ACM International on Conference on Information and Knowledge Management (1491-1500). Online publication date: 17-Oct-2015. https://dl.acm.org/doi/10.1145/2806416.2806582
