
Helping editors choose better seed sets for entity set expansion

Authors:
Vishnu Vyas, Yahoo! Labs, Sunnyvale, CA, USA
Patrick Pantel, Yahoo! Labs, Sunnyvale, CA, USA
Eric Crestan, Yahoo! Labs, Sunnyvale, CA, USA

CIKM '09: Proceedings of the 18th ACM conference on Information and knowledge management, November 2009, Pages 225–234
https://doi.org/10.1145/1645953.1645984

Published: 02 November 2009

ABSTRACT

Sets of named entities are used heavily at commercial search engines such as Google, Yahoo and Bing. Acquiring sets of entities typically consists of combining semi-supervised expansion algorithms with manual cleaning of the resulting expanded sets. In this paper, we study the effects of different seed sets in a state-of-the-art semi-supervised expansion system and show that expansion performance varies tremendously with the choice of seeds. We further show that human editors, in general, provide very poor seed sets, which perform well below the average random seed set. We identify three factors of seed set composition, namely prototypicality, ambiguity and coverage, and we investigate their effects on expansion performance. Finally, we propose several automatic systems for improving editor-generated seed sets by removing ambiguous and other error-prone seed instances. An extensive experimental analysis shows that removing the right seeds from a seed set can improve expansion quality, measured in R-precision, by up to 46% on average. Our automatic methods outperform the human editors' seed sets and improve expansion performance by up to 34% on average over the original seed sets.
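
The abstract reports expansion quality in R-precision. As a reminder of what that metric measures, the following is a minimal sketch, not the authors' evaluation code: R-precision is the precision of a ranked list at rank R, where R is the number of items in the gold-standard set. The function names `r_precision` and `relative_improvement`, and the toy country lists in the usage note, are hypothetical and chosen only for illustration.

```python
from typing import Sequence, Set


def r_precision(ranked: Sequence[str], gold: Set[str]) -> float:
    """Precision at rank R, where R is the size of the gold set.

    `ranked` is the expansion output ordered from most to least confident.
    Assumes `ranked` contains no duplicate entries.
    """
    r = len(gold)
    if r == 0:
        return 0.0
    # Count how many of the top-R ranked items are in the gold set.
    hits = sum(1 for item in ranked[:r] if item in gold)
    return hits / r


def relative_improvement(baseline: float, improved: float) -> float:
    """Relative change of `improved` over `baseline` (0.5 means +50%)."""
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    return (improved - baseline) / baseline


if __name__ == "__main__":
    # Hypothetical toy data: a four-item gold set and two expansions,
    # one from an original seed set and one after removing a seed.
    gold = {"france", "germany", "italy", "spain"}
    original = ["france", "georgia", "italy", "turkey", "spain"]
    refined = ["france", "italy", "spain", "chad", "germany"]

    before = r_precision(original, gold)
    after = r_precision(refined, gold)
    print(before, after, relative_improvement(before, after))
```

In this toy case the top four items of the original expansion contain two gold entities and the refined expansion contains three, so R-precision moves from 0.5 to 0.75. Percentage improvements such as those quoted in the abstract are naturally read as relative changes of this kind, although the exact averaging procedure is not specified on this page.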

Index Terms

1. Computing methodologies
  1. Machine learning

Published in
CIKM '09: Proceedings of the 18th ACM conference on Information and knowledge management
November 2009
2162 pages
ISBN: 9781605585123
DOI: 10.1145/1645953
General Chairs: David Cheung (University of Hong Kong, Hong Kong), Il-Yeol Song (Drexel University, USA)
Program Chairs: Wesley Chu (UCLA, USA), Xiaohua Hu (Drexel University, USA), Jimmy Lin (University of Maryland, USA)
Copyright © 2009 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
information extraction
seed set expansion
seed set refinement
Qualifiers
- research-article
Acceptance Rates
Overall Acceptance Rate: 1,861 of 8,427 submissions, 22%