ABSTRACT
Sets of named entities are used heavily by commercial search engines such as Google, Yahoo and Bing. Acquiring sets of entities typically consists of combining semi-supervised expansion algorithms with manual cleaning of the resulting expanded sets. In this paper, we study the effects of different seed sets in a state-of-the-art semi-supervised expansion system and show that expansion performance varies tremendously depending on the choice of seeds. We further show that human editors, in general, provide very poor seed sets, which perform well below the average random seed set. We identify three factors of seed set composition, namely prototypicality, ambiguity and coverage, and we investigate their effects on expansion performance. Finally, we propose various automatic systems for improving editor-generated seed sets, which seek to remove ambiguous and other error-prone seed instances. An extensive experimental analysis shows that expansion quality, measured in R-precision, can be improved by up to 46% on average by removing the right seeds from a seed set. Our automatic methods outperform the human editors' seed sets and improve expansion performance by up to 34% on average over the original seed sets.
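The abstract reports expansion quality in R-precision and gives its gains as relative improvements. As a minimal sketch of how those two quantities are usually computed, the Python below uses the standard definition of R-precision (precision at rank R, where R is the size of the gold-standard set). The function names, the toy entity lists and the gold set are hypothetical illustrations, not the paper's system or data.

```python
from typing import Sequence, Set


def r_precision(ranked: Sequence[str], gold: Set[str]) -> float:
    """Precision at rank R, where R is the number of gold-standard entities."""
    r = len(gold)
    if r == 0:
        raise ValueError("gold set must be non-empty")
    # Count how many of the top-R ranked candidates are in the gold set.
    hits = sum(1 for item in ranked[:r] if item in gold)
    return hits / r


def relative_improvement(baseline: float, improved: float) -> float:
    """Relative gain of `improved` over `baseline` (0.5 means +50%)."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (improved - baseline) / baseline


# Hypothetical toy data: a gold set of countries and two ranked expansions,
# one from an original seed set and one after removing a problematic seed.
gold = {"france", "spain", "italy", "peru"}
expansion_original = ["france", "paris", "spain", "europe", "italy"]
expansion_cleaned = ["france", "spain", "italy", "paris", "peru"]

rp_original = r_precision(expansion_original, gold)
rp_cleaned = r_precision(expansion_cleaned, gold)
gain = relative_improvement(rp_original, rp_cleaned)

print(f"R-precision (original seeds): {rp_original:.2f}")
print(f"R-precision (cleaned seeds):  {rp_cleaned:.2f}")
print(f"Relative improvement:         {gain:.0%}")
```

In this toy case only the top four candidates are scored, because the gold set has four members; anything ranked lower does not affect the metric.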
Index Terms
- Helping editors choose better seed sets for entity set expansion