DOI: 10.1145/1645953.1645984
research-article

Helping editors choose better seed sets for entity set expansion

Published: 02 November 2009

ABSTRACT

Sets of named entities are used heavily at commercial search engines such as Google, Yahoo and Bing. Acquiring sets of entities typically consists of combining semi-supervised expansion algorithms with manual cleaning of the resulting expanded sets. In this paper, we study the effects of different seed sets in a state-of-the-art semi-supervised expansion system and show a tremendous variation in expansion performance depending on the choice of seeds. We further show that human editors, in general, provide very poor seed sets, which perform well below the average random seed set. We identify three factors of seed set composition, namely prototypicality, ambiguity and coverage, and we investigate their effects on expansion performance. Finally, we propose various automatic systems for improving editor-generated seed sets, which seek to remove ambiguous and other error-prone seed instances. An extensive experimental analysis shows that expansion quality, measured in R-precision, can be improved on average by a maximum of 46% by removing the right seeds from a seed set. Our automatic methods outperform the human editors' seed sets and on average improve expansion performance by up to 34% over the original seed sets.
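
The abstract reports expansion quality in R-precision. As a reminder of what that metric computes, here is a minimal sketch of the standard definition: precision at rank R, where R is the number of correct (gold) entities. The function name `r_precision`, the example entities, and the assumption that the ranked list contains no duplicates are illustrative only and are not taken from the paper.

```python
def r_precision(ranked, gold):
    """Precision at rank R, where R is the number of gold entities.

    ranked: expansion output, best candidate first (assumed duplicate-free).
    gold:   collection of entities judged correct for the target set.
    """
    gold = set(gold)
    r = len(gold)
    if r == 0:
        # No correct entities are defined, so report 0.0 by convention here.
        return 0.0
    # Count how many of the top-R candidates are correct.
    hits = sum(1 for entity in ranked[:r] if entity in gold)
    return hits / r


# Hypothetical expansion of a "countries" set with four gold entities.
gold_countries = {"france", "germany", "italy", "spain"}
expansion = ["france", "paris", "italy", "spain", "germany"]
score = r_precision(expansion, gold_countries)  # 3 of the top 4 are correct
```

With this definition, comparing the score of an original seed set against the score after dropping a seed gives the kind of relative improvement the abstract quotes.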

Published in: CIKM '09: Proceedings of the 18th ACM conference on Information and knowledge management, November 2009, 2162 pages
ISBN: 9781605585123
DOI: 10.1145/1645953

Copyright © 2009 ACM


Publisher: Association for Computing Machinery, New York, NY, United States


Overall acceptance rate: 1,861 of 8,427 submissions (22%)
