DOI: 10.1145/2505515.2505602
research-article

Identifying salient entities in web pages

Published: 27 October 2013

ABSTRACT

We propose a system that determines the salience of entities within web documents. Many recent advances in commercial search engines leverage the identification of entities in web pages. However, for many pages, only a small subset of entities is central to the document, which can lead to degraded relevance for entity-triggered experiences. We address this problem by devising a system that scores each entity on a web page according to its centrality to the page content. We propose salience classification functions that incorporate various cues from document content, web search logs, and a large web graph. To train the models cost-effectively, we introduce a soft labeling methodology that generates a set of annotations based on user behaviors observed in web search logs. We evaluate several variations of our model via a large-scale empirical study conducted over a test set, which we release publicly to the research community. We demonstrate that our methods significantly outperform competitive baselines and the previous state of the art, while keeping the human annotation cost to a minimum.
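
The abstract does not spell out how soft labels are computed from search logs. As a rough illustration only, and not the paper's actual formula, one could score an entity on a page by the share of that page's clicks that came from queries mentioning the entity. Everything in this sketch is a hypothetical assumption: the function name `soft_salience_labels`, the `(query, url, clicks)` log format, and the substring test used to match entities to queries.

```python
from collections import defaultdict

def soft_salience_labels(click_log, page_entities):
    """Hypothetical soft labels: fraction of a page's clicks whose query mentions the entity.

    click_log: iterable of (query, clicked_url, click_count) tuples (assumed format).
    page_entities: dict mapping url -> list of entity strings found on that page.
    """
    total_clicks = defaultdict(int)   # clicks per page
    entity_clicks = defaultdict(int)  # clicks per (page, entity) pair
    for query, url, clicks in click_log:
        if url not in page_entities:
            continue  # ignore pages we have no entity annotations for
        total_clicks[url] += clicks
        q = query.lower()
        for entity in page_entities[url]:
            # Assumption: naive case-insensitive substring match.
            if entity.lower() in q:
                entity_clicks[(url, entity)] += clicks
    return {
        (url, entity): entity_clicks[(url, entity)] / total_clicks[url]
        for url, entities in page_entities.items()
        if total_clicks[url] > 0  # pages with no clicks get no label
        for entity in entities
    }

log = [("barack obama speech", "u1", 3), ("weather", "u1", 1)]
labels = soft_salience_labels(log, {"u1": ["Barack Obama", "Chicago"]})
print(labels)
```

Scores of this kind lie between 0 and 1 and could serve as noisy training targets in place of human annotations, which is the general idea the abstract describes.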


      • Published in

        CIKM '13: Proceedings of the 22nd ACM international conference on Information & Knowledge Management
        October 2013
        2612 pages
        ISBN:9781450322638
        DOI:10.1145/2505515

        Copyright © 2013 ACM

        Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

        Publisher

        Association for Computing Machinery

        New York, NY, United States


        Acceptance Rates

        CIKM '13 Paper Acceptance Rate: 143 of 848 submissions, 17%
        Overall Acceptance Rate: 1,861 of 8,427 submissions, 22%
