ABSTRACT
We propose a system that determines the salience of entities within web documents. Many recent advances in commercial search engines leverage the identification of entities in web pages. However, on many pages only a small subset of entities is central to the document, which can degrade relevance for entity-triggered experiences. We address this problem by devising a system that scores each entity on a web page according to its centrality to the page content. We propose salience classification functions that incorporate various cues from document content, web search logs, and a large web graph. To train the models cost-effectively, we introduce a soft labeling methodology that generates a set of annotations based on user behaviors observed in web search logs. We evaluate several variations of our model via a large-scale empirical study conducted over a test set, which we release publicly to the research community. We demonstrate that our methods significantly outperform competitive baselines and the previous state of the art, while keeping the human annotation cost to a minimum.