research-article

Identifying salient entities in web pages

Authors:
Michael Gamon, Microsoft Corporation, Redmond, WA, USA
Tae Yano, Carnegie Mellon University, Pittsburgh, PA, USA
Xinying Song, Microsoft Corporation, Redmond, WA, USA
Johnson Apacible, Microsoft Corporation, Redmond, WA, USA
Patrick Pantel, Microsoft Corporation, Redmond, WA, USA

CIKM '13: Proceedings of the 22nd ACM international conference on Information & Knowledge Management, October 2013, Pages 2375–2380
https://doi.org/10.1145/2505515.2505602

Published: 27 October 2013

ABSTRACT

We propose a system that determines the salience of entities within web documents. Many recent advances in commercial search engines leverage the identification of entities in web pages. However, for many pages, only a small subset of the entities is central to the document, which can degrade relevance for entity-triggered experiences. We address this problem by devising a system that scores each entity on a web page according to its centrality to the page content. We propose salience classification functions that incorporate various cues from document content, web search logs, and a large web graph. To train the models cost-effectively, we introduce a soft labeling methodology that generates a set of annotations based on user behaviors observed in web search logs. We evaluate several variations of our model via a large-scale empirical study conducted over a test set, which we release publicly to the research community. We demonstrate that our methods significantly outperform competitive baselines and the previous state of the art, while keeping the human annotation cost to a minimum.
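
The abstract does not spell out how the soft labels are computed, so the following is only an illustrative sketch of the general idea of deriving salience annotations from search logs. It assumes a hypothetical log format of `(query, url, clicks)` tuples and a hypothetical scoring rule: the share of a page's clicks that came from queries mentioning the entity. The function name `soft_salience_labels`, the substring matching, and the sample data are all assumptions made for illustration, not the authors' method.

```python
from collections import defaultdict


def soft_salience_labels(click_log, page_entities):
    """Return {url: {entity: score}} with scores in [0, 1].

    click_log: iterable of (query, url, clicks) tuples (hypothetical format).
    page_entities: dict mapping url -> list of entity strings found on the page.
    Score (assumed rule): fraction of a page's clicks whose query mentions the entity.
    """
    total_clicks = defaultdict(int)
    entity_clicks = defaultdict(lambda: defaultdict(int))

    for query, url, clicks in click_log:
        if url not in page_entities:
            continue  # ignore pages without extracted entities
        total_clicks[url] += clicks
        normalized_query = query.lower()
        for entity in page_entities[url]:
            # Naive case-insensitive substring match; a real system would
            # need proper entity linking between queries and page entities.
            if entity.lower() in normalized_query:
                entity_clicks[url][entity] += clicks

    labels = {}
    for url, entities in page_entities.items():
        n = total_clicks[url]
        labels[url] = {
            e: (entity_clicks[url][e] / n if n else 0.0) for e in entities
        }
    return labels


if __name__ == "__main__":
    # Made-up sample data, for illustration only.
    log = [
        ("michael jordan stats", "u1", 8),
        ("chicago bulls roster", "u1", 2),
        ("weather", "u2", 5),
    ]
    entities = {"u1": ["Michael Jordan", "Chicago Bulls", "Nike"]}
    print(soft_salience_labels(log, entities))
```

Under this assumed rule, an entity that users frequently search for before clicking through to the page receives a score near 1, while an entity that is merely mentioned on the page receives a score near 0. Such scores could then serve as training targets without manual annotation.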

Index Terms

Identifying salient entities in web pages
1. Computing methodologies
  1. Artificial intelligence
    1. Natural language processing
2. Information systems
  1. Information retrieval
    1. Document representation
      1. Content analysis and feature selection

Published in
CIKM '13: Proceedings of the 22nd ACM international conference on Information & Knowledge Management
October 2013
2612 pages
ISBN: 9781450322638
DOI: 10.1145/2505515
General Chairs: Qi He (LinkedIn, USA), Arun Iyengar (IBM T.J. Watson Research Center, USA)
Program Chairs: Wolfgang Nejdl (L3S Research Center, Germany), Jian Pei (Simon Fraser University, Canada), Rajeev Rastogi (Amazon, India)
Copyright © 2013 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
content analysis
document aboutness
entity salience
Qualifiers
- research-article
Conference

Acceptance Rates
CIKM '13 Paper Acceptance Rate: 143 of 848 submissions, 17%
Overall Acceptance Rate: 1,861 of 8,427 submissions, 22%