research-article

Building enriched web page representations using link paths

Authors:

ChengXiang Zhai,

Jiawei Han

HT '12: Proceedings of the 23rd ACM conference on Hypertext and social media

Pages 53 - 62

https://doi.org/10.1145/2309996.2310006

Published: 25 June 2012

Abstract

Anchor text has a history of enriching documents for a variety of tasks within the World Wide Web. Anchor texts are useful because they are similar to typical Web queries, and because they express the document's context. Therefore, it is a common practice for Web search engines to incorporate incoming anchor text into the document's standard textual representation. However, this approach will not suffice for documents with very few inlinks, and it does not incorporate the document's full context. To remedy these problems, we employ link paths, which contain anchor texts from paths through the Web ending at the document in question. We propose and study several different ways to aggregate anchor text from link paths, and we show that the information from link paths can be used to (1) improve known-item search in site-specific search, and (2) map Web pages to database records. We rigorously evaluate our proposed approach on several real-world test collections. We find that our approach significantly improves performance over baseline and existing techniques in both tasks.
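
As a rough illustration of the link-path idea described in the abstract, the following Python sketch collects the anchor texts along every short path that ends at a target page and folds them into a weighted bag of words. It is a minimal sketch under stated assumptions, not the authors' method: the toy graph, the page names, the function names (`build_inlinks`, `link_paths`, `aggregate`), and the distance-decay weighting are all hypothetical, since the abstract does not specify how the paper aggregates anchor text.

```python
from collections import Counter, defaultdict

# Toy link graph: (source page, target page, anchor text).
# All page names and anchors are hypothetical.
EDGES = [
    ("home", "people", "People"),
    ("people", "faculty", "Faculty"),
    ("faculty", "jsmith", "Jane Smith"),
    ("home", "jsmith", "Director"),
]


def build_inlinks(edges):
    """Index edges by target so paths can be walked backwards."""
    inlinks = defaultdict(list)
    for src, dst, anchor in edges:
        inlinks[dst].append((src, anchor))
    return inlinks


def link_paths(inlinks, target, max_len):
    """Return the anchor-text sequence of every simple path of at most
    max_len links that ends at target, ordered from path start to target."""
    paths = []

    def walk(page, anchors, visited):
        if anchors:
            paths.append(list(reversed(anchors)))
        if len(anchors) == max_len:
            return
        for src, anchor in inlinks.get(page, []):
            if src not in visited:  # skip pages already on this path
                walk(src, anchors + [anchor], visited | {src})

    walk(target, [], {target})
    return paths


def aggregate(paths, decay=0.5):
    """Assumed aggregation: an anchor d links away from the target adds
    decay**d to each of its terms (d = 0 for the link into the target)."""
    weights = Counter()
    for path in paths:
        for dist, anchor in enumerate(reversed(path)):
            for term in anchor.lower().split():
                weights[term] += decay ** dist
    return weights


inlinks = build_inlinks(EDGES)
paths = link_paths(inlinks, "jsmith", max_len=3)
weights = aggregate(paths)
```

With `max_len=1` the sketch reduces to the conventional approach of using only incoming anchor text. Larger values pull in context from pages further up the path, which is the situation the abstract targets for documents with very few inlinks.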

Index Terms

Building enriched web page representations using link paths
1. Information systems
  1. Information retrieval
  2. Information systems applications
    1. Data mining

Published In

HT '12: Proceedings of the 23rd ACM conference on Hypertext and social media

June 2012

340 pages

ISBN:9781450313353

DOI:10.1145/2309996

General Chair: Ethan Munson, University of Wisconsin - Milwaukee, USA

Program Chair: Markus Strohmaier, Graz University of Technology, Austria

Copyright © 2012 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

SIGWEB: ACM Special Interest Group on Hypertext, Hypermedia, and Web

Publisher

Association for Computing Machinery

New York, NY, United States

Conference

HT '12: 23rd ACM Conference on Hypertext and Social Media

June 25 - 28, 2012

Milwaukee, Wisconsin, USA

Sponsor: SIGWEB

Acceptance Rates

HT '12 paper acceptance rate: 33 of 120 submissions, 28%

Overall acceptance rate: 378 of 1,158 submissions, 33%

Article Metrics

Total citations: 4

Total downloads: 219

Downloads (last 12 months): 1

Downloads (last 6 weeks): 0

Reflects downloads up to 02 Mar 2025

Cited By

Shi B, Weninger T (2014). Mining Interesting Meta-Paths from Complex Heterogeneous Information Networks. 2014 IEEE International Conference on Data Mining Workshop, pp. 488-495. Online publication date: Dec-2014. https://doi.org/10.1109/ICDMW.2014.25

Weninger T, Johnston T, Han J (2013). The parallel path framework for entity discovery on the web. ACM Transactions on the Web, 7:3, pp. 1-29. Online publication date: 30-Sep-2013. https://dl.acm.org/doi/10.1145/2516633.2516638

Weninger T, Han J, Leonardi S, Panconesi A, Ferragina P, Gionis A (2013). Exploring structure and content on the web. Proceedings of the sixth ACM international conference on Web search and data mining, pp. 779-780. Online publication date: 4-Feb-2013. https://dl.acm.org/doi/10.1145/2433396.2433499

Yang Q, Niu Z, Zhang C, Huang S (2013). Building Enhanced Link Context by Logical Sitemap. Knowledge Science, Engineering and Management, pp. 36-47. Online publication date: 2013. https://doi.org/10.1007/978-3-642-39787-5_4
