DOI: 10.1145/2309996.2310006
research-article

Building enriched web page representations using link paths

Published: 25 June 2012

Abstract

Anchor text has a history of enriching documents for a variety of tasks within the World Wide Web. Anchor texts are useful because they are similar to typical Web queries, and because they express the document's context. Therefore, it is common practice for Web search engines to incorporate incoming anchor text into the document's standard textual representation. However, this approach does not suffice for documents with very few inlinks, and it does not incorporate the document's full context. To mitigate these problems, we employ link paths, which contain anchor texts from paths through the Web ending at the document in question. We propose and study several different ways to aggregate anchor text from link paths, and we show that the information from link paths can be used to (1) improve known-item search in site-specific search, and (2) map Web pages to database records. We rigorously evaluate our proposed approach on several real-world test collections. We find that our approach significantly improves performance over baseline and existing techniques in both tasks.
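
The abstract does not spell out how link paths are collected or aggregated, so the following is only a minimal sketch of the general idea, not the authors' method. It assumes a hypothetical toy site graph (`LINKS`), a hypothetical depth limit (`max_len`), and a hypothetical weighting scheme in which anchor terms count less the farther their link is from the target page (`decay`). The function names are likewise illustrative.

```python
from collections import Counter

# Hypothetical toy site graph: (source page, target page) -> anchor text.
LINKS = {
    ("home", "people"): "People",
    ("home", "research"): "Research",
    ("people", "faculty"): "Faculty",
    ("faculty", "jdoe"): "Jane Doe",
    ("research", "jdoe"): "Data Mining Group lead",
}


def link_paths(links, target, max_len=3):
    """Collect cycle-free link paths that end at `target`.

    Each path is a list of anchor texts ordered from the farthest link
    to the link pointing directly at `target`. A path is recorded when it
    reaches `max_len` links or cannot be extended further backwards.
    """
    incoming = {}
    for (src, dst), anchor in links.items():
        incoming.setdefault(dst, []).append((src, anchor))

    paths = []

    def walk(node, anchors, visited):
        preds = [(s, a) for s, a in incoming.get(node, []) if s not in visited]
        if len(anchors) == max_len or not preds:
            if anchors:
                paths.append(anchors[::-1])
            return
        for src, anchor in preds:
            walk(src, anchors + [anchor], visited | {src})

    walk(target, [], {target})
    return paths


def aggregate(paths, decay=0.5):
    """Assumed aggregation: weight each anchor term by decay ** distance,
    where distance 0 is the link pointing directly at the target.
    An anchor shared by several paths is counted once per path."""
    weights = Counter()
    for path in paths:
        for distance, anchor in enumerate(reversed(path)):
            for term in anchor.lower().split():
                weights[term] += decay ** distance
    return weights


paths = link_paths(LINKS, "jdoe")
weights = aggregate(paths)
```

Under these assumptions, the page `jdoe` picks up terms such as "faculty" and "research" from links it never receives directly, which is the kind of extra context the abstract attributes to link paths. The resulting term weights could then be indexed alongside the page's own text.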


      Published In

      HT '12: Proceedings of the 23rd ACM conference on Hypertext and social media
      June 2012
      340 pages
      ISBN:9781450313353
      DOI:10.1145/2309996

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Author Tags

      1. anchor text
      2. document indexing
      3. link paths
      4. record linkage
      5. web

      Qualifiers

      • Research-article

      Conference

      HT '12: 23rd ACM Conference on Hypertext and Social Media
      June 25 - 28, 2012
      Milwaukee, Wisconsin, USA

      Acceptance Rates

      HT '12 Paper Acceptance Rate 33 of 120 submissions, 28%;
      Overall Acceptance Rate 378 of 1,158 submissions, 33%


      Cited By

      • (2014) Mining Interesting Meta-Paths from Complex Heterogeneous Information Networks. 2014 IEEE International Conference on Data Mining Workshop, pages 488-495. DOI: 10.1109/ICDMW.2014.25. Online publication date: Dec-2014
      • (2013) The parallel path framework for entity discovery on the web. ACM Transactions on the Web, 7:3, pages 1-29. DOI: 10.1145/2516633.2516638. Online publication date: 30-Sep-2013
      • (2013) Exploring structure and content on the web. Proceedings of the sixth ACM international conference on Web search and data mining, pages 779-780. DOI: 10.1145/2433396.2433499. Online publication date: 4-Feb-2013
      • (2013) Building Enhanced Link Context by Logical Sitemap. Knowledge Science, Engineering and Management, pages 36-47. DOI: 10.1007/978-3-642-39787-5_4. Online publication date: 2013
