Revisiting Lexical Signatures to (Re-)Discover Web Pages

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 5173)

Abstract

A lexical signature (LS) is a small set of terms derived from a document that capture the “aboutness” of that document. An LS generated from a web page can be used to discover that page at a different URL as well as to find relevant pages on the Internet. From a set of randomly selected URLs we took all their copies from the Internet Archive between 1996 and 2007 and generated their LSs. We conducted an overlap analysis of terms in all LSs and found only small overlaps in the early years (1996–2000) but increasing overlap in the more recent past (from 2003 on). We measured the performance of all LSs as a function of the number of terms they consist of. We found that LSs created more recently perform better than early LSs created between 1996 and 2000. All LSs created from the year 2000 on show a similar pattern in their performance curve. Our results show that 5-, 6- and 7-term LSs perform best at returning the URLs of interest in the top ten of the result set. In about 50% of all cases these URLs are returned as the number one result, and in 30% of all cases we considered the URLs as not discovered.
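
To make the notion of a k-term LS concrete, the following is a minimal sketch and not the authors' implementation. It assumes that terms are ranked by a TF-IDF score, which the abstract does not state, and that document frequencies are estimated from a small background corpus. The function names `tokenize` and `lexical_signature`, the `corpus` parameter, and the smoothed IDF formula are all hypothetical choices made for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def lexical_signature(document, corpus, k=5):
    """Return the k terms of `document` with the highest TF-IDF score.

    `corpus` is a hypothetical list of background documents used only to
    estimate document frequencies; it is an assumption of this sketch.
    """
    term_freq = Counter(tokenize(document))
    corpus_terms = [set(tokenize(doc)) for doc in corpus]
    n_docs = len(corpus_terms)

    scores = {}
    for term, count in term_freq.items():
        # Number of background documents that contain the term.
        doc_freq = sum(1 for terms in corpus_terms if term in terms)
        # Smoothed IDF so that unseen terms do not cause a division by zero.
        idf = math.log((n_docs + 1) / (doc_freq + 1))
        scores[term] = count * idf

    # Highest score first; ties are broken alphabetically for determinism.
    ranked = sorted(scores, key=lambda term: (-scores[term], term))
    return ranked[:k]
```

Under these assumptions, the resulting terms would be submitted together as a search engine query, and the rank of the original URL in the result set would indicate how well the LS performs. Varying `k` corresponds to varying the number of terms in the LS.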

Editor information

Birte Christensen-Dalsgaard, Donatella Castelli, Bolette Ammitzbøll Jurik, Joan Lippincott

Copyright information

© 2008 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Klein, M., Nelson, M.L. (2008). Revisiting Lexical Signatures to (Re-)Discover Web Pages. In: Christensen-Dalsgaard, B., Castelli, D., Ammitzbøll Jurik, B., Lippincott, J. (eds) Research and Advanced Technology for Digital Libraries. ECDL 2008. Lecture Notes in Computer Science, vol 5173. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-87599-4_38

  • DOI: https://doi.org/10.1007/978-3-540-87599-4_38

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-87598-7

  • Online ISBN: 978-3-540-87599-4

  • eBook Packages: Computer Science, Computer Science (R0)
