Article

Sic transit gloria telae: towards an understanding of the web's decay

Authors:
Ziv Bar-Yossef, IBM Almaden Research Center, San Jose, CA
Andrei Z. Broder, IBM T. J. Watson Research Center, Hawthorne, NY
Ravi Kumar, IBM Almaden Research Center, San Jose, CA
Andrew Tomkins, IBM Almaden Research Center, San Jose, CA

WWW '04: Proceedings of the 13th international conference on World Wide Web, May 2004, Pages 328–337. https://doi.org/10.1145/988672.988716

Published: 17 May 2004

ABSTRACT

The rapid growth of the web has been noted and tracked extensively. Recent studies have, however, documented the dual phenomenon: web pages have short half-lives, and thus the web exhibits rapid death as well. Consequently, page creators face an increasingly burdensome task of keeping links up to date, and many are falling behind. Beyond individual pages, collections of pages or even entire neighborhoods of the web exhibit significant decay, rendering them less effective as information resources. Such neighborhoods are identified only by frustrated searchers seeking a way out of them and back to more up-to-date sections of the web; measuring the decay of a page purely on the basis of dead links on the page is too naive to reflect this frustration. In this paper we formalize a strong notion of a decay measure and present algorithms for computing it efficiently. We explore this measure by presenting a number of validations, and use it to identify interesting artifacts on today's web. We then describe a number of applications of such a measure to search engines, web page maintainers, ontologists, and individual users.
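
The contrast the abstract draws, between counting dead links on a page and a stronger decay measure, can be illustrated with a small sketch. The abstract does not state the paper's definition, so everything below is an assumption made for illustration only: the function names, the damping parameter `sigma`, and the choice to propagate decay by averaging the scores of a page's outlinks. It should not be read as the authors' algorithm.

```python
from typing import Dict, List, Set


def naive_decay(page: str, links: Dict[str, List[str]], dead: Set[str]) -> float:
    """Fraction of a page's outlinks that point directly at dead pages."""
    out = links.get(page, [])
    if not out:
        return 0.0
    return sum(1 for q in out if q in dead) / len(out)


def propagated_decay(
    links: Dict[str, List[str]],
    dead: Set[str],
    sigma: float = 0.1,      # hypothetical damping parameter (assumption)
    iterations: int = 50,
) -> Dict[str, float]:
    """Illustrative decay score that also accounts for decay further away.

    Assumed rule: a dead page scores 1; a live page scores (1 - sigma)
    times the average score of the pages it links to; a live page with
    no outlinks scores 0. Scores are refined by repeated sweeps.
    """
    pages = set(links) | {q for outs in links.values() for q in outs} | set(dead)
    score = {p: (1.0 if p in dead else 0.0) for p in pages}
    for _ in range(iterations):
        new_score = {}
        for p in pages:
            if p in dead:
                new_score[p] = 1.0
                continue
            out = links.get(p, [])
            if out:
                new_score[p] = (1 - sigma) * sum(score[q] for q in out) / len(out)
            else:
                new_score[p] = 0.0
        score = new_score
    return score


# Toy graph: "a" links only to live pages, but every path from "a"
# reaches a dead page within two clicks.
example_links = {"a": ["b", "c"], "b": ["d", "e"], "c": ["d", "e"]}
example_dead = {"d", "e"}
example_scores = propagated_decay(example_links, example_dead, sigma=0.1)
```

In this toy graph, `naive_decay("a", example_links, example_dead)` is 0 because page "a" has no dead links of its own, while `example_scores["a"]` is about 0.81, which reflects the stale neighborhood a reader would actually run into.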

Published in
WWW '04: Proceedings of the 13th international conference on World Wide Web
May 2004
754 pages
ISBN: 158113844X
DOI: 10.1145/988672
Conference Chairs: Stuart Feldman (IBM Research), Mike Uretsky (New York University)
Program Chairs: Marc Najork (Microsoft Research), Craig Wills (Worcester Polytechnic Institute)
Copyright © 2004 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
404 return code
dead links
link analysis
web decay
web information retrieval
Acceptance Rates
Overall Acceptance Rate: 1,899 of 8,196 submissions, 23%