ABSTRACT
The rapid growth of the web has been noted and tracked extensively. Recent studies have however documented the dual phenomenon: web pages have small half lives, and thus the web exhibits rapid death as well. Consequently, page creators are faced with an increasingly burdensome task of keeping links up-to-date, and many are falling behind. In addition to just individual pages, collections of pages or even entire neighborhoods of the web exhibit significant decay, rendering them less effective as information resources. Such neighborhoods are identified only by frustrated searchers, seeking a way out of these stale neighborhoods, back to more up-to-date sections of the web; measuring the decay of a page purely on the basis of dead links on the page is too naive to reflect this frustration. In this paper we formalize a strong notion of a decay measure and present algorithms for computing it efficiently. We explore this measure by presenting a number of validations, and use it to identify interesting artifacts on today's web. We then describe a number of applications of such a measure to search engines, web page maintainers, ontologists, and individual users.
- W. Aiello, F. Chung, and L. Lu. A random graph model for power law graphs. Experimental Mathematics, 10:53--66, 2001.Google ScholarCross Ref
- Z. Bar-Yossef, A. Berg, S. Chien, J. Fakcharoenphol, and D. Weitz. Approximating aggregate queries about web pages via random walks. In Proceedings of the 26th International Conference on Very Large Databases, pages 535--544, 2000. Google ScholarDigital Library
- A.-L. Barabási and R. Albert. Emergence of scaling in random networks. Science, 286:509--512, 1999.Google ScholarCross Ref
- K. Bharat, A. Broder, M. Henzinger, P. Kumar, and S. Venkatasubramanian. The connectivity server: Fast access to linkage information on the Web. In Proceedings of the 7th International World Wide Web Conference, pages 104--111, 1998. Google ScholarDigital Library
- K. Bharat and M. Henzinger. Improved algorithms for topic distillation in a hyperlinked environment. In Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 104--111, 1998. Google ScholarDigital Library
- B. Brewington and G. Cybenko. How dynamic is the web? In Proceedings of the Ninth International World Wide Web Conference, pages 257--276, May 2000. Google ScholarDigital Library
- S. Brin and L. Page. The anatomy of a large-scale hypertextual Web search engine. In Proceedings of the 7th International World Wide Web Conference, pages 107--117, 1998. Google ScholarDigital Library
- A. Z. Broder, S. C. Glassman, M. S. Manasse, and G. Zweig. Syntactic clustering of the Web. In Proceedings of the 6th International World Wide Web Conference, pages 391--404, 1997. Google ScholarDigital Library
- A. Z. Broder, R. Kumar, F. Maghoul, P. Raghavan, S. Rajagopalan, R. Stata, A. Tomkins, and J. Wiener. Graph structure in the web. WWW9/Computer Networks, 33(1--6):309--320, 2000. Google ScholarDigital Library
- A. Z. Broder, R. Lempel, F. Maghoul, and J. Pedersen. Efficient Pagerank approximation via graph aggregation. Manuscript.Google Scholar
- S. Chakrabarti, B. Dom, D. Gibson, R. Kumar, P. Raghavan, S. Rajagopalan, and A. Tomkins. Spectral filtering for resource discovery. In Proceedings of the ACM SIGIR Workshop on Hypertext Analysis, pages 13--21, 1998.Google Scholar
- S. Chakrabarti, M. van den Berg, and B. Dom. Focused crawling: a new approach to topic-specific web resource discovery. WWW8/Computer Networks, 31(11--16):1623--1640, 1999. Google ScholarDigital Library
- J. Cho and H. Garcia-Molina. The evolution of the web and implications for an incremental crawler. In Proceedings of the 26th International Conference on Very Large Databases, pages 200--209, 2000. Google ScholarDigital Library
- F. Douglis, A. Feldmann, B. Krishnamurthy, and J. C. Mogul. Rate of change and other metrics: a live study of the world wide web. In USENIX Symposium on Internet Technologies and Systems, 1997. Google ScholarDigital Library
- B. Edelman. Domains reregistered for distribution of unrelated content: A case study of "Tina's Free Live Webcam". http://cyber.law.harvard.edu/people/edelman/renewals/, 2002.Google Scholar
- D. Fetterly, M. Manasse, M. Najork, and J. L. Wiener. A large-scale study of the evolution of web pages. In Proceedings of the 12th International World Wide Web Conference, pages 669--678, 2003. Google ScholarDigital Library
- R. Fielding, J. Gettys, J. Mogul, H. Frystyk, L. Masinter, P. Leach, and T. Berners-Lee. RFC2616: Hypertext Transfer Protocol -- HTTP/1.1. http://www.w3.org/Protocols/rfc2616/rfc2616.html, June 1999. Google ScholarDigital Library
- T. Haveliwala. Topic-sensitive PageRank. In Proceedings of the 11th International World Wide Web Conference, pages 517--526, 2002. Google ScholarDigital Library
- M. Henzinger, A. Heydon, M. Mitzenmacher, and M. Najork. On near-uniform URL sampling. WWW9/Computer Networks, 33(1--6):295--308, 2000. Google ScholarDigital Library
- A. Jesdanun. Internet littered with dead web sites. http://story.news.yahoo.com/news tmpl=story&u=/ap/20031102/ap_on_hi_te/% deadwood_online_1, November 2002.Google Scholar
- J. M. Kleinberg. Authoritative sources in a hyperlinked environment. Journal of the ACM, 46(5):604--632, 1999. Google ScholarDigital Library
- W. Koehler. An analysis of web page and web site constancy and permanence. Journal of the American Society for Information Science, 50(2):162--180, 1999. Google ScholarDigital Library
- W. Koehler. Digital libraries and world wide web sites and page persistence. Information Research, 4(4), 1999.Google Scholar
- K. Kokoszkiewicz (a.k.a. Alectorides Conradus). Vocabula Computatralia Anglico-Latinum. University of Warsaw, Centre for Studies on the Classical Tradition in Poland and East-Central Europe (OBTA). http://www.obta.uw.edu.pl/ draco/docs/voccomp.html.Google Scholar
- R. Kumar, P. Raghavan, S. Rajagopalan, D. Sivakumar, A. Tomkins, and E. Upfal. Stochastic models for the web graph. In Proceedings of the 41st IEEE Annual Foundations of Computer Science, pages 57--65, 2000. Google ScholarDigital Library
- J. Markwell and D. W. Brooks. Broken links: The ephemeral nature of educational WWW hyperlinks. Journal of Science Education and Technology, 11(2):105--108, 2002.Google ScholarCross Ref
- J. Markwell and D. W. Brooks. "Link rot" limits the usefulness of web-based educational materials in biochemistry and molecular biology. Biochemistry and Molecular Biology Education, 31(1):69--72, 2003.Google ScholarCross Ref
- A. Ntoulas, J. Cho, and C. Olston. What's new on the web? The evolution of the web from a search engine perspective. In Proceedings of the 13th International World Wide Web Conference, 2004. Google ScholarDigital Library
- G. Pandurangan, P. Raghavan, and E. Upfal. Using PageRank to characterize web structure. In Computing and Combinatorics: 8th Annual International Conference, pages 330--339, 2002. Google ScholarDigital Library
- P. Rusmevichientong, D. M. Pennock, S. Lawrence, and C. L. Giles. Methods for sampling pages uniformly from the world wide web. In Proceedings of the AAAI Fall Symposium on Using Uncertainty Within Computation, pages 121--128, 2001.Google Scholar
- J. L. Wolf, M. S. Squillante, P. S. Yu, J. Sethuraman, and L. Ozsen. Optimal crawling strategies for web search engines. In Proceedings of the 11th International World Wide Web Conference, pages 136--147, 2002. Google ScholarDigital Library
Index Terms
- Sic transit gloria telae: towards an understanding of the web's decay
Recommendations
Vetting the links of the web
CIKM '09: Proceedings of the 18th ACM conference on Information and knowledge managementMany web links mislead human surfers and automated crawlers because they point to changed content, out-of-date information, or invalid URLs. It is a particular problem for large, well-known directories such as the dmoz Open Directory Project, which ...
Web data mining: exploring hyperlinks, contents, and usage data
This paper presents a review of the book "Web Data Mining - Exploring Hyperlinks, Contents, and Usage Data" by Bing Liu. The review concludes that the breadth and depth of this book makes it a required staple for every Web mining researcher, student, or ...
Using neighbors to date web documents
WIDM '07: Proceedings of the 9th annual ACM international workshop on Web information and data managementTime has been successfully used as a feature in web information retrieval tasks. In this context, estimating a document's inception date or last update date is a necessary task. Classic approaches have used HTTP header fields to estimate a document's ...
Comments