DOI: 10.1145/1498759.1498837
research-article

The web changes everything: understanding the dynamics of web content

Published: 09 February 2009

Abstract

The Web is a dynamic, ever-changing collection of information. This paper explores changes in Web content by analyzing a crawl of 55,000 Web pages, selected to represent different user visitation patterns. Although change over long intervals has been explored on random (and potentially unvisited) samples of Web pages, little is known about the nature of finer-grained changes to pages that are actively consumed by users, such as those in our sample. We describe algorithms, analyses, and models for characterizing changes in Web content, focusing on both time (by using hourly and sub-hourly crawls) and structure (by looking at page-, DOM-, and term-level changes). Change rates are higher in our behavior-based sample than those found in previous work on randomly sampled pages, with a large portion of pages changing more than hourly. Detailed content and structure analyses identify stable and dynamic content within each page. The understanding of Web change we develop in this paper has implications for tools designed to help people interact with dynamic Web content, such as search engines, advertising, and Web browsers.
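
As a rough illustration of what a term-level comparison between successive crawls of one page might look like, the sketch below scores each later snapshot against the first one and extracts the terms that survive in every snapshot. It is not taken from the paper: the tokenizer, the function names (`terms`, `dice`, `change_profile`, `stable_terms`), and the choice of the Dice coefficient as the similarity measure are all assumptions made for this example.

```python
import re


def terms(text):
    """Return the set of lowercase alphanumeric tokens in text (assumed tokenizer)."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def dice(a, b):
    """Dice coefficient between two term sets: 2|A & B| / (|A| + |B|)."""
    if not a and not b:
        return 1.0  # two empty pages are treated as identical
    return 2 * len(a & b) / (len(a) + len(b))


def change_profile(snapshots):
    """Similarity of every later snapshot to the first snapshot of a page."""
    base = terms(snapshots[0])
    return [dice(base, terms(s)) for s in snapshots[1:]]


def stable_terms(snapshots):
    """Terms that appear in every snapshot (a crude notion of stable content)."""
    return set.intersection(*(terms(s) for s in snapshots))


if __name__ == "__main__":
    # Hypothetical hourly snapshots of one page.
    crawl = [
        "news home sports weather",
        "news home sports weather",
        "news home markets travel",
    ]
    print(change_profile(crawl))
    print(sorted(stable_terms(crawl)))
```

A similarity that drops quickly and then levels off would suggest a page with a fast-changing region on top of a stable core, which is the kind of distinction the abstract draws between dynamic and stable content.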




Published In

WSDM '09: Proceedings of the Second ACM International Conference on Web Search and Data Mining
February 2009
314 pages
ISBN: 9781605583907
DOI: 10.1145/1498759
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States


Author Tags

  1. change
  2. re-finding
  3. web page dynamics

Qualifiers

  • Research-article

Conference

WSDM'09

Acceptance Rates

Overall Acceptance Rate 498 of 2,863 submissions, 17%


Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months): 36
  • Downloads (Last 6 weeks): 4
Reflects downloads up to 08 Feb 2025


Cited By

  • (2024) Evaluation of Temporal Change in IR Test Collections. Proceedings of the 2024 ACM SIGIR International Conference on Theory of Information Retrieval, 3-13. DOI: 10.1145/3664190.3672530. Online publication date: 2-Aug-2024
  • (2024) Making Changes in Webpages Discoverable: A Change-Text Search Interface for Web Archives. Proceedings of the 2023 ACM/IEEE Joint Conference on Digital Libraries, 71-81. DOI: 10.1109/JCDL57899.2023.00021. Online publication date: 26-Jun-2024
  • (2023) Improving the Exploration/Exploitation Trade-Off in Web Content Discovery. Companion Proceedings of the ACM Web Conference 2023, 1183-1189. DOI: 10.1145/3543873.3587574. Online publication date: 30-Apr-2023
  • (2023) Look back, look around: A systematic analysis of effective predictors for new outlinks in focused Web crawling. Knowledge-Based Systems, 260 (110126). DOI: 10.1016/j.knosys.2022.110126. Online publication date: Jan-2023
  • (2022) A Web Information Extraction Framework with Adaptive and Failure Prediction Feature. Journal of Data and Information Quality, 14:2, 1-21. DOI: 10.1145/3495008. Online publication date: 23-Mar-2022
  • (2022) Experience: Analyzing Missing Web Page Visits and Unintentional Web Page Visits from the Client-side Web Logs. Journal of Data and Information Quality, 14:2, 1-17. DOI: 10.1145/3490392. Online publication date: 23-Mar-2022
  • (2022) Time Masking for Temporal Language Models. Proceedings of the Fifteenth ACM International Conference on Web Search and Data Mining, 833-841. DOI: 10.1145/3488560.3498529. Online publication date: 11-Feb-2022
  • (2021) The Problem of Reference Rot in Spatial Metadata Catalogues. ISPRS International Journal of Geo-Information, 11:1 (27). DOI: 10.3390/ijgi11010027. Online publication date: 31-Dec-2021
  • (2021) Towards Realistic and Reproducible Web Crawl Measurements. Proceedings of the Web Conference 2021, 80-91. DOI: 10.1145/3442381.3450050. Online publication date: 19-Apr-2021
  • (2021) CopyCat: Near-Duplicates Within and Between the ClueWeb and the Common Crawl. Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval, 2398-2404. DOI: 10.1145/3404835.3463246. Online publication date: 11-Jul-2021
