ABSTRACT
Mirroring Web sites is a well-known technique commonly used in the Web community. A mirror site should be updated frequently to ensure that it reflects the content of the original site. Existing mirroring tools apply page-level strategies that check every page of a site, which is inefficient and expensive. In this paper, we propose a novel site-level mirror maintenance strategy. Our approach studies the evolution of Web directory structures and mines association rules between ancestor-descendant Web directories. The discovered rules indicate how the evolution of one Web directory correlates with that of another. Thus, when maintaining the mirror of a Web site (directory), we can skip those subdirectories whose significant changes are negatively correlated with significant changes of the site (directory) itself. Preliminary experimental results show that our approach significantly improves the efficiency of the mirror maintenance process at a slight cost to the "freshness" of the mirrors.
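The idea of skipping negatively correlated subdirectories can be illustrated with a minimal sketch. The abstract does not state which correlation measure or thresholds are used, so the sketch below assumes lift as the measure, a hypothetical cutoff `max_lift`, and a hypothetical input format in which each snapshot interval is the set of directories that changed significantly. The function names `mine_negative_pairs` and `dirs_to_check` are illustrative and do not come from the paper.

```python
from collections import defaultdict
from pathlib import PurePosixPath


def is_ancestor(a: str, d: str) -> bool:
    """Return True if directory `a` is a proper ancestor of directory `d`."""
    return PurePosixPath(a) in PurePosixPath(d).parents


def mine_negative_pairs(history, max_lift=0.5):
    """Find ancestor-descendant pairs whose changes are negatively correlated.

    history:  list of sets; each set holds the directories that changed
              significantly in one snapshot interval (assumed format).
    max_lift: assumed cutoff; pairs with lift below it are treated as
              negatively correlated.
    Returns {ancestor: {descendants that can be skipped}}.
    """
    n = len(history)
    count = defaultdict(int)
    for changed in history:
        for d in changed:
            count[d] += 1

    dirs = sorted(count)
    skip = defaultdict(set)
    for a in dirs:
        for d in dirs:
            if not is_ancestor(a, d):
                continue
            both = sum(1 for changed in history if a in changed and d in changed)
            # lift = P(a and d) / (P(a) * P(d)); values below 1 mean the two
            # directories change together less often than chance would predict.
            lift = (both / n) / ((count[a] / n) * (count[d] / n))
            if lift < max_lift:
                skip[a].add(d)
    return dict(skip)


def dirs_to_check(root, all_dirs, skip_rules):
    """List the directories to re-check when `root` has changed significantly."""
    skipped = skip_rules.get(root, set())
    return sorted(
        d for d in all_dirs
        if d == root or (is_ancestor(root, d) and d not in skipped)
    )


# Toy change history: /site/news changes together with /site,
# while /site/archive only changes when /site does not.
history = [
    {"/site", "/site/news"},
    {"/site", "/site/news"},
    {"/site/archive"},
    {"/site", "/site/news"},
    {"/site/archive"},
    {"/site/archive"},
]
rules = mine_negative_pairs(history)
plan = dirs_to_check("/site", ["/site", "/site/news", "/site/archive"], rules)
```

In this toy history, `/site/archive` never changes in the same interval as `/site`, so its lift is 0 and it is skipped, whereas `/site/news` has a lift of 2 and is still checked. A real implementation would also need a support threshold and a definition of "significant change", both of which are left open here.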
Index Terms
- Mirror site maintenance based on evolution associations of web directories