Abstract
Web pages change frequently, so crawlers must re-download them often to keep local copies fresh. Various policies have been proposed for refreshing these copies. In this paper, we introduce a new sampling method that outperforms other change detection methods in our experiments. Change Frequency (CF) is a method that predicts how often each page changes and, in the long run, achieves optimal efficiency compared with the sampling method. We propose a new hybrid method that combines our sampling approach with CF, and we show how it improves the efficiency of change detection.
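The abstract does not spell out how the sampling approach and CF are combined, so the following is only a minimal illustrative sketch, not the method of the paper. It assumes that pages are grouped (for example by site), that a small random sample from each group estimates the fraction of changed pages, and that a per-page change rate is kept from past visits. The names `Page`, `sample_ratio`, `hybrid_scores`, the linear `weight` blend, and the `has_changed` callback are all hypothetical.

```python
import random
from dataclasses import dataclass


@dataclass
class Page:
    url: str
    visits: int = 0   # how many times the page has been re-downloaded
    changes: int = 0  # on how many of those visits a change was detected

    def change_rate(self) -> float:
        # Naive CF-style estimate (assumption): detected changes per visit.
        return self.changes / self.visits if self.visits else 0.0


def sample_ratio(pages, has_changed, sample_size, rng):
    """Estimate the fraction of changed pages in a group from a random sample."""
    sample = rng.sample(pages, min(sample_size, len(pages)))
    if not sample:
        return 0.0
    return sum(1 for p in sample if has_changed(p)) / len(sample)


def hybrid_scores(groups, has_changed, sample_size=2, weight=0.5, seed=0):
    """Blend the group-level sampled ratio with each page's own change rate.

    The linear blend is a hypothetical choice made for illustration only.
    """
    rng = random.Random(seed)
    scores = {}
    for pages in groups.values():
        ratio = sample_ratio(pages, has_changed, sample_size, rng)
        for p in pages:
            scores[p.url] = weight * ratio + (1 - weight) * p.change_rate()
    return scores
```

Under these assumptions, a crawler would refresh pages in descending score order. With `weight=1.0` the sketch reduces to pure sampling, and with `weight=0.0` to the per-page frequency estimate alone.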
Copyright information
© 2005 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Ghodsi, M., Hassanzadeh, O., Kamali, S., Monemizadeh, M. (2005). A Hybrid Approach for Refreshing Web Page Repositories. In: Zhou, L., Ooi, B.C., Meng, X. (eds) Database Systems for Advanced Applications. DASFAA 2005. Lecture Notes in Computer Science, vol 3453. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11408079_54
DOI: https://doi.org/10.1007/11408079_54
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-25334-1
Online ISBN: 978-3-540-32005-0
eBook Packages: Computer Science, Computer Science (R0)