Abstract
A number of similarity metrics have been used in the literature to measure the degree of web page changes. When a web page changes, these metrics often represent the change differently. In this paper, we first define criteria for evaluating how effectively a metric captures six important types of web page changes. Second, we propose a new similarity metric suited to measuring the degree of web page changes. Using real and synthesized web pages, we analyze five existing metrics (byte-wise comparison, TF∙IDF cosine distance, word distance, edit distance, and shingling) and ours under the proposed criteria. The analysis shows that our metric represents the changes more effectively than the other metrics. We expect that our study can help users select an appropriate metric for particular web applications.
This work was supported by Korea Research Foundation Grant (KRF-2004-005-D00172).
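To illustrate why different metrics can represent the same change differently, the following is a minimal sketch of two of the existing metrics named in the abstract: shingling and edit distance. It is not the paper's formulation. The function names, the word-level tokenization, the shingle width `w=4`, and the use of Jaccard resemblance are all assumptions made for illustration.

```python
def shingles(text, w=4):
    """Return the set of contiguous w-word sequences (shingles) in text.

    Whitespace tokenization and w=4 are illustrative assumptions.
    """
    words = text.split()
    if len(words) < w:
        # Short texts yield a single shingle (or none if the text is empty).
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + w]) for i in range(len(words) - w + 1)}


def shingle_resemblance(a, b, w=4):
    """Jaccard resemblance of the two shingle sets, in [0, 1]."""
    sa, sb = shingles(a, w), shingles(b, w)
    if not sa and not sb:
        return 1.0  # two empty pages are treated as identical
    return len(sa & sb) / len(sa | sb)


def word_edit_distance(a, b):
    """Levenshtein distance computed over words rather than characters."""
    x, y = a.split(), b.split()
    prev = list(range(len(y) + 1))  # distances from the empty prefix of x
    for i, xi in enumerate(x, 1):
        cur = [i]
        for j, yj in enumerate(y, 1):
            cur.append(min(prev[j] + 1,               # delete xi
                           cur[j - 1] + 1,            # insert yj
                           prev[j - 1] + (xi != yj))) # substitute or match
        prev = cur
    return prev[-1]


old = "the quick brown fox jumps over the lazy dog"
new = "the quick brown cat jumps over the lazy dog"
print(word_edit_distance(old, new))      # 1
print(shingle_resemblance(old, new, 4))  # 0.2
```

In this hypothetical example a single replaced word gives a word-level edit distance of 1 out of 9 words, while the shingle resemblance falls to 0.2 because four of the six shingles in each version contain the changed word. The same small change therefore looks minor under one metric and large under the other.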
© 2006 Springer-Verlag Berlin Heidelberg
Cite this paper
Kwon, S.Y., Lee, S.H., Kim, S.J. (2006). A Precise Metric for Measuring How Much Web Pages Change. In: Li Lee, M., Tan, KL., Wuwongse, V. (eds) Database Systems for Advanced Applications. DASFAA 2006. Lecture Notes in Computer Science, vol 3882. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11733836_39
DOI: https://doi.org/10.1007/11733836_39
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-33337-1
Online ISBN: 978-3-540-33338-8
eBook Packages: Computer Science, Computer Science (R0)