Abstract
This paper describes an efficient Web page change detection approach based on restricting the similarity computations between two versions of a given Web page to the nodes with the same HTML tag type. Before performing the similarity computations, the HTML Web page is transformed into an XML-like structure in which a node corresponds to an open-close HTML tag pair. Analytical expressions and supporting experimental results are used to quantify the improvements that the proposed approach makes over the traditional one, which computes the similarities across all nodes of both pages. It is shown that the improvements are highly dependent on the diversity of tags in the page. That is, the more diverse the page is (i.e., the more it contains mixed content of text, images, links, etc.), the greater the improvements are, while the more uniform it is, the smaller they are.
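The dependence on tag diversity can be illustrated by counting node comparisons rather than computing similarities. The following Python sketch is an illustration only and is not the authors' implementation: it assumes that each opening HTML tag stands for one node, that the traditional approach compares every node of the old version with every node of the new one, and that the restricted approach compares only nodes sharing a tag name. The names `TagCollector`, `tag_counts`, and `comparison_counts` are hypothetical.

```python
from collections import Counter
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Records the name of every opening tag; each one stands in for a node."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


def tag_counts(html):
    """Return a Counter mapping tag name -> number of nodes with that tag."""
    collector = TagCollector()
    collector.feed(html)
    collector.close()
    return Counter(collector.tags)


def comparison_counts(old_html, new_html):
    """Return (all-pairs comparisons, same-tag-only comparisons)."""
    old, new = tag_counts(old_html), tag_counts(new_html)
    # Assumed traditional cost: every old node against every new node.
    all_pairs = sum(old.values()) * sum(new.values())
    # Assumed restricted cost: only pairs of nodes with the same tag name.
    same_tag_pairs = sum(count * new[tag] for tag, count in old.items())
    return all_pairs, same_tag_pairs


# Two toy pages with four nodes each: one mixed, one made of a single tag type.
diverse = '<div><p>text</p><img src="x.png"><a href="#">link</a></div>'
uniform = '<p>a</p><p>b</p><p>c</p><p>d</p>'

print(comparison_counts(diverse, diverse))
print(comparison_counts(uniform, uniform))
```

Under these assumptions the mixed page needs 4 same-tag comparisons instead of 16, whereas the page built from a single tag type needs 16 either way, which mirrors the abstract's claim that the gain grows with tag diversity and vanishes for uniform pages.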
Cite this article
Artail, H., Abi-Aad, M. An enhanced Web page change detection approach based on limiting similarity computations to elements of same type. J Intell Inf Syst 32, 1–21 (2009). https://doi.org/10.1007/s10844-007-0046-z
DOI: https://doi.org/10.1007/s10844-007-0046-z