An enhanced Web page change detection approach based on limiting similarity computations to elements of same type

Journal of Intelligent Information Systems

Abstract

This paper describes an efficient Web page change detection approach that restricts the similarity computations between two versions of a given Web page to nodes with the same HTML tag type. Before the similarity computations are performed, the HTML page is transformed into an XML-like structure in which each node corresponds to a pair of opening and closing HTML tags. Analytical expressions and supporting experimental results quantify the improvements of the proposed approach over the traditional one, which computes similarities across all nodes of both pages. The improvements are shown to depend strongly on the diversity of tags in the page: the more diverse the page is (i.e., the more it mixes text, images, links, etc.), the greater the improvements, and the more uniform it is, the smaller they are.
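
The idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the helper names (`NodeCollector`, `parse_nodes`, `comparison_counts`, `same_tag_similarities`) are hypothetical, the input is assumed to be well-formed HTML in which every tag is explicitly closed, and `difflib.SequenceMatcher` stands in for whatever similarity measure the paper actually uses. The sketch collects one node per opening and closing tag pair, then compares the number of similarity computations needed across all node pairs with the number needed when only nodes of the same tag are compared.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from html.parser import HTMLParser


class NodeCollector(HTMLParser):
    """Collect (tag, text) nodes, one per opening/closing tag pair.

    Assumption: the input is well formed, so every start tag has an end tag.
    """

    def __init__(self):
        super().__init__()
        self._stack = []  # currently open elements as [tag, [text pieces]]
        self.nodes = []   # completed nodes as (tag, text)

    def handle_starttag(self, tag, attrs):
        self._stack.append([tag, []])

    def handle_data(self, data):
        # Attach text to the innermost open element only.
        if self._stack and data.strip():
            self._stack[-1][1].append(data.strip())

    def handle_endtag(self, tag):
        if self._stack and self._stack[-1][0] == tag:
            name, pieces = self._stack.pop()
            self.nodes.append((name, " ".join(pieces)))


def parse_nodes(html):
    collector = NodeCollector()
    collector.feed(html)
    collector.close()
    return collector.nodes


def group_by_tag(nodes):
    groups = defaultdict(list)
    for tag, text in nodes:
        groups[tag].append(text)
    return groups


def comparison_counts(old_nodes, new_nodes):
    """Return (all_pairs, same_tag_pairs) numbers of similarity computations."""
    all_pairs = len(old_nodes) * len(new_nodes)
    old_groups = group_by_tag(old_nodes)
    new_groups = group_by_tag(new_nodes)
    same_tag_pairs = sum(
        len(texts) * len(new_groups.get(tag, []))
        for tag, texts in old_groups.items()
    )
    return all_pairs, same_tag_pairs


def same_tag_similarities(old_nodes, new_nodes):
    """Score each old node only against new nodes that share its tag."""
    new_groups = group_by_tag(new_nodes)
    scores = {}
    for i, (tag, old_text) in enumerate(old_nodes):
        for j, new_text in enumerate(new_groups.get(tag, [])):
            # Stand-in similarity measure (an assumption, not the paper's).
            scores[(tag, i, j)] = SequenceMatcher(None, old_text, new_text).ratio()
    return scores


# A page with diverse tags versus a page dominated by one tag.
diverse_old = parse_nodes("<div><h1>News</h1><p>Rain today</p><a>More</a></div>")
diverse_new = parse_nodes("<div><h1>News</h1><p>Sun today</p><a>More</a></div>")
uniform = parse_nodes("<ul><li>a</li><li>b</li><li>c</li></ul>")

print(comparison_counts(diverse_old, diverse_new))
print(comparison_counts(uniform, uniform))
```

In this toy input both pages have four nodes, so the all-pairs approach needs 16 computations in either case. Restricting to same-tag pairs needs only 4 for the diverse page but 10 for the uniform one, which mirrors the abstract's claim that the gain grows with tag diversity.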

Author information

Corresponding author

Correspondence to Hassan Artail.

About this article

Cite this article

Artail, H., Abi-Aad, M. An enhanced Web page change detection approach based on limiting similarity computations to elements of same type. J Intell Inf Syst 32, 1–21 (2009). https://doi.org/10.1007/s10844-007-0046-z

  • DOI: https://doi.org/10.1007/s10844-007-0046-z
