Abstract
Many applications need to detect changes on the web in order to help users understand page updates and, more generally, web dynamics. Web archiving is one field where detecting changes in web pages is important. Archiving institutes collect and preserve successive versions of web sites for future generations. A major problem for archiving systems is understanding what happened between two versions of a web page. In this paper, we address this requirement by proposing a new change detection approach that computes the semantic differences between two versions of an HTML web page. Our approach, called Vi-DIFF, detects changes in the visual representation of web pages. It detects two types of changes: content changes and structural changes. Content changes are modifications to text, hyperlinks and images. Structural changes, in contrast, alter the visual appearance of the page and the structure of its blocks. Vi-DIFF can serve various applications such as crawl optimization, archive maintenance and web change browsing. Experiments on Vi-DIFF were conducted and the results are promising.
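The distinction between content and structural changes can be illustrated with a minimal sketch. This is not the Vi-DIFF algorithm itself, which the abstract does not specify. It assumes each page version has already been segmented into visual blocks, and it assumes blocks can be matched across versions by a stable identifier. The names `Block` and `diff_pages` are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Block:
    """A visual block of a page (hypothetical, simplified representation)."""
    block_id: str  # assumed stable across versions; real matching is harder
    texts: frozenset = field(default_factory=frozenset)
    links: frozenset = field(default_factory=frozenset)
    images: frozenset = field(default_factory=frozenset)


def diff_pages(old, new):
    """Classify differences between two lists of blocks.

    Structural changes: blocks that appear or disappear.
    Content changes: text, link or image differences inside a block
    that exists in both versions.
    """
    old_by_id = {b.block_id: b for b in old}
    new_by_id = {b.block_id: b for b in new}
    changes = {"structural": [], "content": []}

    # Blocks present in only one version are structural changes.
    for block_id in sorted(old_by_id.keys() - new_by_id.keys()):
        changes["structural"].append(("delete", block_id))
    for block_id in sorted(new_by_id.keys() - old_by_id.keys()):
        changes["structural"].append(("insert", block_id))

    # Blocks present in both versions are compared field by field.
    for block_id in sorted(old_by_id.keys() & new_by_id.keys()):
        before_block, after_block = old_by_id[block_id], new_by_id[block_id]
        for kind in ("texts", "links", "images"):
            before = getattr(before_block, kind)
            after = getattr(after_block, kind)
            if before != after:
                removed = sorted(before - after)
                added = sorted(after - before)
                changes["content"].append((block_id, kind, removed, added))
    return changes


if __name__ == "__main__":
    version_1 = [
        Block("header", texts=frozenset({"News"})),
        Block("sidebar", links=frozenset({"http://a.example"})),
    ]
    version_2 = [
        Block("header", texts=frozenset({"Latest news"})),
        Block("footer"),
    ]
    print(diff_pages(version_1, version_2))
```

Matching blocks by identifier keeps the sketch short. A real system would have to match blocks whose position or markup has changed between versions, and the sketch does not attempt that.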
Copyright information
© 2010 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Pehlivan, Z., Ben-Saad, M., Gançarski, S. (2010). Vi-DIFF: Understanding Web Pages Changes. In: Bringas, P.G., Hameurlain, A., Quirchmayr, G. (eds) Database and Expert Systems Applications. DEXA 2010. Lecture Notes in Computer Science, vol 6261. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-15364-8_1
DOI: https://doi.org/10.1007/978-3-642-15364-8_1
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-15363-1
Online ISBN: 978-3-642-15364-8
eBook Packages: Computer Science, Computer Science (R0)