Abstract
Tools that allow effective information organisation, access and navigation are becoming increasingly important on the Web. Similarity between web pages is central to such tools. In this paper, we examine the effect that content and layout-related aspects of web pages have on web page similarity. We consider the textual content contained within common HTML tags, the structural layout of pages, and the query terms contained within pages. Our study shows that combinations of factors can yield more promising results than individual factors, and that different aspects of web pages affect similarity between pages in different ways. We identified a number of factors that, when taken into account, can result in effective measures of similarity between web pages. Query information, in particular, proved to be important for the effective organisation of web pages.
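To make the idea of combining factors concrete, the following is a minimal sketch and not the method evaluated in the paper. It assumes each page has already been split into token lists per factor (for example the text inside a title tag and the body text), scores each factor with cosine similarity over term frequencies, and mixes the scores with a weighted average. The function names, the factor names and the weights are all hypothetical; a structural-layout factor could be added in the same way by supplying tokens that describe the layout.

```python
import math
from collections import Counter


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty vector is treated as dissimilar to everything
    return dot / (norm_a * norm_b)


def query_tokens(page: dict, query_terms: set) -> list:
    """All tokens of a page, across factors, that are also query terms."""
    return [t for tokens in page.values() for t in tokens if t in query_terms]


def combined_similarity(page_a: dict, page_b: dict,
                        query_terms: set, weights: dict) -> float:
    """Weighted average of per-factor cosine similarities.

    page_a, page_b: factor name -> list of tokens (hypothetical layout).
    weights: factor name -> weight; the special factor "query" compares
    only the query terms that occur in each page.
    """
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    score = 0.0
    for factor, weight in weights.items():
        if factor == "query":
            tokens_a = query_tokens(page_a, query_terms)
            tokens_b = query_tokens(page_b, query_terms)
        else:
            tokens_a = page_a.get(factor, [])
            tokens_b = page_b.get(factor, [])
        score += weight * cosine(Counter(tokens_a), Counter(tokens_b))
    return score / total


# Illustrative input only; the weights are arbitrary, not values from the paper.
page_1 = {"title": ["web", "page"], "body": ["similarity", "of", "web", "pages"]}
page_2 = {"title": ["web", "search"], "body": ["ranking", "of", "web", "pages"]}
weights = {"title": 1.0, "body": 1.0, "query": 2.0}
print(combined_similarity(page_1, page_2, {"web"}, weights))
```

Giving the "query" factor a larger weight is one simple way to let query information dominate the measure; whether that helps on a given collection is an empirical question.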
Copyright information
© 2005 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Tombros, A., Ali, Z. (2005). Factors Affecting Web Page Similarity. In: Losada, D.E., Fernández-Luna, J.M. (eds) Advances in Information Retrieval. ECIR 2005. Lecture Notes in Computer Science, vol 3408. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-31865-1_35
DOI: https://doi.org/10.1007/978-3-540-31865-1_35
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-25295-5
Online ISBN: 978-3-540-31865-1
eBook Packages: Computer Science (R0)