On Combining Link and Contents Information for Web Page Clustering

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 2453)

Abstract

Clustering is currently one of the most crucial techniques for dealing with (e.g. locating resources, interpreting information) the massive amount of heterogeneous information on the web, which is beyond a human being’s capacity to digest. In this paper, we discuss the shortcomings of previous approaches and present a unifying clustering algorithm that clusters web search results for a specific query topic by combining link and contents information. In particular, we investigate how to combine link and contents analysis in the clustering process to improve the quality and interpretability of web search results. The proposed approach automatically clusters the web search results into high-quality, semantically meaningful groups in a concise, easy-to-interpret hierarchy with tagging terms. Preliminary experiments and evaluations are conducted, and the experimental results show that the proposed approach is effective and promising.

Keywords: co-citation, coupling, anchor window, snippet
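
As a rough illustration of what combining link and contents information can look like, the sketch below scores two pages by a weighted sum of three cosine similarities: shared in-links (co-citation), shared out-links (coupling), and shared snippet terms. This is not the paper's algorithm, which the abstract does not specify; the `page` and `combined_similarity` helpers, the equal weights, and the sample URLs and snippets are all assumptions made for the example.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as Counters."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def page(in_links, out_links, snippet):
    """Hypothetical page record: link sets plus a bag of snippet terms."""
    return {
        "in_links": Counter(in_links),    # pages that cite this page
        "out_links": Counter(out_links),  # pages that this page cites
        "terms": Counter(snippet.lower().split()),
    }


def combined_similarity(p, q, weights=(1 / 3, 1 / 3, 1 / 3)):
    """Weighted sum of co-citation, coupling, and term similarity.

    The equal default weights are an assumption, not a value from the paper.
    """
    w_in, w_out, w_terms = weights
    return (
        w_in * cosine(p["in_links"], q["in_links"])       # co-citation
        + w_out * cosine(p["out_links"], q["out_links"])  # coupling
        + w_terms * cosine(p["terms"], q["terms"])        # contents
    )


# Made-up search results for an ambiguous query.
car_a = page(["hub1", "hub2"], ["ref1"], "jaguar car dealer")
car_b = page(["hub1", "hub2"], ["ref1", "ref2"], "jaguar car prices")
cat = page(["hub9"], ["ref7"], "big cat habitat")

print(combined_similarity(car_a, car_b))
print(combined_similarity(car_a, cat))
```

A pairwise score of this kind could feed any standard clustering procedure; how link and contents evidence should be weighted is left open here.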

Copyright information

© 2002 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Wang, Y., Kitsuregawa, M. (2002). On Combining Link and Contents Information for Web Page Clustering. In: Hameurlain, A., Cicchetti, R., Traunmüller, R. (eds) Database and Expert Systems Applications. DEXA 2002. Lecture Notes in Computer Science, vol 2453. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-46146-9_89

  • DOI: https://doi.org/10.1007/3-540-46146-9_89

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-44126-7

  • Online ISBN: 978-3-540-46146-3

  • eBook Packages: Springer Book Archive
