Web Page Clustering: A Hyperlink-Based Similarity and Matrix-Based Hierarchical Algorithms

  • Conference paper
Web Technologies and Applications (APWeb 2003)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 2642)

Abstract

This paper proposes a hyperlink-based web page similarity measurement and two matrix-based hierarchical web page clustering algorithms. The similarity measurement incorporates hyperlink transitivity and page importance within the web page space concerned. One clustering algorithm takes cluster overlapping into account; the other does not. These algorithms do not require predefined similarity thresholds for clustering and are independent of the page order. The primary evaluations show the effectiveness of the proposed algorithms in improving clustering.
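
The abstract does not give the similarity formula, so the following is only a minimal sketch of the general idea of hyperlink transitivity, not the authors' measurement. It assumes a hypothetical weighting in which a page reached through a link path of length d contributes `decay ** d`, and it compares two pages by the cosine of their weight vectors. The names `transitive_weights`, `similarity`, and `decay` are illustrative assumptions, and page importance is not modelled here.

```python
import math
from itertools import product


def transitive_weights(links, n, decay=0.5):
    """Build an n x n matrix of link weights that decay with path length.

    links: iterable of (source, target) page indices (direct hyperlinks).
    Assumption for illustration: weight = decay ** shortest_path_length,
    0 for unreachable pages and for a page with itself.
    """
    inf = math.inf
    dist = [[inf] * n for _ in range(n)]
    for src, dst in links:
        dist[src][dst] = 1
    # Floyd-Warshall: k (the intermediate page) is the outermost loop.
    for k, i, j in product(range(n), repeat=3):
        if dist[i][k] + dist[k][j] < dist[i][j]:
            dist[i][j] = dist[i][k] + dist[k][j]
    return [
        [decay ** dist[i][j] if i != j and dist[i][j] < inf else 0.0
         for j in range(n)]
        for i in range(n)
    ]


def similarity(weights, a, b):
    """Cosine similarity between the weight vectors of pages a and b."""
    dot = sum(x * y for x, y in zip(weights[a], weights[b]))
    norm_a = math.sqrt(sum(x * x for x in weights[a]))
    norm_b = math.sqrt(sum(x * x for x in weights[b]))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # a page with no outgoing paths is similar to nothing
    return dot / (norm_a * norm_b)


# Toy page space: pages 0 and 1 both link to 2, 2 links to 3, 4 is isolated.
toy_links = [(0, 2), (1, 2), (2, 3)]
toy_weights = transitive_weights(toy_links, n=5)
```

In this toy graph, pages 0 and 1 reach the same pages at the same distances, so their vectors coincide, while the isolated page 4 shares nothing with either. A pairwise matrix of such scores is the kind of input a matrix-based hierarchical clustering step could operate on.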

Copyright information

© 2003 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Hou, J., Zhang, Y., Cao, J. (2003). Web Page Clustering: A Hyperlink-Based Similarity and Matrix-Based Hierarchical Algorithms. In: Zhou, X., Orlowska, M.E., Zhang, Y. (eds) Web Technologies and Applications. APWeb 2003. Lecture Notes in Computer Science, vol 2642. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-36901-5_22

  • DOI: https://doi.org/10.1007/3-540-36901-5_22

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-02354-8

  • Online ISBN: 978-3-540-36901-1

  • eBook Packages: Springer Book Archive
