Abstract
This paper proposes a hyperlink-based web page similarity measurement and two matrix-based hierarchical web page clustering algorithms. The similarity measurement incorporates hyperlink transitivity and page importance within the web page space under consideration. One clustering algorithm takes cluster overlapping into account; the other does not. These algorithms do not require predefined similarity thresholds for clustering and are independent of the page order. The primary evaluations show that the proposed algorithms are effective in improving clustering.
Copyright information
© 2003 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Hou, J., Zhang, Y., Cao, J. (2003). Web Page Clustering: A Hyperlink-Based Similarity and Matrix-Based Hierarchical Algorithms. In: Zhou, X., Orlowska, M.E., Zhang, Y. (eds) Web Technologies and Applications. APWeb 2003. Lecture Notes in Computer Science, vol 2642. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-36901-5_22
DOI: https://doi.org/10.1007/3-540-36901-5_22
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-02354-8
Online ISBN: 978-3-540-36901-1
eBook Packages: Springer Book Archive