Summary
The Internet has become the largest data repository, and with it comes the problem of information overload. The web search environment, however, is far from ideal. The abundance of information, combined with the dynamic and heterogeneous nature of the Web, makes information retrieval a difficult process for the average user. There is therefore a clear need for techniques that help users organize and browse the available information effectively, with the ultimate goal of satisfying their information need. Cluster analysis, which deals with organizing a collection of objects into cohesive groups, can play a very important role in achieving this objective. In this chapter, we present an exhaustive survey of the web document clustering approaches available in the literature, classified into three main categories: text-based, link-based and hybrid. Furthermore, we present a thorough comparison of the algorithms based on the various facets of their features and functionality. Finally, based on the review of the different approaches, we conclude that although clustering has been a research topic in the scientific community for three decades, there are still many open issues that call for more research.
Copyright information
© 2009 Springer Science+Business Media, LLC
About this chapter
Cite this chapter
Oikonomakou, N., Vazirgiannis, M. (2009). A Review of Web Document Clustering Approaches. In: Maimon, O., Rokach, L. (eds) Data Mining and Knowledge Discovery Handbook. Springer, Boston, MA. https://doi.org/10.1007/978-0-387-09823-4_48
DOI: https://doi.org/10.1007/978-0-387-09823-4_48
Publisher Name: Springer, Boston, MA
Print ISBN: 978-0-387-09822-7
Online ISBN: 978-0-387-09823-4
eBook Packages: Computer Science, Computer Science (R0)