A Review of Web Document Clustering Approaches

Oikonomakou, Nora; Vazirgiannis, Michalis

doi:10.1007/978-0-387-09823-4_48

Summary

The Internet has become the largest data repository, and its users face the problem of information overload. The web search environment is far from ideal: the abundance of information, combined with the dynamic and heterogeneous nature of the Web, makes information retrieval difficult for the average user. There is therefore a real need for techniques that help users organize and browse the available information effectively, with the ultimate goal of satisfying their information need. Cluster analysis, which organizes a collection of objects into cohesive groups, can play a very important role in achieving this objective. In this chapter, we present an exhaustive survey of web document clustering approaches available in the literature, classified into three main categories: text-based, link-based and hybrid. Furthermore, we present a thorough comparison of the algorithms based on the various facets of their features and functionality. Finally, based on this review of the different approaches, we conclude that although clustering has been a topic of study in the scientific community for three decades, many open issues still call for more research.

Author information

Authors and Affiliations

Department of Informatics, Athens University of Economics and Business (AUEB), Patision 76, 10434, Athens, Greece
Nora Oikonomakou & Michalis Vazirgiannis

Corresponding author

Correspondence to Nora Oikonomakou.

Editor information

Editors and Affiliations

Dept. Industrial Engineering, Tel Aviv University, Ramat Aviv, 69978, Israel
Oded Maimon
Dept. Information Systems Engineering, Ben-Gurion University of the Negev, Beer-Sheva, 84105, Israel
Lior Rokach

About this chapter

Cite this chapter

Oikonomakou, N., Vazirgiannis, M. (2009). A Review of Web Document Clustering Approaches. In: Maimon, O., Rokach, L. (eds) Data Mining and Knowledge Discovery Handbook. Springer, Boston, MA. https://doi.org/10.1007/978-0-387-09823-4_48

DOI: https://doi.org/10.1007/978-0-387-09823-4_48
Published: 07 July 2010
Publisher Name: Springer, Boston, MA
Print ISBN: 978-0-387-09822-7
Online ISBN: 978-0-387-09823-4
eBook Packages: Computer Science (R0)
