A Review of Web Document Clustering Approaches

Data Mining and Knowledge Discovery Handbook

Summary

The Internet has become the largest data repository, and its users now face the problem of information overload. The web search environment, however, is far from ideal: the abundance of information, combined with the dynamic and heterogeneous nature of the Web, makes information retrieval difficult for the average user. There is therefore a clear need for techniques that help users organize and browse the available information effectively, with the ultimate goal of satisfying their information need. Cluster analysis, which organizes a collection of objects into cohesive groups, can play a very important role in achieving this objective. In this chapter, we present an exhaustive survey of the web document clustering approaches available in the literature, classified into three main categories: text-based, link-based and hybrid. We also present a thorough comparison of the algorithms based on various facets of their features and functionality. Finally, based on this review, we conclude that although clustering has been a topic of study in the scientific community for three decades, many open issues still call for further research.


Author information

Corresponding author

Correspondence to Nora Oikonomakou.

Copyright information

© 2009 Springer Science+Business Media, LLC

About this chapter

Cite this chapter

Oikonomakou, N., Vazirgiannis, M. (2009). A Review of Web Document Clustering Approaches. In: Maimon, O., Rokach, L. (eds) Data Mining and Knowledge Discovery Handbook. Springer, Boston, MA. https://doi.org/10.1007/978-0-387-09823-4_48

  • DOI: https://doi.org/10.1007/978-0-387-09823-4_48

  • Publisher Name: Springer, Boston, MA

  • Print ISBN: 978-0-387-09822-7

  • Online ISBN: 978-0-387-09823-4

  • eBook Packages: Computer Science, Computer Science (R0)
