Part of the book series: Studies in Computational Intelligence ((SCI,volume 204))

Abstract

This chapter surveys clustering methods relevant to clustering Web elements for better information access. We start with classical methods of cluster analysis that seem relevant to the clustering of Web data. Graph clustering is also described, since its methods contribute significantly to clustering Web data. The use of artificial neural networks for clustering is covered for the same reason. Building on this material, the core of the chapter gives an overview of approaches to clustering in the Web environment. In particular, we focus on clustering Web search results, where clustering search engines arrange the results into groups around a common theme. We conclude with some general considerations concerning the justification for so many clustering algorithms and their application in the Web environment.

Copyright information

© 2009 Springer-Verlag Berlin Heidelberg

About this chapter

Cite this chapter

Húsek, D., Pokorný, J., Řezanková, H., Snášel, V. (2009). Web Data Clustering. In: Abraham, A., Hassanien, AE., de Carvalho, A.P.d.L.F. (eds) Foundations of Computational Intelligence Volume 4. Studies in Computational Intelligence, vol 204. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-01088-0_14

  • DOI: https://doi.org/10.1007/978-3-642-01088-0_14

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-01087-3

  • Online ISBN: 978-3-642-01088-0

  • eBook Packages: Engineering (R0)
