Abstract
This chapter surveys clustering methods relevant to clustering Web elements for better information access. We start with classical methods of cluster analysis that seem relevant to clustering Web data. Graph clustering is also described, since its methods contribute significantly to clustering Web data, and artificial neural networks for clustering are covered for the same reason. Building on this material, the core of the chapter gives an overview of approaches to clustering in the Web environment. In particular, we focus on clustering Web search results, where clustering search engines arrange the results into groups around a common theme. We conclude with some general considerations concerning the justification for so many clustering algorithms and their application in the Web environment.
Copyright information
© 2009 Springer-Verlag Berlin Heidelberg
About this chapter
Cite this chapter
Húsek, D., Pokorný, J., Řezanková, H., Snášel, V. (2009). Web Data Clustering. In: Abraham, A., Hassanien, AE., de Carvalho, A.P.d.L.F. (eds) Foundations of Computational Intelligence Volume 4. Studies in Computational Intelligence, vol 204. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-01088-0_14
DOI: https://doi.org/10.1007/978-3-642-01088-0_14
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-01087-3
Online ISBN: 978-3-642-01088-0
eBook Packages: Engineering, Engineering (R0)