Abstract
Data collection and analysis in web mining face unique challenges. For reasons inherent in web browsing and web logging, bad or incomplete data is more likely than in conventional applications, and the analytical techniques used in web mining need to accommodate such data. Fuzzy and rough sets provide the ability to deal with incomplete and approximate information. Fuzzy set theory has been shown to be useful in three important aspects of web and data mining: clustering, association, and sequential analysis. Interest in clustering based on rough set theory is also growing. Clustering is an important part of web mining that involves finding natural groupings of web resources or web users. Researchers have pointed out important differences between clustering in conventional applications and clustering in web mining. For example, the clusters and associations in web mining do not necessarily have crisp boundaries. As a result, researchers have studied the possibility of using fuzzy sets in web mining clustering applications. Recent attempts have used genetic algorithms based on rough set theory for clustering. However, clustering based on genetic algorithms may not be able to handle the large amounts of data typical of a web mining application. This paper proposes a variation of the K-means clustering algorithm based on properties of rough sets. The proposed algorithm represents clusters as interval or rough sets. The paper also describes the design of an experiment, including data collection and the clustering process. The experiment is used to create interval set representations of clusters of web visitors.
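The abstract does not spell out the algorithm, so the following is only a minimal sketch of how a K-means variant could represent each cluster as an interval set, that is, a lower approximation (objects that definitely belong) and an upper approximation (objects that possibly belong). The assignment rule (a distance-difference `threshold`), the weights `w_lower` and `w_upper`, and all function and parameter names are assumptions made for illustration; they follow the form of rough K-means commonly described in the literature and are not taken from this text.

```python
import math


def mean(points, indices):
    """Component-wise mean of the points selected by `indices`."""
    dim = len(points[0])
    return [sum(points[i][d] for i in indices) / len(indices) for d in range(dim)]


def rough_kmeans(points, centroids, w_lower=0.7, threshold=1.0, iterations=20):
    """Illustrative rough K-means: returns (lower, upper, centroids).

    lower[j] and upper[j] hold the indices of the points in the lower and
    upper approximation of cluster j. Parameter values are assumptions.
    """
    w_upper = 1.0 - w_lower
    k = len(centroids)
    centroids = [list(c) for c in centroids]
    lower, upper = [], []
    for _ in range(iterations):
        lower = [[] for _ in range(k)]
        upper = [[] for _ in range(k)]
        for idx, p in enumerate(points):
            dists = [math.dist(p, c) for c in centroids]
            nearest = min(range(k), key=dists.__getitem__)
            # Other clusters whose centroid is almost as close as the nearest one.
            close = [j for j in range(k)
                     if j != nearest and dists[j] - dists[nearest] <= threshold]
            if close:
                # Ambiguous point: upper approximations only, no lower approximation.
                for j in [nearest] + close:
                    upper[j].append(idx)
            else:
                # Unambiguous point: lower (and therefore upper) approximation.
                lower[nearest].append(idx)
                upper[nearest].append(idx)
        new_centroids = []
        for j in range(k):
            in_lower = set(lower[j])
            boundary = [i for i in upper[j] if i not in in_lower]
            if lower[j] and boundary:
                lo, bd = mean(points, lower[j]), mean(points, boundary)
                new_centroids.append(
                    [w_lower * a + w_upper * b for a, b in zip(lo, bd)])
            elif lower[j]:
                new_centroids.append(mean(points, lower[j]))
            elif boundary:
                new_centroids.append(mean(points, boundary))
            else:
                new_centroids.append(centroids[j])  # empty cluster: keep old centroid
        if new_centroids == centroids:
            break  # assignments can no longer change
        centroids = new_centroids
    return lower, upper, centroids


if __name__ == "__main__":
    # Hypothetical data: two tight groups and one point halfway between them.
    data = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0), (5.0, 0.5)]
    lower, upper, cents = rough_kmeans(data, [(0.0, 0.0), (10.0, 0.0)])
    print("lower:", lower)
    print("upper:", upper)
```

In this sketch, a point that is nearly equidistant from two centroids lands in both upper approximations and in neither lower approximation, which is one way to model clusters without crisp boundaries. Every point still belongs to at most one lower approximation, and each lower approximation is contained in its upper approximation.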
Cite this article
Lingras, P., West, C. Interval Set Clustering of Web Users with Rough K-Means. Journal of Intelligent Information Systems 23, 5–16 (2004). https://doi.org/10.1023/B:JIIS.0000029668.88665.1a
DOI: https://doi.org/10.1023/B:JIIS.0000029668.88665.1a