Abstract
In a data stream environment, most conventional clustering algorithms are not sufficiently efficient, since large volumes of data arrive continuously and the data points evolve over time. Clustering time-evolving metric data and clustering time-evolving categorical data have each been well explored in recent years, but clustering mixed-type time-evolving data remains challenging because of the gap between the structures of metric and categorical attributes. In this paper, we devise a generalized framework, termed Equi-Clustream, to dynamically cluster mixed-type time-evolving data. It comprises three algorithms: a Hybrid Drifting Concept Detection Algorithm that detects a drifting concept between the current sliding window and the previous sliding window; a Hybrid Data Labeling Algorithm that assigns an appropriate cluster label to each data vector of the current non-drifting window based on the clustering result of the previous sliding window; and a visualization algorithm that analyses the relationship between clusters at different timestamps and visualizes their evolving trends. The efficacy of the proposed framework is shown by experiments on synthetic and real-world datasets.
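To make the window-to-window workflow concrete, the following is a minimal sketch of how labeling and drift detection over sliding windows could fit together. It is not the Equi-Clustream algorithms themselves, whose details the abstract does not give. It assumes a k-prototypes-style dissimilarity (squared Euclidean distance on numeric attributes plus a weighted count of categorical mismatches) and a simple outlier-ratio test for drift. All names and parameters (`Prototype`, `mixed_distance`, `label_window`, `is_drifting`, `radius`, `gamma`, `max_outlier_ratio`) are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence, Tuple

# A mixed-type record: (numeric attributes, categorical attributes).
Point = Tuple[Sequence[float], Sequence[str]]


@dataclass
class Prototype:
    """Hypothetical cluster summary carried over from the previous window."""
    numeric: Sequence[float]      # e.g. per-attribute mean
    categorical: Sequence[str]    # e.g. per-attribute mode


def mixed_distance(p: Point, proto: Prototype, gamma: float = 1.0) -> float:
    """Assumed dissimilarity: squared Euclidean + gamma * categorical mismatches."""
    num, cat = p
    d_num = sum((a - b) ** 2 for a, b in zip(num, proto.numeric))
    d_cat = sum(a != b for a, b in zip(cat, proto.categorical))
    return d_num + gamma * d_cat


def label_window(window: Sequence[Point], prototypes: Sequence[Prototype],
                 radius: float, gamma: float = 1.0) -> List[Optional[int]]:
    """Assign each point to its nearest previous-window cluster.

    Points farther than `radius` from every prototype are marked None (outlier).
    """
    labels: List[Optional[int]] = []
    for p in window:
        dists = [mixed_distance(p, c, gamma) for c in prototypes]
        best = min(range(len(prototypes)), key=dists.__getitem__)
        labels.append(best if dists[best] <= radius else None)
    return labels


def is_drifting(labels: Sequence[Optional[int]],
                max_outlier_ratio: float = 0.3) -> bool:
    """Assumed drift test: too many points fit no previous-window cluster."""
    if not labels:
        return False
    outliers = sum(label is None for label in labels)
    return outliers / len(labels) > max_outlier_ratio


if __name__ == "__main__":
    prototypes = [Prototype([0.0, 0.0], ["a"]), Prototype([5.0, 5.0], ["b"])]
    stable = [([0.1, -0.1], ["a"]), ([5.2, 4.9], ["b"]), ([0.0, 0.2], ["a"])]
    shifted = [([9.0, 9.0], ["c"]), ([9.5, 8.5], ["c"]), ([0.1, 0.0], ["a"])]

    for window in (stable, shifted):
        labels = label_window(window, prototypes, radius=1.5)
        print(labels, is_drifting(labels))
    # [0, 1, 0] False
    # [None, None, 0] True
```

In this sketch a non-drifting window simply inherits the previous cluster labels, while a drifting window would be re-clustered from scratch and its new prototypes passed on to the next window. The threshold values are arbitrary and would need tuning on real data.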
Cite this article
Sangam, R.S., Om, H. Equi-Clustream: a framework for clustering time evolving mixed data. Adv Data Anal Classif 12, 973–995 (2018). https://doi.org/10.1007/s11634-018-0316-3
DOI: https://doi.org/10.1007/s11634-018-0316-3