Abstract
Clustering is among the most popular families of data mining algorithms. Before clustering algorithms can be applied to a dataset, the data usually has to be preprocessed properly. Data preprocessing is a crucial, yet still neglected, step in data mining. Although preprocessing techniques and algorithms are well known, the preprocessing process is very complex and usually takes a lot of time. Rather than being handled systematically, preprocessing is usually undervalued, i.e. more emphasis is put on choosing the appropriate clustering algorithm and setting its parameters. In our opinion, this is not because preprocessing is less important, but because it is difficult to choose the best sequence of preprocessing algorithms. We argue that it is important to better standardize this process so that it is performed efficiently. Therefore, this paper proposes a generic framework for data preprocessing. It is based on a survey of data mining experts, as well as a review of literature and software. The framework enables the pipelining of preprocessing algorithms and methods, which facilitates further automated preprocessing design and the selection of a suitable preprocessing stream. The proposed framework is easily extensible, so it can be applied to other data mining algorithm families that have their own idiosyncrasies.
Cite this article
Kirchner, K., Zec, J. & Delibašić, B. Facilitating data preprocessing by a generic framework: a proposal for clustering. Artif Intell Rev 45, 271–297 (2016). https://doi.org/10.1007/s10462-015-9446-6
DOI: https://doi.org/10.1007/s10462-015-9446-6