Skip to main content
Log in

Facilitating data preprocessing by a generic framework: a proposal for clustering

  • Published:
Artificial Intelligence Review Aims and scope Submit manuscript

Abstract

Clustering is among the most popular data mining algorithm families. Before applying clustering algorithms to datasets, it is usually necessary to preprocess the data properly. Data preprocessing is a crucial, still neglected step in data mining. Although preprocessing techniques and algorithms are well-known, the preprocessing process is very complex and takes usually a lot of time. Instead of handling preprocessing more systematically, it is usually undervalued, i.e. more emphasis is put on choosing the appropriate clustering algorithm and setting its parameters. In our opinion, this is not because preprocessing is less important, but because it is difficult to choose the best sequence of preprocessing algorithms. We argue that it is important to better standardize this process so it is performed efficiently. Therefore, this paper proposes a generic framework for data preprocessing. It is based on a survey with data mining experts, as well as a literature and software review. The framework enables pipelining preprocessing algorithms and methods which facilitate further automated preprocessing design and the selection of a suitable preprocessing stream. The proposed framework is easily extendible, so it can be applied to other data mining algorithm families that have their own idiosyncrasies.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Fig. 1
Fig. 2
Fig. 3
Fig. 4

Similar content being viewed by others

References

  • Ankerst M, Breunig MM, Kriegel H-P (1999) OPTICS: ordering points to identify the clustering structure. In: ACM, Sigmod record, pp 49–60

  • Belkin M, Niyogi P (2002) Laplacian eigenmaps and spectral techniques for embedding and clustering. Adv Neural Inf Process Syst 14:585–591

    Google Scholar 

  • Berkhin P (2006) A survey of clustering data mining techniques. In: Grouping multidimensional data. Springer, Berlin, pp 25–71

  • Bernstein A, Provost F, Hill S (2005) Toward intelligent assistance for a data mining process: an ontology-based approach for cost-sensitive classification. IEEE Trans Knowl Data Eng 17(4):503–518

    Article  Google Scholar 

  • Berthold MR, Cebron N, Dill F, Gabriel TR, Kötter T, Meinl T, Ohl P, Sieb C, Thiel K, Wiswedel B (2008) KNIME: the Konstanz information miner. In: Data analysis, machine learning and applications. Springer, Berlin, pp 319–326

  • Bezdek JC (1981) Pattern recognition with fuzzy objective function algorithms. Plenum Press, New York

    Book  MATH  Google Scholar 

  • Chakraborty S, Nagwani NK (2011) Analysis and study of incremental DBSCAN clustering algorithm. IJECBS 1(2). http://www.ijecbs.com/July2011/44.pdf

  • Chan C, Batur C, Sirnivasan A (1991) Determination of quantization intervals in rule based model for dynamic. In: Proceedings of the IEEE conference on systems, pp 1719–1723

  • Chapman P, Clinton J, Kerber R, Khabaza T, Reinartz T, Shearer C, Wirth R (2000) CRISP-DM 1.0 Step-by-step data mining guide. SPSS Inc. ftp://ftp.software.ibm.com/software/analytics/spss/support/Modeler/Documentation/14/UserManual/CRISP-DM.pdf

  • Chickering D, Meek C, and Rounthwaite R (2001) Efficient determination of dynamic split points in a decision tree. In: Proceedings 2001 IEEE international conference on data mining, pp 91–98

  • Cox T, Cox M (2000) Multidimensional scaling. Chapman & Hall, London

    Google Scholar 

  • Delibašić B, Jovanović M, Vukićević M, Suknović M, Obradović Z (2011) Component-based decision trees for classification. Mach Learn 15(5):327–334

    Google Scholar 

  • Delibašić B, Kirchner K, Ruhland J (2008) A pattern based data mining approach. Springer, Berlin

    Google Scholar 

  • Delibašić B, Kirchner K, Ruhland J, Jovanović M, Vukićević M (2009) Reusable components for partitioning clustering algorithms. Artif Intell Rev 32(1–4):59–75

    Google Scholar 

  • Delibašić B, Vukićević M, Jovanović M, Kirchner K, Ruhland J, Suknović M (2012) An architecture for component-based design of representative-based clustering algorithms. Data Knowl Eng 75:78–98

    Article  Google Scholar 

  • Demers D, Cottrell G, Diego S, Jolla L (1993) Non linear dimensionality reduction. Adva Neural Inf Process Syst 5:580–587

    Google Scholar 

  • Demšar J, Zupan B, Leban G, Curk T (2004) Orange: from experimental machine learning to interactive data mining. In: PKDD 2004. Knowledge discovery in databases. Springer, Berlin, pp 537–539

  • Dijkstra E (1959) A note on two problems in connexion with graphs. Numer Math 1(1):269–271

    Article  MathSciNet  MATH  Google Scholar 

  • Donoho D, Grimes C (2005) New locally linear embedding techniques for high-dimensional data. In: Proceedings of the National Academy of Sciences, pp 7426–7431

  • Dougherty J, Kohavi R, and Sahami M (1995) Supervised and unsupervised discretization of continuous features. ICML, pp 194–202

  • Elomaa T, Rousu J (2004) Efficient multisplitting revisited: optima-preserving. Data Min Knowl Disc 8:97–126

    Article  MathSciNet  Google Scholar 

  • Enders C (2010) Applied missing data analysis. Guilford Press, New York

    Google Scholar 

  • Engels R, Theusinger C (1998) Using a data metric for preprocessing advice for data mining applications. In: Machine learning, pp 430–434

  • Ester M, Kriegel H-P, Sander J, Xu X (1996) A density-based algorithm for discovering clusters in large spatial databases with noise. In: Proceedings of the 2nd international conference on knowledge discovery and data mining, KDD, pp 226–231

  • Fayyad U, Piatetsky-Shapiro G, Smyth P, Uthurusamy R (1996) Advances in knowledge discovery and data mining. American Association for Artificial Intelligence

  • Fodor I (2002) A survey of dimension reduction techniques. Technical report 1, U.S. Department of Energy

  • Fong M (2007) Dimension reduction on hyperspectral images. Technical report Figure 1, UCLA Department of Mathematics, Los Angeles

  • Foss A, Lee C-H, Wang W (2002) On data clustering analysis: scalability, constraints and validation. Adv Knowl Discov Data Min 2336:28–39

    Article  Google Scholar 

  • Frank E, Hall M, Holmes G, Kirkby R, Pfahringer B, Witten IH, Trigg L (2005) Weka. In: Data mining and knowledge discovery handbook. Springer, US, pp 1305–1314

  • García S, Luengo J, Herrera F (2015) Instance selection. Data Preprocess Data Min 72:195–243

    Google Scholar 

  • Grira N, Crucianu M, Boujemaa N, Rocquencourt I (2005) Unsupervised and semi-supervised clustering: a brief survey. Technical report, Report of the MUSCLE European Network of Excellence

  • Guha S, Rastogi R, Shim K (1998) CURE: an efficient clustering algorithm for large databases. In: Proceedings of the international conference on management of data, (SIGMOD), pp 73–84, Seattle. ACM Press

  • Guha S, Rastogi R, Shim K (1999) ROCK: a robust clustering algorithm for categorical attributes. In: 15th international conference on data engineering (ICDE’99), pp 345–366

  • Gul N, Barki I, Akhtar N (2009) MFP: a mechanism for determining associated patterns of stock. Architecture, pp 1–7

  • Han J, Kamber M (2011) Data mining: concepts and techniques. Morgan Kaufmann, Los Altos

    Google Scholar 

  • IBM Director of Licensing, I. C. (2012) IBM SPSS 21 Information Center

  • Jain A, Dubes R (1988) Algorithms for clustering data. Prentice-Hall, Englewood Cliffs

    MATH  Google Scholar 

  • Jain AK (2010) Data clustering: 50 years beyond K-means. Pattern Recognit Lett 31(8):651–666

    Article  Google Scholar 

  • Jin R, Breitbart Y, Muoh C (2008) Data discretization unification. Knowl Inf Syst 19(1):1–29

    Article  Google Scholar 

  • Jin W, Tung AKH, Han J (2001) Mining top-n local outliers in large databases. In: Proceedings of the seventh ACM SIGKDD international conference on knowledge discovery and data mining—KDD ’01, pp 293–298

  • Jordan A, Ng M, Weiss Y (2001) On spectral clustering: analysis and an algorithm. Adv Neural Inf Process Syst 14:849–856

    Google Scholar 

  • Jovanović M, Delibašić B, Vukićević M, Suknović M, Martić M et al (2014) Evolutionary approach for automated component-based decision tree algorithm design. Intell Data Anal 18:25–42

    Google Scholar 

  • Kambhatla N, Leen TK (1997) Dimension reduction by local principal component analysis. Neural Comput 9(7):1493–1516

    Article  Google Scholar 

  • Kantardzic M (2011) Data mining: concepts, models, methods, and algorithms, 2nd edn. Wiley, New York

    Book  Google Scholar 

  • Kaufman L, Rousseeuw P (1987) Clustering by means of medoids. North-Holland, Amsterdam

    Google Scholar 

  • Kaufman L, Rousseeuw PJ (1990) Clustering large applications (Program CLARA). In: Finding groups in data: an introduction to cluster analysis. Wiley, Hoboken

  • Khabaza T, Shearer C (1995) Data mining with Clementine. IEE colloquium on knowledge discovery in databases, IEE Digest No. 1995/021(B), London

  • Kim J, Curry J (1977) The treatment of missing data in multivariate analysis. Sociol Methods Res 6:215–240

    Article  Google Scholar 

  • Kirchner K, Delibašić B, Vukićević M (2010) Projektovanje procesa klasterovanja pomoću paterna (Designing the clustering process with reusable components). InfoM 34:23–29

    Google Scholar 

  • Kurgan LA, Musilek P (2006) A survey of Knowledge Discovery and Data Mining process models. Knowl Eng Rev 21(01):1

    Article  Google Scholar 

  • Law M, Jain A (2006) Incremental nonlinear dimensionality reduction by manifold learning. IEEE Trans Pattern Anal Mach Intell 28:377–391

    Article  Google Scholar 

  • Leyton-Brown K, Nudelman E, Andrew G, Mcfadden J, Shoham Y (2003) A portfolio approach to algorithm selection. IJCAI 1543:6–7

    Google Scholar 

  • Li D, Zhong C, Zhang L (2010) Fuzzy c-means clustering of partially missing data sets based on statistical representation. In: 2010 seventh international conference on fuzzy systems and knowledge discovery (FSKD 2010), pp 460–464

  • Lichman M (2013) UCI machine learning repository. http://archive.ics.uci.edu/ml. University of California, School of Information and Computer Science, Irvine, CA

  • Lu H, Plataniotis KNK, Venetsanopoulos AN (2008) MPCA: multilinear principal component analysis of tensor objects. IEEE Trans Neural Netw 19(1):18–39

    Article  Google Scholar 

  • MacQueen J (1967) Some methods for classification and analysis of multivariate observations. In: Proceedings of the 5th Berkeley symposium on mathematics, pp 281–297

  • Mierswa I, Wurst M, Klinkenberg R, Scholz M, Euler T (2006) YALE: rapid prototyping for complex data mining tasks. In: Proceedings of the 12th ACM SIGKDD international conference on knowledge discovery and data mining, pp 1–6

  • Milligan GW, Martha C (1987) Methodology review: clustering methods. Appl Psychol Meas 11:329–354

    Article  Google Scholar 

  • Othman Z, Bakar A, Hamdan A, Omar K, Shuib M, Liyana N (2007) Agent based preprocessing. In: Intelligent and advanced systems, pp 219–223

  • Pan J, Yang Q, Yang Y, Li L, Li F, Li G (2007) Cost-sensitive-data preprocessing for mining customer relationship management databases. Intell Syst IEEE 22:46–51

    Article  Google Scholar 

  • Pelleg D, Moore AW (2000) X-means: extending k-means with efficient estimation of the number of clusters. In: ICML, pp 727–734

  • Rakotomalala R (2005) TANAGRA: a free software for research and academic purposes. In: Proceedings of EGC, vol 2. pp 697–702

  • Raymond TN, Han JW (1994) Efficient and effective clustering methods for spatial data mining. In: Proceedings of the 20th international conference on very large data bases, pp 144–155

  • R Development Core Team (2008) R: a language and environment for statistical computing. R Foundation for Statistical Computing, Vienna. ISBN 3-900051-07-0. http://www.Rproject.org

  • Rexer K (2013) 6th Rexer Analytics Data Miner Survey. Technical report, Rexer Analytics

  • Rice J (1975) The algorithm selection problem. Adv Comput 15:65–118

    Article  Google Scholar 

  • Rousseeuw P (1987) Silhouettes: a graphical aid to the interpretation and validation of cluster analysis. J Comput Appl Math 20:53–65

    Article  MATH  Google Scholar 

  • Sametinger J (1997) Software engineering with reusable components. Springer, Berlin

    Book  MATH  Google Scholar 

  • SAS Institute (2008) SAS Enterprise Miner SEMMA

  • Saul LK, Weinberger KQ, Lee DD (2006) Spectral methods for dimensionality reduction. MIT Press, Cambridge

    Google Scholar 

  • Schwarz G (2008) Estimating the dimension of a model. Ann Stat 6(2):461–464

    Article  Google Scholar 

  • Shawe-Taylor J, Christianini N (2004) Kernel methods for pattern analysis. Cambridge University Press, Cambridge

    Book  Google Scholar 

  • Smith-Miles KA (2008) Cross-disciplinary perspectives on meta-learning for algorithm selection. ACM Comput Surv 41(1):6

    Article  Google Scholar 

  • Sonnenburg S, Braun M, Ong CS, Bengio S, Bottou L, Holmes G, Lecun Y, Müller K-R, Raetsch G, Schölkopf B, Weston J, Williamson B (2007) The need for open source software in machine learning. J Mach Learn Res 8:2443–2466

    Google Scholar 

  • Tenenbaum JB, Silva VD, Langford JC (2000) A global geometric framework for nonlinear dimensionality reduction. Science 290(5500):2319–2323

    Article  Google Scholar 

  • Teng L, Li H, Fu X, Chen W, Shen I (2005) Dimension reduction of microarray data based on local tangent space alignment. In: Proceedings of the 4th IEEE international conference on cognitive informatics, pp 154–159

  • Valarmathie P, Dinakaran K (2009) An increased performance of clustering high dimensional data through dimensionality. J Theor Appl Inf Technol 13:731–733

    Google Scholar 

  • Van de Merckt T (1993) Decision trees in numerical attribute spaces. In: 13th international joint conference on artificial intelligence

  • Van Der Maaten LJP, Postma EO, Herik HJVD (2008) Dimensionality reduction: a comparative review. J Mach Learn Res 10(January):66–71

  • Vannucci M, Colla V (2004) Meaningful discretization of continuous features for association rules mining by means of a SOM. European Symposium on Artificial Neural Networks, Bruges

  • Vinh NX, Bailey J (2009) Information theoretic measures for clusterings comparison: is a correction for chance necessary? J Mach Learn Res 11:2837–2854

    MathSciNet  Google Scholar 

  • Vukićević M, Kirchner K, Delibašić B, Jovanovı’c M, Ruhland J, Suknović M (2012) Finding best algorithmic components for clustering microarray data. Knowl Inf Syst 35(11):111–130

    Google Scholar 

  • Weiss Y (1999) Segmentation using eigenvectors: a unifying view. In: Proceedings of the IEEE international conference on computer vision. IEEE Computer Society Press, p 2

  • Wilks S (1932) Moments and distributions of estimates of population parameters from fragmentary samples. Ann Math Stat 3:163–195

    Article  Google Scholar 

  • Wirth R, Hipp J (2000) CRISP-DM: towards a standard process model for data mining. In: Proceedings of the 4th international conference on the practical applications of knowledge discovery and data mining, pp 29–39

  • Wong A, Chiu D (1987) Synthesizing statistical knowledge from incomplete mixed-mode data. IEEE Trans Pattern Anal Mach Intell 9:796–805

    Article  Google Scholar 

  • Wong AK, Wang DC (1979) DECA: a discrete-valued data clustering algorithm. IEEE Trans Pattern Anal Mach Intell 1(4):342–349

    Article  MATH  Google Scholar 

  • Wu J, Song C-H, Kong JM, Lee WD (2007) Extended mean field annealing for clustering incomplete data. In: 2007 international symposium on information technology convergence (ISITC 2007). IEEE, pp 8–12

  • Xie X, Beni G (1991) A validity measure for fuzzy clustering. IEEE Trans Pattern Anal Mach Intell 13(8):841–847

    Article  Google Scholar 

  • Xu R, Wunsch D (2005) Survey of clustering algorithms. IEEE Trans Neural Netw 16:645–678

    Article  Google Scholar 

  • Zhang T, Ramakrishnan R, Livny M (1996) BIRCH: an efficient data clustering method for very large databases. In: Proceedings of the international conference on management of data, (SIGMOD), pp 103–114

  • Zhang Z, Zha H (2004) Principal manifolds and nonlinear dimensionality reduction via local tangent space alignment. SIAM J Sci Comput 26:313–338

    Article  MathSciNet  MATH  Google Scholar 

Download references

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Kathrin Kirchner.

Rights and permissions

Reprints and permissions

About this article

Check for updates. Verify currency and authenticity via CrossMark

Cite this article

Kirchner, K., Zec, J. & Delibašić, B. Facilitating data preprocessing by a generic framework: a proposal for clustering. Artif Intell Rev 45, 271–297 (2016). https://doi.org/10.1007/s10462-015-9446-6

Download citation

  • Published:

  • Issue Date:

  • DOI: https://doi.org/10.1007/s10462-015-9446-6

Keywords

Navigation