Facilitating data preprocessing by a generic framework: a proposal for clustering

Kirchner, Kathrin; Zec, Jelena; Delibašić, Boris

doi:10.1007/s10462-015-9446-6

Facilitating data preprocessing by a generic framework: a proposal for clustering

Published: 17 October 2015

Volume 45, pages 271–297, (2016)
Cite this article

Artificial Intelligence Review Aims and scope Submit manuscript

1250 Accesses
15 Citations
Explore all metrics

Abstract

Clustering is among the most popular data mining algorithm families. Before applying clustering algorithms to datasets, it is usually necessary to preprocess the data properly. Data preprocessing is a crucial, still neglected step in data mining. Although preprocessing techniques and algorithms are well-known, the preprocessing process is very complex and takes usually a lot of time. Instead of handling preprocessing more systematically, it is usually undervalued, i.e. more emphasis is put on choosing the appropriate clustering algorithm and setting its parameters. In our opinion, this is not because preprocessing is less important, but because it is difficult to choose the best sequence of preprocessing algorithms. We argue that it is important to better standardize this process so it is performed efficiently. Therefore, this paper proposes a generic framework for data preprocessing. It is based on a survey with data mining experts, as well as a literature and software review. The framework enables pipelining preprocessing algorithms and methods which facilitate further automated preprocessing design and the selection of a suitable preprocessing stream. The proposed framework is easily extendible, so it can be applied to other data mining algorithm families that have their own idiosyncrasies.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Log in via an institution

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

A Generalized Study on Data Mining and Clustering Algorithms

Effective Data Clustering Algorithms

Big Data Clustering Techniques: Recent Advances and Survey

References

Ankerst M, Breunig MM, Kriegel H-P (1999) OPTICS: ordering points to identify the clustering structure. In: ACM, Sigmod record, pp 49–60
Belkin M, Niyogi P (2002) Laplacian eigenmaps and spectral techniques for embedding and clustering. Adv Neural Inf Process Syst 14:585–591
Google Scholar
Berkhin P (2006) A survey of clustering data mining techniques. In: Grouping multidimensional data. Springer, Berlin, pp 25–71
Bernstein A, Provost F, Hill S (2005) Toward intelligent assistance for a data mining process: an ontology-based approach for cost-sensitive classification. IEEE Trans Knowl Data Eng 17(4):503–518
Article Google Scholar
Berthold MR, Cebron N, Dill F, Gabriel TR, Kötter T, Meinl T, Ohl P, Sieb C, Thiel K, Wiswedel B (2008) KNIME: the Konstanz information miner. In: Data analysis, machine learning and applications. Springer, Berlin, pp 319–326
Bezdek JC (1981) Pattern recognition with fuzzy objective function algorithms. Plenum Press, New York
Book MATH Google Scholar
Chakraborty S, Nagwani NK (2011) Analysis and study of incremental DBSCAN clustering algorithm. IJECBS 1(2). http://www.ijecbs.com/July2011/44.pdf
Chan C, Batur C, Sirnivasan A (1991) Determination of quantization intervals in rule based model for dynamic. In: Proceedings of the IEEE conference on systems, pp 1719–1723
Chapman P, Clinton J, Kerber R, Khabaza T, Reinartz T, Shearer C, Wirth R (2000) CRISP-DM 1.0 Step-by-step data mining guide. SPSS Inc. ftp://ftp.software.ibm.com/software/analytics/spss/support/Modeler/Documentation/14/UserManual/CRISP-DM.pdf
Chickering D, Meek C, and Rounthwaite R (2001) Efficient determination of dynamic split points in a decision tree. In: Proceedings 2001 IEEE international conference on data mining, pp 91–98
Cox T, Cox M (2000) Multidimensional scaling. Chapman & Hall, London
Google Scholar
Delibašić B, Jovanović M, Vukićević M, Suknović M, Obradović Z (2011) Component-based decision trees for classification. Mach Learn 15(5):327–334
Google Scholar
Delibašić B, Kirchner K, Ruhland J (2008) A pattern based data mining approach. Springer, Berlin
Google Scholar
Delibašić B, Kirchner K, Ruhland J, Jovanović M, Vukićević M (2009) Reusable components for partitioning clustering algorithms. Artif Intell Rev 32(1–4):59–75
Google Scholar
Delibašić B, Vukićević M, Jovanović M, Kirchner K, Ruhland J, Suknović M (2012) An architecture for component-based design of representative-based clustering algorithms. Data Knowl Eng 75:78–98
Article Google Scholar
Demers D, Cottrell G, Diego S, Jolla L (1993) Non linear dimensionality reduction. Adva Neural Inf Process Syst 5:580–587
Google Scholar
Demšar J, Zupan B, Leban G, Curk T (2004) Orange: from experimental machine learning to interactive data mining. In: PKDD 2004. Knowledge discovery in databases. Springer, Berlin, pp 537–539
Dijkstra E (1959) A note on two problems in connexion with graphs. Numer Math 1(1):269–271
Article MathSciNet MATH Google Scholar
Donoho D, Grimes C (2005) New locally linear embedding techniques for high-dimensional data. In: Proceedings of the National Academy of Sciences, pp 7426–7431
Dougherty J, Kohavi R, and Sahami M (1995) Supervised and unsupervised discretization of continuous features. ICML, pp 194–202
Elomaa T, Rousu J (2004) Efficient multisplitting revisited: optima-preserving. Data Min Knowl Disc 8:97–126
Article MathSciNet Google Scholar
Enders C (2010) Applied missing data analysis. Guilford Press, New York
Google Scholar
Engels R, Theusinger C (1998) Using a data metric for preprocessing advice for data mining applications. In: Machine learning, pp 430–434
Ester M, Kriegel H-P, Sander J, Xu X (1996) A density-based algorithm for discovering clusters in large spatial databases with noise. In: Proceedings of the 2nd international conference on knowledge discovery and data mining, KDD, pp 226–231
Fayyad U, Piatetsky-Shapiro G, Smyth P, Uthurusamy R (1996) Advances in knowledge discovery and data mining. American Association for Artificial Intelligence
Fodor I (2002) A survey of dimension reduction techniques. Technical report 1, U.S. Department of Energy
Fong M (2007) Dimension reduction on hyperspectral images. Technical report Figure 1, UCLA Department of Mathematics, Los Angeles
Foss A, Lee C-H, Wang W (2002) On data clustering analysis: scalability, constraints and validation. Adv Knowl Discov Data Min 2336:28–39
Article Google Scholar
Frank E, Hall M, Holmes G, Kirkby R, Pfahringer B, Witten IH, Trigg L (2005) Weka. In: Data mining and knowledge discovery handbook. Springer, US, pp 1305–1314
García S, Luengo J, Herrera F (2015) Instance selection. Data Preprocess Data Min 72:195–243
Google Scholar
Grira N, Crucianu M, Boujemaa N, Rocquencourt I (2005) Unsupervised and semi-supervised clustering: a brief survey. Technical report, Report of the MUSCLE European Network of Excellence
Guha S, Rastogi R, Shim K (1998) CURE: an efficient clustering algorithm for large databases. In: Proceedings of the international conference on management of data, (SIGMOD), pp 73–84, Seattle. ACM Press
Guha S, Rastogi R, Shim K (1999) ROCK: a robust clustering algorithm for categorical attributes. In: 15th international conference on data engineering (ICDE’99), pp 345–366
Gul N, Barki I, Akhtar N (2009) MFP: a mechanism for determining associated patterns of stock. Architecture, pp 1–7
Han J, Kamber M (2011) Data mining: concepts and techniques. Morgan Kaufmann, Los Altos
Google Scholar
IBM Director of Licensing, I. C. (2012) IBM SPSS 21 Information Center
Jain A, Dubes R (1988) Algorithms for clustering data. Prentice-Hall, Englewood Cliffs
MATH Google Scholar
Jain AK (2010) Data clustering: 50 years beyond K-means. Pattern Recognit Lett 31(8):651–666
Article Google Scholar
Jin R, Breitbart Y, Muoh C (2008) Data discretization unification. Knowl Inf Syst 19(1):1–29
Article Google Scholar
Jin W, Tung AKH, Han J (2001) Mining top-n local outliers in large databases. In: Proceedings of the seventh ACM SIGKDD international conference on knowledge discovery and data mining—KDD ’01, pp 293–298
Jordan A, Ng M, Weiss Y (2001) On spectral clustering: analysis and an algorithm. Adv Neural Inf Process Syst 14:849–856
Google Scholar
Jovanović M, Delibašić B, Vukićević M, Suknović M, Martić M et al (2014) Evolutionary approach for automated component-based decision tree algorithm design. Intell Data Anal 18:25–42
Google Scholar
Kambhatla N, Leen TK (1997) Dimension reduction by local principal component analysis. Neural Comput 9(7):1493–1516
Article Google Scholar
Kantardzic M (2011) Data mining: concepts, models, methods, and algorithms, 2nd edn. Wiley, New York
Book Google Scholar
Kaufman L, Rousseeuw P (1987) Clustering by means of medoids. North-Holland, Amsterdam
Google Scholar
Kaufman L, Rousseeuw PJ (1990) Clustering large applications (Program CLARA). In: Finding groups in data: an introduction to cluster analysis. Wiley, Hoboken
Khabaza T, Shearer C (1995) Data mining with Clementine. IEE colloquium on knowledge discovery in databases, IEE Digest No. 1995/021(B), London
Kim J, Curry J (1977) The treatment of missing data in multivariate analysis. Sociol Methods Res 6:215–240
Article Google Scholar
Kirchner K, Delibašić B, Vukićević M (2010) Projektovanje procesa klasterovanja pomoću paterna (Designing the clustering process with reusable components). InfoM 34:23–29
Google Scholar
Kurgan LA, Musilek P (2006) A survey of Knowledge Discovery and Data Mining process models. Knowl Eng Rev 21(01):1
Article Google Scholar
Law M, Jain A (2006) Incremental nonlinear dimensionality reduction by manifold learning. IEEE Trans Pattern Anal Mach Intell 28:377–391
Article Google Scholar
Leyton-Brown K, Nudelman E, Andrew G, Mcfadden J, Shoham Y (2003) A portfolio approach to algorithm selection. IJCAI 1543:6–7
Google Scholar
Li D, Zhong C, Zhang L (2010) Fuzzy c-means clustering of partially missing data sets based on statistical representation. In: 2010 seventh international conference on fuzzy systems and knowledge discovery (FSKD 2010), pp 460–464
Lichman M (2013) UCI machine learning repository. http://archive.ics.uci.edu/ml. University of California, School of Information and Computer Science, Irvine, CA
Lu H, Plataniotis KNK, Venetsanopoulos AN (2008) MPCA: multilinear principal component analysis of tensor objects. IEEE Trans Neural Netw 19(1):18–39
Article Google Scholar
MacQueen J (1967) Some methods for classification and analysis of multivariate observations. In: Proceedings of the 5th Berkeley symposium on mathematics, pp 281–297
Mierswa I, Wurst M, Klinkenberg R, Scholz M, Euler T (2006) YALE: rapid prototyping for complex data mining tasks. In: Proceedings of the 12th ACM SIGKDD international conference on knowledge discovery and data mining, pp 1–6
Milligan GW, Martha C (1987) Methodology review: clustering methods. Appl Psychol Meas 11:329–354
Article Google Scholar
Othman Z, Bakar A, Hamdan A, Omar K, Shuib M, Liyana N (2007) Agent based preprocessing. In: Intelligent and advanced systems, pp 219–223
Pan J, Yang Q, Yang Y, Li L, Li F, Li G (2007) Cost-sensitive-data preprocessing for mining customer relationship management databases. Intell Syst IEEE 22:46–51
Article Google Scholar
Pelleg D, Moore AW (2000) X-means: extending k-means with efficient estimation of the number of clusters. In: ICML, pp 727–734
Rakotomalala R (2005) TANAGRA: a free software for research and academic purposes. In: Proceedings of EGC, vol 2. pp 697–702
Raymond TN, Han JW (1994) Efficient and effective clustering methods for spatial data mining. In: Proceedings of the 20th international conference on very large data bases, pp 144–155
R Development Core Team (2008) R: a language and environment for statistical computing. R Foundation for Statistical Computing, Vienna. ISBN 3-900051-07-0. http://www.Rproject.org
Rexer K (2013) 6th Rexer Analytics Data Miner Survey. Technical report, Rexer Analytics
Rice J (1975) The algorithm selection problem. Adv Comput 15:65–118
Article Google Scholar
Rousseeuw P (1987) Silhouettes: a graphical aid to the interpretation and validation of cluster analysis. J Comput Appl Math 20:53–65
Article MATH Google Scholar
Sametinger J (1997) Software engineering with reusable components. Springer, Berlin
Book MATH Google Scholar
SAS Institute (2008) SAS Enterprise Miner SEMMA
Saul LK, Weinberger KQ, Lee DD (2006) Spectral methods for dimensionality reduction. MIT Press, Cambridge
Google Scholar
Schwarz G (2008) Estimating the dimension of a model. Ann Stat 6(2):461–464
Article Google Scholar
Shawe-Taylor J, Christianini N (2004) Kernel methods for pattern analysis. Cambridge University Press, Cambridge
Book Google Scholar
Smith-Miles KA (2008) Cross-disciplinary perspectives on meta-learning for algorithm selection. ACM Comput Surv 41(1):6
Article Google Scholar
Sonnenburg S, Braun M, Ong CS, Bengio S, Bottou L, Holmes G, Lecun Y, Müller K-R, Raetsch G, Schölkopf B, Weston J, Williamson B (2007) The need for open source software in machine learning. J Mach Learn Res 8:2443–2466
Google Scholar
Tenenbaum JB, Silva VD, Langford JC (2000) A global geometric framework for nonlinear dimensionality reduction. Science 290(5500):2319–2323
Article Google Scholar
Teng L, Li H, Fu X, Chen W, Shen I (2005) Dimension reduction of microarray data based on local tangent space alignment. In: Proceedings of the 4th IEEE international conference on cognitive informatics, pp 154–159
Valarmathie P, Dinakaran K (2009) An increased performance of clustering high dimensional data through dimensionality. J Theor Appl Inf Technol 13:731–733
Google Scholar
Van de Merckt T (1993) Decision trees in numerical attribute spaces. In: 13th international joint conference on artificial intelligence
Van Der Maaten LJP, Postma EO, Herik HJVD (2008) Dimensionality reduction: a comparative review. J Mach Learn Res 10(January):66–71
Vannucci M, Colla V (2004) Meaningful discretization of continuous features for association rules mining by means of a SOM. European Symposium on Artificial Neural Networks, Bruges
Vinh NX, Bailey J (2009) Information theoretic measures for clusterings comparison: is a correction for chance necessary? J Mach Learn Res 11:2837–2854
MathSciNet Google Scholar
Vukićević M, Kirchner K, Delibašić B, Jovanovı’c M, Ruhland J, Suknović M (2012) Finding best algorithmic components for clustering microarray data. Knowl Inf Syst 35(11):111–130
Google Scholar
Weiss Y (1999) Segmentation using eigenvectors: a unifying view. In: Proceedings of the IEEE international conference on computer vision. IEEE Computer Society Press, p 2
Wilks S (1932) Moments and distributions of estimates of population parameters from fragmentary samples. Ann Math Stat 3:163–195
Article Google Scholar
Wirth R, Hipp J (2000) CRISP-DM: towards a standard process model for data mining. In: Proceedings of the 4th international conference on the practical applications of knowledge discovery and data mining, pp 29–39
Wong A, Chiu D (1987) Synthesizing statistical knowledge from incomplete mixed-mode data. IEEE Trans Pattern Anal Mach Intell 9:796–805
Article Google Scholar
Wong AK, Wang DC (1979) DECA: a discrete-valued data clustering algorithm. IEEE Trans Pattern Anal Mach Intell 1(4):342–349
Article MATH Google Scholar
Wu J, Song C-H, Kong JM, Lee WD (2007) Extended mean field annealing for clustering incomplete data. In: 2007 international symposium on information technology convergence (ISITC 2007). IEEE, pp 8–12
Xie X, Beni G (1991) A validity measure for fuzzy clustering. IEEE Trans Pattern Anal Mach Intell 13(8):841–847
Article Google Scholar
Xu R, Wunsch D (2005) Survey of clustering algorithms. IEEE Trans Neural Netw 16:645–678
Article Google Scholar
Zhang T, Ramakrishnan R, Livny M (1996) BIRCH: an efficient data clustering method for very large databases. In: Proceedings of the international conference on management of data, (SIGMOD), pp 103–114
Zhang Z, Zha H (2004) Principal manifolds and nonlinear dimensionality reduction via local tangent space alignment. SIAM J Sci Comput 26:313–338
Article MathSciNet MATH Google Scholar

Download references

Author information

Authors and Affiliations

Berlin School of Economics and Law, Alt-Friedrichsfelde 60, 10315, Berlin, Germany
Kathrin Kirchner
Faculty of Organizational Sciences, University of Belgrade, Belgrade, Serbia
Jelena Zec & Boris Delibašić

Authors

Kathrin Kirchner
View author publications
You can also search for this author in PubMed Google Scholar
Jelena Zec
View author publications
You can also search for this author in PubMed Google Scholar
Boris Delibašić
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Kathrin Kirchner.

Rights and permissions

Reprints and permissions

About this article

Cite this article

Kirchner, K., Zec, J. & Delibašić, B. Facilitating data preprocessing by a generic framework: a proposal for clustering. Artif Intell Rev 45, 271–297 (2016). https://doi.org/10.1007/s10462-015-9446-6

Download citation

Published: 17 October 2015
Issue Date: March 2016
DOI: https://doi.org/10.1007/s10462-015-9446-6

Keywords

Access this article

Log in via an institution

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

Facilitating data preprocessing by a generic framework: a proposal for clustering

Abstract

Access this article

Similar content being viewed by others

A Generalized Study on Data Mining and Clustering Algorithms

Effective Data Clustering Algorithms

Big Data Clustering Techniques: Recent Advances and Survey

References

Author information

Authors and Affiliations

Corresponding author

Rights and permissions

About this article

Cite this article

Keywords

Navigation

Facilitating data preprocessing by a generic framework: a proposal for clustering

Abstract

Access this article

Similar content being viewed by others

A Generalized Study on Data Mining and Clustering Algorithms

Effective Data Clustering Algorithms

Big Data Clustering Techniques: Recent Advances and Survey

References

Author information

Authors and Affiliations

Corresponding author

Rights and permissions

About this article

Cite this article

Share this article

Keywords

Search

Navigation