Abstract
Clustering algorithms such as k-means, BIRCH, CLARANS and DBSCAN are designed to be scalable and to discover clusters in the full-dimensional space of a database. Nevertheless, their performance depends upon the size of the database. A database or data warehouse may store terabytes of data, and complex data analysis (mining) may take a very long time to run on such a dataset. To accelerate information processing, one has to obtain a reduced representation of the dataset that is much smaller in volume yet produces the same, or almost the same, analytical results. Reduced representations yield simplified models that are easier to interpret, avoid the curse of dimensionality and enhance generalization by reducing overfitting. Data reduction methods include data cube aggregation, attribute subset selection, fitting data into models, dimensionality reduction and concept hierarchies, among other approaches. On the other hand, data-dependent partitions, such as Gessaman's partition and tree-quantization partitions, allow different partitions of a dataset to be processed separately; hence parallel processing becomes an option for big data. Online analytical processing is a practical approach to multi-dimensional queries in database management. Feature selection may be regarded as a special case of a more general paradigm called structure learning, in which an outcome is associated with a set of attributes. It aims at selecting a minimum set of features such that the probability distribution of the classes given the values of those features is as close as possible to the distribution given the values of all features. A mutual information approach based upon representing complex datasets in a database as a minimal set of coherent attribute sets of reduced dimensions is herein proposed.
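The mutual-information criterion for feature selection described above can be sketched as follows. This is a minimal illustration on discrete data using the max-relevance idea (rank features by I(feature; class)); the dataset, feature names and greedy ranking are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch: rank discrete features by mutual information with the
# class label, I(X;C) = H(X) + H(C) - H(X,C), and keep the top k.
from collections import Counter
from math import log2

def entropy(values):
    """Shannon entropy H(X) of a sequence of discrete values."""
    n = len(values)
    return -sum((c / n) * log2(c / n) for c in Counter(values).values())

def mutual_information(xs, ys):
    """I(X;Y) = H(X) + H(Y) - H(X,Y) for paired discrete sequences."""
    return entropy(xs) + entropy(ys) - entropy(list(zip(xs, ys)))

def select_features(columns, labels, k):
    """Greedily pick the k features with the highest I(feature; class)."""
    scores = {name: mutual_information(col, labels)
              for name, col in columns.items()}
    return sorted(scores, key=scores.get, reverse=True)[:k]

# Toy dataset: f1 determines the class exactly, f2 is mostly noise.
data = {
    "f1": [0, 0, 1, 1, 0, 1],
    "f2": [1, 0, 1, 0, 0, 1],
}
labels = ["a", "a", "b", "b", "a", "b"]
print(select_features(data, labels, 1))  # -> ['f1']
```

Here f1 attains I(f1; C) = H(C), the maximum possible, so it is selected first; redundancy terms (as in mRMR) would be needed once features start to overlap.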
The novelty of the proposed approach lies in employing piecewise analysis of compact clusters in order to increase overall Shannon mutual information/entropy, as an alternative to conventional Classification and Regression Trees. Numerical data from a test-bed system for anomaly detection are provided to illustrate the approach.
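The data-dependent partitions mentioned in the abstract can be illustrated with Gessaman's statistically equivalent blocks: split the sample into k slabs along one axis, then split each slab into k cells along the next, so every cell holds roughly n/k² points. The 2-D restriction and the chunking rule below are simplifying assumptions for illustration only.

```python
# Minimal sketch of Gessaman's partition in 2-D: axis-aligned,
# data-dependent cells with (nearly) equal point counts.

def chunks(seq, k):
    """Split seq into k contiguous chunks of nearly equal size."""
    n = len(seq)
    out, start = [], 0
    for i in range(k):
        end = start + n // k + (1 if i < n % k else 0)
        out.append(seq[start:end])
        start = end
    return out

def gessaman_partition(points, k):
    """Partition 2-D points into k*k statistically equivalent cells."""
    slabs = chunks(sorted(points, key=lambda p: p[0]), k)
    return [cell
            for slab in slabs
            for cell in chunks(sorted(slab, key=lambda p: p[1]), k)]

points = [(x, y) for x in range(4) for y in range(4)]  # 16 grid points
cells = gessaman_partition(points, 2)
print([len(c) for c in cells])  # -> [4, 4, 4, 4]
```

Because every cell carries the same number of samples, per-cell statistics (e.g. local entropy or mutual-information estimates) are comparably reliable, and the cells can be processed in parallel, which is the point made above about big data.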
Notes
libvirt is an open-source API, daemon and management tool for managing platform virtualization.
One may see http://www.seccrit.eu//publications/presentatons for more details.
References
Agrawal R, Gehrke J, Gunopulos D, Raghavan P (1998) Automatic subspace clustering of high dimensional data for data mining applications. In: ACM SIGMOD international conference on management of data, Seattle, WA, USA, pp 94–105
Auffarth B, Lopez M, Cerquides J (2010) Comparison of redundancy and relevance measures for feature selection in tissue classification of CT images. In: Advances in Data Mining: Applications and Theoretical Aspects. Springer, Berlin, pp 248–262
Battiti R (1994) Using mutual information for selecting features in supervised neural net learning. IEEE Trans Neural Netw 5:537–550
Breiman L, Friedman J, Olshen R, Stone C (1984) Classification and Regression Trees. Wadsworth, Belmont, CA
Darbellay GA, Vajda I (1999) Estimation of the information by an adaptive partition of the observed space. IEEE Trans Inf Theory 45(4):1315–1321
Dash M, Choi K, Scheuermann P, Liu H (2002) Feature selection for clustering—a filter solution. In: Proceedings of the second international conference on data mining, Maebashi, Japan, pp 115–122
Devroye L, Gyorfi L, Lugosi G (1996) A probabilistic theory of pattern recognition. Springer, New York
Doquire G, Verleysen M (2011) Mutual information based feature selection for mixed data. In: ESANN 2011 Proceedings, European symposium on artificial neural networks, computational intelligence and machine learning. Bruges (Belgium). ISBN:978-2-87419-044-5
Ester M, Kriegel HP, Sander J, Xu X (1996) A density-based algorithm for discovering clusters in large spatial databases with noise. In: Proceedings of the 2nd international conference on knowledge discovery in databases and data mining, Portland, Oregon
Fraser AM, Swinney HL (1986) Independent coordinates for strange attractors from mutual information. Phys Rev A 33:1134–1140
Gencaga D, Malakar N, Lary DJ (2014) Survey on the estimation of mutual information methods as a measure of dependency versus correlation analysis. In: AIP conference proceedings, Canberra, ACT, Australia. https://doi.org/10.1063/1.4903714
Gessaman MP (1970) A consistent nonparametric multivariate density estimator based on statistically equivalent blocks. Ann Math Stat 41:1344–1346
Guha S, Rastogi R, Shim K (1999) ROCK: a robust clustering algorithm for categorical attributes. In: Proceedings of the ICDE, Sydney, NSW, Australia, pp 512–521
Kwak N, Choi CH (2002) Input feature selection for classification problems. IEEE Trans Neural Netw 13(1):143–159
Li W (1990) Mutual information functions versus correlation functions. J Stat Phys 60(5/6):823–837
Maji P, Garai P (2013) Fuzzy-rough simultaneous attribute selection and feature extraction algorithm. IEEE Trans Cybern 43(4):1166–1177
Miao DQ, Hu GR (1999) A heuristic algorithm for reduction of knowledge. J Comput Res Dev 36:681–684
Peng H, Long F, Ding C (2005) Feature selection based on mutual information: criteria of max-relevance and min-redundancy. IEEE Trans Pattern Anal Mach Intell 27:1226–1238
Shirazi N, Simpson S, Oechsner S, Mauthe A, Hutchison D (2015) A framework for resilience management in the cloud. Elektrotech Informationstech 132(2):122–132. https://doi.org/10.1007/s00502-015-0290-9
Stanfill C, Waltz D (1986) Toward memory-based reasoning. Commun ACM 29:1213–1228
Stephanakis IM, Iliou T, Anastassopoulos G (2017) Information feature selection: using local attribute selections to represent connected distributions in complex datasets. In: Proceedings EANN 2017, vol 744, Athens, Greece, pp 441–450. ISBN:9783319651712
Sun L, Xu J (2014) Information entropy and mutual information-based uncertainty measures in rough set theory. Appl Math Inf Sci 8(4):1973–1985
Witten IH, Frank E (2000) Data mining. Morgan Kaufman, San Francisco
Xu FF, Miao DQ, Wei L (2009) Fuzzy-rough attribute reduction via mutual information with an application to cancer classification. Comput Math Appl 57:1010–1017. https://doi.org/10.1016/j.camwa.2008.10.027
Yu L, Liu H (2003) Feature selection for high-dimensional data: a fast correlation based filter solution. In: Proceedings of the 20th international conference on machine learning, Washington, DC, USA, pp 56–63
Zeng A, Li T, Liu D, Zhang J, Chen H (2015) A fuzzy rough set approach for incremental feature selection on hybrid information systems. Fuzzy Sets Syst 258:39–60
Zhang T, Ramakrishnan R, Livny M (1996) BIRCH: an efficient data clustering method for very large databases. In: Proceedings of the ACM SIGMOD conference on management of data, Montreal, Canada
Cite this article
Stephanakis, I.M., Iliou, T. & Anastassopoulos, G. Mutual information algorithms for optimal attribute selection in data driven partitions of databases. Evolving Systems 11, 517–529 (2020). https://doi.org/10.1007/s12530-018-9237-9