Abstract
Clustering algorithms such as k-means, BIRCH, CLARANS and DBSCAN are designed to be scalable and to discover clusters in the full-dimensional space of a database. Nevertheless, their performance depends upon the size of the database. A database or data warehouse may store terabytes of data, and complex data analysis (mining) may take a very long time to run on such a dataset. To accelerate information processing, one has to obtain a reduced representation of the dataset that is much smaller in volume yet produces the same, or almost the same, analytical results. Reduced representations yield simplified models that are easier to interpret, avoid the curse of dimensionality and enhance generalization by reducing overfitting. Data reduction methods include data cube aggregation, attribute subset selection, fitting data into models, dimensionality reduction and concept hierarchies, among other approaches. On the other hand, data-dependent partitions, such as Gessaman's partition and tree-quantization partitions, allow different partitions of a dataset to be processed separately; hence parallel processing becomes an option for big data. Online analytical processing is a practical approach to multi-dimensional queries in database management. Feature selection may be regarded as a special case of a more general paradigm called structure learning, in which an outcome is associated with a set of attributes. It aims at selecting a minimum set of features such that the probability distribution of the classes given the values of those features is as close as possible to the distribution given the values of all features. A mutual information approach based upon representing complex datasets in a database as a minimal set of coherent attribute sets of reduced dimensions is herein proposed.
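The mutual-information criterion for feature selection described above can be sketched as follows. This is a minimal illustration on discrete data using the max-relevance idea (rank features by I(feature; class)); the dataset, feature names and greedy ranking are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch: rank discrete features by mutual information with the
# class label, I(X;C) = H(X) + H(C) - H(X,C), and keep the top k.
from collections import Counter
from math import log2

def entropy(values):
    """Shannon entropy H(X) of a sequence of discrete values."""
    n = len(values)
    return -sum((c / n) * log2(c / n) for c in Counter(values).values())

def mutual_information(xs, ys):
    """I(X;Y) = H(X) + H(Y) - H(X,Y) for paired discrete sequences."""
    return entropy(xs) + entropy(ys) - entropy(list(zip(xs, ys)))

def select_features(columns, labels, k):
    """Greedily pick the k features with the highest I(feature; class)."""
    scores = {name: mutual_information(col, labels)
              for name, col in columns.items()}
    return sorted(scores, key=scores.get, reverse=True)[:k]

# Toy dataset: f1 determines the class exactly, f2 is mostly noise.
data = {
    "f1": [0, 0, 1, 1, 0, 1],
    "f2": [1, 0, 1, 0, 0, 1],
}
labels = ["a", "a", "b", "b", "a", "b"]
print(select_features(data, labels, 1))  # -> ['f1']
```

Here f1 attains I(f1; C) = H(C), the maximum possible, so it is selected first; redundancy terms (as in mRMR) would be needed once features start to overlap.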
The novelty of the proposed approach lies in employing piecewise analysis of compact clusters in order to increase overall Shannon mutual information/entropy, as an alternative to conventional Classification and Regression Trees. Numerical data from a test-bed system for anomaly detection are provided to illustrate the approach.
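The data-dependent partitions mentioned in the abstract can be illustrated with Gessaman's statistically equivalent blocks: split the sample into k slabs along one axis, then split each slab into k cells along the next, so every cell holds roughly n/k² points. The 2-D restriction and the chunking rule below are simplifying assumptions for illustration only.

```python
# Minimal sketch of Gessaman's partition in 2-D: axis-aligned,
# data-dependent cells with (nearly) equal point counts.

def chunks(seq, k):
    """Split seq into k contiguous chunks of nearly equal size."""
    n = len(seq)
    out, start = [], 0
    for i in range(k):
        end = start + n // k + (1 if i < n % k else 0)
        out.append(seq[start:end])
        start = end
    return out

def gessaman_partition(points, k):
    """Partition 2-D points into k*k statistically equivalent cells."""
    slabs = chunks(sorted(points, key=lambda p: p[0]), k)
    return [cell
            for slab in slabs
            for cell in chunks(sorted(slab, key=lambda p: p[1]), k)]

points = [(x, y) for x in range(4) for y in range(4)]  # 16 grid points
cells = gessaman_partition(points, 2)
print([len(c) for c in cells])  # -> [4, 4, 4, 4]
```

Because every cell carries the same number of samples, per-cell statistics (e.g. local entropy or mutual-information estimates) are comparably reliable, and the cells can be processed in parallel, which is the point made above about big data.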
Notes
libvirt is an open-source API, daemon and management tool for managing platform virtualization.
One may see http://www.seccrit.eu//publications/presentatons for more details.
References
Agrawal R, Gehrke J, Gunopulos D, Raghavan P (1998) Automatic subspace clustering of high dimensional data for data mining applications. In: ACM SIGMOD international conference on management of data, Seattle, WA, USA, pp 94–105
Auffarth B, Lopez M, Cerquides J (2010) Comparison of redundancy and relevance measures for feature selection in tissue classification of CT images. In: Advances in Data Mining: Applications and Theoretical Aspects. Springer, Berlin, pp 248–262
Battiti R (1994) Using mutual information for selecting features in supervised neural net learning. IEEE Trans Neural Netw 5:537–550
Breiman L, Friedman J, Olshen R, Stone C (1984) Classification and Regression Trees. Wadsworth, Belmont, CA
Darbellay GA, Vajda I (1999) Estimation of the information by an adaptive partition of the observed space. IEEE Trans Inf Theory 45(4):1315–1321
Dash M, Choi K, Scheuermann P, Liu H (2002) Feature selection for clustering—a filter solution. In: Proceedings of the second international conference on data mining, Maebashi, Japan, pp 115–122
Devroye L, Gyorfi L, Lugosi G (1996) A probabilistic theory of pattern recognition. Springer, New York
Doquire G, Verleysen M (2011) Mutual information based feature selection for mixed data. In: ESANN 2011 Proceedings, European symposium on artificial neural networks, computational intelligence and machine learning. Bruges (Belgium). ISBN:978-2-87419-044-5
Ester M, Kriegel HP, Sander J, Xu X (1996) A density-based algorithm for discovering clusters in large spatial databases with noise. In: Proceedings of the 2nd international conference on knowledge discovery in databases and data mining, Portland, Oregon
Fraser AM, Swinney HL (1986) Independent coordinates for strange attractors from mutual information. Phys Rev A 33:1134–1140
Gencaga D, Malakar N, Lary DJ (2014) Survey on the estimation of mutual information methods as a measure of dependency versus correlation analysis. In: AIP conference proceedings, Canberra, ACT, Australia. https://doi.org/10.1063/1.4903714
Gessaman MP (1970) A consistent nonparametric multivariate density estimator based on statistically equivalent blocks. Ann Math Stat 41:1344–1346
Guha S, Rastogi R, Shim K (1999) ROCK: a robust clustering algorithm for categorical attributes. In: Proceedings of the ICDE, Sydney, NSW, Australia, pp 512–521
Kwak N, Choi CH (2002) Input feature selection for classification problems. IEEE Trans Neural Netw 13(1):143–159
Li W (1990) Mutual information functions versus correlation functions. J Stat Phys 60(5/6):823–837
Maji P, Garai P (2013) Fuzzy-rough simultaneous attribute selection and feature extraction algorithm. IEEE Trans Cybern 43(4):1166–1177
Miao DQ, Hu GR (1999) A heuristic algorithm for reduction of knowledge. J Comput Res Dev 36:681–684
Peng H, Long F, Ding C (2005) Feature selection based on mutual information: criteria of max-relevance and min-redundancy. IEEE Trans Pattern Anal Mach Intell 27:1226–1238
Shirazi N, Simpson S, Oechsner S, Mauthe A, Hutchison D (2015) A framework for resilience management in the cloud. Elektrotech Informationstech 132(2):122–132. https://doi.org/10.1007/s00502-015-0290-9
Stanfill C, Waltz D (1986) Toward memory-based reasoning. Commun ACM 29:1213–1228
Stephanakis IM, Iliou T, Anastassopoulos G (2017) Information feature selection: using local attribute selections to represent connected distributions in complex datasets. In: Proceedings EANN 2017, vol 744, Athens, Greece, pp 441–450. ISBN:9783319651712
Sun L, Xu J (2014) Information entropy and mutual information-based uncertainty measures in rough set theory. Appl Math Inf Sci 8(4):1973–1985
Witten IH, Frank E (2000) Data mining. Morgan Kaufman, San Francisco
Xu FF, Miao DQ, Wei L (2009) Fuzzy-rough attribute reduction via mutual information with an application to cancer classification. Comput Math Appl 57:1010–1017. https://doi.org/10.1016/j.camwa.2008.10.027
Yu L, Liu H (2003) Feature selection for high-dimensional data: a fast correlation based filter solution. In: Proceedings of the 20th international conference on machine learning, Washington, DC, USA, pp 56–63
Zeng A, Li T, Liu D, Zhang J, Chen H (2015) A fuzzy rough set approach for incremental feature selection on hybrid information systems. Fuzzy Sets Syst 258:39–60
Zhang T, Ramakrishnan R, Livny M (1996) BIRCH: an efficient data clustering method for very large databases. In: Proceedings of the ACM SIGMOD conference on management of data, Montreal, Canada
Cite this article
Stephanakis, I.M., Iliou, T. & Anastassopoulos, G. Mutual information algorithms for optimal attribute selection in data driven partitions of databases. Evolving Systems 11, 517–529 (2020). https://doi.org/10.1007/s12530-018-9237-9