
Mutual information algorithms for optimal attribute selection in data driven partitions of databases

  • Original Paper
  • Published in: Evolving Systems

Abstract

Clustering algorithms such as k-means, BIRCH, CLARANS and DBSCAN are designed to be scalable and to discover clusters in the full dimensional space of a database. Nevertheless, their behaviour depends upon the size of the database. A database or data warehouse may store terabytes of data, so complex data analysis (mining) may take a very long time to run on such a dataset. In order to accelerate information processing, one has to obtain a reduced representation of the dataset that is much smaller in volume yet produces the same, or almost the same, analytical results. Reduced representations yield simplified models that are easier to interpret, avoid the curse of dimensionality and enhance generalization by reducing overfitting. Data reduction methods include data cube aggregation, attribute subset selection, fitting data to models, dimensionality reduction and concept hierarchies, among other approaches. On the other hand, data-dependent partitions, such as Gessaman's partition and the tree-quantization partition, allow different partitions of a dataset to be processed separately; parallel processing thus becomes an option for big data. Online analytical processing is a practical approach to multi-dimensional queries in database management. Feature selection may be regarded as a specific case of a more general paradigm, called structure learning, in which an outcome is associated with a set of attributes. Feature selection aims at selecting a minimum set of features such that the probability distribution of the classes given the values of those features is as close as possible to the distribution given the values of all features. A mutual information approach that represents complex datasets in a database as a minimal set of coherent attribute sets of reduced dimensions is herein proposed.
The novelty of the proposed approach consists in employing piecewise analysis of compact clusters in order to increase the overall Shannon mutual information, as an alternative to conventional Classification and Regression Trees. Numerical data from a test-bed system for anomaly detection are provided in order to illustrate the approach.
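The selection criterion stated in the abstract, keeping the class distribution given the selected features close to the distribution given all features, is commonly approximated by maximizing mutual information. As an illustration only, not the authors' implementation, a plug-in estimate of Shannon mutual information for discrete attributes combined with the max-relevance/min-redundancy greedy selection of Peng et al. (2005, cited in the references) can be sketched as:

```python
from collections import Counter
import math

def mutual_information(xs, ys):
    """Shannon mutual information I(X;Y) in nats between two
    equal-length sequences of discrete values (plug-in estimate)."""
    n = len(xs)
    px, py = Counter(xs), Counter(ys)
    pxy = Counter(zip(xs, ys))
    mi = 0.0
    for (x, y), c in pxy.items():
        # p(x,y) * log( p(x,y) / (p(x) p(y)) ), with the counts folded in:
        # (c/n) / ((px/n)(py/n)) = c*n / (px*py)
        mi += (c / n) * math.log(c * n / (px[x] * py[y]))
    return mi

def mrmr_select(features, labels, k):
    """Greedily choose k feature columns by max-relevance /
    min-redundancy: relevance to the labels minus the mean
    redundancy with the already-chosen features."""
    chosen, remaining = [], list(range(len(features)))
    for _ in range(k):
        def score(j):
            rel = mutual_information(features[j], labels)
            red = (sum(mutual_information(features[j], features[s])
                       for s in chosen) / len(chosen)) if chosen else 0.0
            return rel - red
        best = max(remaining, key=score)
        chosen.append(best)
        remaining.remove(best)
    return chosen
```

A feature identical to the labels scores I = log 2 nats on balanced binary data, while a constant feature scores 0, so the greedy pass ranks informative attributes first. This sketch assumes discrete (or pre-discretized) attributes; the paper's data-driven partitions address the continuous case.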


Notes

  1. libvirt is an open-source API, daemon and management tool for managing platform virtualization.

  2. One may see http://www.seccrit.eu//publications/presentatons for more details.

References

  • Agrawal R, Gehrke J, Gunopulos D, Raghavan P (1998) Automatic subspace clustering of high dimensional data for data mining applications. In: ACM SIGMOD international conference on management of data, Seattle, WA, USA, pp 94–105

  • Auffarth B, Lopez M, Cerquides J (2010) Comparison of redundancy and relevance measures for feature selection in tissue classification of CT images. In: Advances in data mining. Applications and theoretical aspects. Springer, Berlin, pp 248–262

  • Battiti R (1994) Using mutual information for selecting features in supervised neural net learning. IEEE Trans Neural Netw 5:537–550

  • Breiman L, Friedman J, Olshen R, Stone C (1984) Classification and regression trees. Wadsworth, Belmont, CA

  • Darbellay GA, Vajda I (1999) Estimation of the information by an adaptive partitioning of the observation space. IEEE Trans Inf Theory 45(4):1315–1321

  • Dash M, Choi K, Scheuermann P, Liu H (2002) Feature selection for clustering—a filter solution. In: Proceedings of the second international conference on data mining, Maebashi, Japan, pp 115–122

  • Devroye L, Györfi L, Lugosi G (1996) A probabilistic theory of pattern recognition. Springer, New York

  • Doquire G, Verleysen M (2011) Mutual information based feature selection for mixed data. In: ESANN 2011 Proceedings, European symposium on artificial neural networks, computational intelligence and machine learning. Bruges (Belgium). ISBN:978-2-87419-044-5

  • Ester M, Kriegel HP, Sander J, Xu X (1996) A density-based algorithm for discovering clusters in large spatial databases with noise. In: Proceedings of the 2nd international conference on knowledge discovery in databases and data mining, Portland, Oregon

  • Fraser AM, Swinney HL (1986) Independent coordinates for strange attractors from mutual information. Phys Rev A 33:1134–1140

  • Gencaga D, Malakar N, Lary DJ (2014) Survey on the estimation of mutual information methods as a measure of dependency versus correlation analysis. In: AIP conference proceedings, Canberra, ACT, Australia. https://doi.org/10.1063/1.4903714

  • Gessaman MP (1970) A consistent nonparametric multivariate density estimator based on statistically equivalent blocks. Ann Math Stat 41:1344–1346

  • Guha S, Rastogi R, Shim K (1999) ROCK: a robust clustering algorithm for categorical attributes. In: Proceedings of the ICDE, Sydney, NSW, Australia, pp 512–521

  • Kwak N, Choi CH (2002) Input feature selection for classification problems. IEEE Trans Neural Netw 13(1):143–159

  • Li W (1990) Mutual information functions versus correlation functions. J Stat Phys 60(5/6):823–837

  • Maji P, Garai P (2013) Fuzzy-rough simultaneous attribute selection and feature extraction algorithm. IEEE Trans Cybern 43(4):1166–1177

  • Miao DQ, Hu GR (1999) A heuristic algorithm for reduction of knowledge. J Comput Res Dev 36:681–684

  • Peng H, Long F, Ding C (2005) Feature selection based on mutual information: criteria of max-relevance and min-redundancy. IEEE Trans Pattern Anal Mach Intell 27:1226–1238

  • Shirazi N, Simpson S, Oechsner S, Mauthe A, Hutchison D (2015) A framework for resilience management in the cloud. Elektrotechnik und Informationstechnik 132(2):122–132. https://doi.org/10.1007/s005002-015-0290-9

  • Stanfill C, Waltz D (1986) Toward memory-based reasoning. Commun ACM 29:1213–1228

  • Stephanakis IM, Iliou T, Anastassopoulos G (2017) Information feature selection: using local attribute selections to represent connected distributions in complex datasets. In: Proceedings EANN 2017, vol 744, Athens, Greece, pp 441–450. ISBN:9783319651712

  • Sun L, Xu J (2014) Information entropy and mutual information-based uncertainty measures in rough set theory. Appl Math Inf Sci 8(4):1973–1985

  • Witten IH, Frank E (2000) Data mining. Morgan Kaufmann, San Francisco

  • Xu FF, Miao DQ, Wei L (2009) Fuzzy-rough attribute reduction via mutual information with an application to cancer classification. Comput Math Appl 57:1010–1017. https://doi.org/10.1016/j.camwa.2008.10.027

  • Yu L, Liu H (2003) Feature selection for high-dimensional data: a fast correlation based filter solution. In: Proceedings of the 20th international conference on machine learning, Washington, DC, USA, pp 56–63

  • Zeng A, Li T, Liu D, Zhang J, Chen H (2015) A fuzzy rough set approach for incremental feature selection on hybrid information systems. Fuzzy Sets Syst 258:39–60

  • Zhang T, Ramakrishnan R, Livny M (1996) BIRCH: an efficient data clustering method for very large databases. In: Proceedings of the ACM SIGMOD conference on management of data, Montreal, Canada


Author information


Corresponding author

Correspondence to Ioannis M. Stephanakis.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


About this article


Cite this article

Stephanakis, I.M., Iliou, T. & Anastassopoulos, G. Mutual information algorithms for optimal attribute selection in data driven partitions of databases. Evolving Systems 11, 517–529 (2020). https://doi.org/10.1007/s12530-018-9237-9

