A fast and effective method to find correlations among attributes in databases

de Sousa, Elaine P. M.; Traina, Caetano; Traina, Agma J. M.; Wu, Leejay; Faloutsos, Christos

doi:10.1007/s10618-006-0056-4

A fast and effective method to find correlations among attributes in databases

Published: 10 February 2007

Volume 14, pages 367–407, (2007)
Cite this article

Data Mining and Knowledge Discovery Aims and scope Submit manuscript

Elaine P. M. de Sousa¹,
Caetano Traina Jr.¹,
Agma J. M. Traina¹,
Leejay Wu² &
…
Christos Faloutsos²

491 Accesses
28 Citations
Explore all metrics

Abstract

The problem of identifying meaningful patterns in a database lies at the very heart of data mining. A core objective of data mining processes is the recognition of inter-attribute correlations. Not only are correlations necessary for predictions and classifications – since rules would fail in the absence of pattern – but also the identification of groups of mutually correlated attributes expedites the selection of a representative subset of attributes, from which existing mappings allow others to be derived. In this paper, we describe a scalable, effective algorithm to identify groups of correlated attributes. This algorithm can handle non-linear correlations between attributes, and is not restricted to a specific family of mapping functions, such as the set of polynomials. We show the results of our evaluation of the algorithm applied to synthetic and real world datasets, and demonstrate that it is able to spot the correlated attributes. Moreover, the execution time of the proposed technique is linear on the number of elements and of correlations in the dataset.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Log in via an institution

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

Trends and Future Perspective Challenges in Big Data

A Comprehensive Survey of Clustering Algorithms

Article 01 June 2015

Density-Based Clustering Based on Hierarchical Density Estimates

References

Aha DW, Bankert RL (1995) A comparative evaluation of sequential feature selection algorithms. In: Proceedings of the 5th international workshop on artificial intelligence and statistics, Ft. Lauderdale, FL, USA, pp 1–7
Babu S, Garofalakis M, Rastogi R (2001) SPARTAN: a model-based semantic compression system for massive data tables. In: Proceedings of the 2001 ACM SIGMOD international conference on management of data (SIGMOD’01), Santa Barbara, CA, USA, pp 283–294
Balan AGR, Traina AJM, Traina C Jr, Azevedo-Marques PM (2005) Fractal analysis of image textures for indexing and retrieval by content. In: Proceedings of the 18th IEEE symposium on computer-based medical systems (CBMS’05), Dublin, Ireland, pp 581–586
Barbara D, Chen P (2003) Using self-similarity to cluster large data sets. Data Min Knowl Discov 7(2):123–152
Article MathSciNet Google Scholar
Belussi A, Faloutsos C (1995) Estimating the selectivity of spatial queries using the ‘Correlation Fractal Dimension’. In: Proceedings of the 21st international conference on very large data bases (VLDB’95), Zurich, Switzerland, pp 299–310
Belussi A, Faloutsos C (1998) Self-spatial join selectivity estimation using fractal concepts. ACM Trans Inf Syst 16(2):161–201
Article Google Scholar
Böhm C (2000) A cost model for query processing in high dimensional data spaces. ACM Trans Database Syst 25(2):129–178
Article Google Scholar
Böhm C, Kriegel H-P (2000) Dynamically optimizing high-dimensional index structures. In: Advances in Database Technology—EDBT 2000, 7th international conference on extending database technology. Lecture notes in computer science, vol 1777, Konstanz, Germany, pp 36–50
Blake C, Merz C (1998) UCI repository of machine learning databases. http://www.ics.uci. edu/∼mlearn/MLRePository.html
Blum AL, Langley P (1997) Selection of relevant features and examples in machine learning. Artif Intell 97(1–2):245–271
Article MATH MathSciNet Google Scholar
Calvo R, Partridge M, Jabri M (1998) A comparative study of principal component analysis techniques. In: Proceedings of the 9th Australian conference on neural networks, Brisbane, Australia
Chakrabarti D, Faloutsos C (2002) F4: large-scale automated forecasting using fractals. In: International conference on information and knowledge management (CIKM), vol 1, McLean, EUA, pp 2–9
Chakrabarti K, Keogh E, Mehrotra S, Pazzani M (2002) Locally adaptive dimensionality reduction for indexing large time series databases. ACM Trans Database Syst (TODS) 27(2):188–228
Article Google Scholar
Chang K, Ghosh J (1998) Principal curves for nonlinear feature extraction and classification. Appl Artif Neural Netw Image Process III (SPIE) 3307:120–129
Google Scholar
Dash M, Liu H, Yao J (1997) Dimensionality reduction for unsupervised data. In: Proceedings of the 9th IEEE international conference on tools with artificial intelligence (ICTAI’97), Newport Beach, CA, USA, pp 532–539
DeMers D, Cottrell G (1993) Non-linear dimensionality reduction. In: Advances in neural information processing systems (NIPS conference), vol 5, Denver, CO, USA, pp 580–587
Faloutsos C, Seeger B, Traina AJM, Traina C (2000) Spatial join selectivity using power laws. In: Proceedings of the 2000 ACM SIGMOD international conference on management of data (SIGMOD’00), Dallas, TX, USA, pp 177–188
Fayyad UM (1998) Mining databases: towards algorithms for knowledge discovery. IEEE Data Eng Bull 21(1):39–48
Google Scholar
Hyvärinen A (1999) Survey on independent component analysis. Neural Comput Surv 2:94–128
Google Scholar
Jebara TS, Jaakkola TS (2000) Feature selection and dualities in maximum entropy discrimination. In: Proceedings of the 16th conference on uncertainty in artificial intelligence (UAI’2000), San Francisco, CA, USA, pp 291–300
Jiang D, Tang C, Zhang A (2004) Cluster analysis for gene expression data: a survey. IEEE Trans Knowl Data Eng 16(11):1370–1386
Article Google Scholar
John GH, Kohavi R, Pfleger K (1994) Irrelevant features and the subset selection problem. In: Proceedings of the 11th international conference on machine learning (ICML’94), New Brunswick, NJ, USA, pp 121–129
Kamel I, Faloutsos C (1994) Hilbert R-tree: an improved R-tree using fractals. In: Proceedings of 20th international conference on very large data bases (VLDB’94), Santiago de Chile, Chile, pp 500–509
Kantardzic M, Sadeghian P, Shen C (2004) The time diversification monitoring of a stock portfolio: an approach based on the fractal dimension. In: Proceedings of the 2004 ACM symposium on applied computing (SAC’04), pp 637–641
Kanth KVR, Agrawal D, Singh AK (1998) Dimensionality reduction for similarity searching in dynamic databases. In: Proceedings of the 1998 ACM SIGMOD international conference on management of data (SIGMOD’98), Seattle, WA, USA, pp 166–176
Keogh EJ, Chakrabarti K, Mehrotra S, Pazzani MJ (2001) Locally adaptive dimensionality reduction for indexing large time series databases. In: Proceedings of the 2001 ACM SIGMOD international conference on management of data (SIGMOD’01), Santa Barbara, CA, USA, pp 151–162
Kira K, Rendell LA (1992) A practical approach to feature selection. In: Proceedings of the 9th international workshop on machine learning (ML’92), Aberdeen, Scotland, pp 249–256
Kononenko I (1994) Estimating attributes: analysis and extensions of RELIEF. In: Proceeding of European conference on machine learning (ECML-94). Lecture notes in computer science, vol 784, Catania, Italy, pp 171–182
Korn F, Jagadish HV, Faloutsos C (1997) Efficiently supporting ad hoc queries in large datasets of time sequences. In: Proceedings of the 1997 ACM SIGMOD international conference on management of data (SIGMOD’97), Tucson, AZ, USA, pp 289–300
Korn F, Pagel B-U, Faloutsos C (2001) On the ‘dimensionality curse’ and the self-similarity blessing. IEEE Trans Knowl Data Eng 13(1):96–111
Article Google Scholar
Kramer MA (1991) Nonlinear principal component analysis using autoassociative neural networks. Am Inst Chem Eng J (AIChE) 37(2):233–243
Google Scholar
Langley P, Sage S (1997) Scaling to domains with many irrelavant features. In: Computational learning theory and natural learning system, vol IV, Cambrige, MA, USA
Liu H, Yu L (2005) Toward integrating feature selection algorithms for classification and clustering. IEEE Trans Knowl Data Eng 17(4):491–502
Article Google Scholar
Liu Y, Lazar NA, Rothfus WE, Buzoianu M, Kanade T (2001) Classification-driven feature space reduction for semantic-based neuroimage retrieval. In: VISIM workshop: information retrieval and exploration in large medical image collections, Utrecht, The Netherlands
Mitra P, Murthy C, Pal S (2002) Unsupervised feature selection using feature similarity. IEEE Trans Pattern Anal Mach Intel 24(4):301–312
Article Google Scholar
Moon B, Jagadish HV, Faloutsos C, Saltz JH (2001) Analysis of the clustering properties of the Hilbert space-filling curve. IEEE Trans Knowl Data Eng 13(1):124–141
Article Google Scholar
Pagel B-U, Korn F, Faloutsos C (2000) Deflating the dimensionality curse using multiple fractal dimensions. In: Proceedings of the 16th international conference on data engineering (ICDE’00), San Diego, CA, USA, pp 589–598
Quinlan JR (1993) C4.5: programs for machine learning. Morgan Kaufmann, San Francisco, CA, USA
Google Scholar
Reynolds HT (1977) The analysis of cross-classifications. The Free Press, New York, NY, USA
Google Scholar
Robnik-Sikonja M (1997) CORE—a system that predicts continuous variables. In: Proceedings of ERK’97, Portoroz, Slovenia
Roweis ST, Saul LK (2000) Nonlinear dimensionality reduction by locally linear embedding. Science 290:2323–2326
Article Google Scholar
Scherf M, Brauer W (1997) Feature selection by means of a feature weighting approach. Technical report FKI-221-97, Munich University of Technology, Munich, Germany
Schölkopf AS, Müller K-R (1998) Nonlinear component analysis as a Kernel Eigenvalue problem. Neural Comput 10:1299–1319
Article Google Scholar
Schroeder M, (1991) Fractals, chaos, power laws: minutes from an infinite paradise. W. H. Freeman and Company, New York, NY, USA
MATH Google Scholar
Singh M, Provan GM (1995) A comparison of induction algorithms for selective and non-selective bayesian classifiers. In: Proceedings of the 12th international conference on machine learning (ICML’95), Tahoe City, CA, USA, pp 497–505
Takahashi T, Tokunaga R (1999) Nonlinear dimensionality reduction by multi layer perceptron using superposed energy. In: Proceedings of 1999 international symposium on nonlinear theory and its applications, vol 2, Hawaii, USA, pp 863–866
Tenenbaum JB, de Silva V, Langford JC (2000) A global geometric framework for nonlinear dimensionality reduction. Science 290:2319–2323
Article Google Scholar
Traina A, Traina C, Papadimitriou S, Faloutsos C (2001) Tri-Plots: scalable tools for multidimensional data mining. In: Proceedings of the 7th ACM SIGKDD international conference on knowledge discovery and data mining (KDD’01), San Francisco, CA, USA, pp 184–193
Traina AJM, Castan̆ón CAB, Traina C (2003) MultiWaveMed: a system for medical image retrieval by wavelets transformations. In: Proceedings of the 16th IEEE symposium on computer-based medical systems, New York, NY, USA, pp 150–155
Traina C, Traina A, Wu L, Faloutsos C (2000) Fast feature selection using fractal dimension. In: Anais do XV Simpósio Brasileiro de Banco de Dados (SBBD’00), Joäo Pessoa, Brasil, pp 158–171
Traina C, Traina AJM, Faloutsos C (1999) Distance exponent: a new concept for selectivity estimation in metric trees. Technical report CMU-CS-99-110, Carnegie Mellon University, School of Computer Science, Pittsburgh, PA, USA
Traina C, Traina AJM, Faloutsos C, Seeger B (2002) Fast indexing and visualization of metric data sets using Slim-Trees. IEEE Trans Knowl Data Eng 14(2):244–260
Article Google Scholar
Turk M, Pentland A (1991) Eigenfaces for recognition. J Cogn Neurosci 3(1):71–86
Article Google Scholar
Vafaie H, DeJong K (1993) Robust feature selection algorithms. In: Proceedings of the 5th international conference on tools with artificial intelligence (ICTAI ’93), Boston, MA, USA, pp 356–363
Vailaya A, Figueiredo MAT, Jain AK, Zhang H-J (2001) Image classification for content-based indexing. IEEE Trans Image Process 10(1):117–130
Article MATH Google Scholar
Wactlar HD, Kanade T, Smith MA, Stevens SM (1996) Intelligent access to digital video: informedia project. IEEE Comput 29(5):46–52
Google Scholar
Witten IH, Frank E (2000) Data mining: practical machine learning tools with Java implementations. Morgan Kaufmann, San Francisco, CA, USA
Google Scholar

Download references

Author information

Authors and Affiliations

Department of Computer Science, University of São Paulo at São Carlos, São Carlos, Brazil
Elaine P. M. de Sousa, Caetano Traina Jr. & Agma J. M. Traina
Department of Computer Science, Carnegie Mellon University, Pittsburgh, PA, USA
Leejay Wu & Christos Faloutsos

Authors

Elaine P. M. de Sousa
View author publications
You can also search for this author in PubMed Google Scholar
Caetano Traina Jr.
View author publications
You can also search for this author in PubMed Google Scholar
Agma J. M. Traina
View author publications
You can also search for this author in PubMed Google Scholar
Leejay Wu
View author publications
You can also search for this author in PubMed Google Scholar
Christos Faloutsos
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Elaine P. M. de Sousa.

Additional information

Responsible editor: E. Keogh.

Rights and permissions

Reprints and permissions

About this article

Cite this article

de Sousa, E.P.M., Traina, C., Traina, A.J.M. et al. A fast and effective method to find correlations among attributes in databases. Data Min Knowl Disc 14, 367–407 (2007). https://doi.org/10.1007/s10618-006-0056-4

Download citation

Received: 07 July 2005
Accepted: 19 July 2006
Published: 10 February 2007
Issue Date: June 2007
DOI: https://doi.org/10.1007/s10618-006-0056-4

Keywords

Access this article

Log in via an institution

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

A fast and effective method to find correlations among attributes in databases

Abstract

Access this article

Similar content being viewed by others

Trends and Future Perspective Challenges in Big Data

A Comprehensive Survey of Clustering Algorithms

Density-Based Clustering Based on Hierarchical Density Estimates

References

Author information

Authors and Affiliations

Corresponding author

Additional information

Rights and permissions

About this article

Cite this article

Keywords

Navigation

A fast and effective method to find correlations among attributes in databases

Abstract

Access this article

Similar content being viewed by others

Trends and Future Perspective Challenges in Big Data

A Comprehensive Survey of Clustering Algorithms

Density-Based Clustering Based on Hierarchical Density Estimates

References

Author information

Authors and Affiliations

Corresponding author

Additional information

Rights and permissions

About this article

Cite this article

Share this article

Keywords

Search

Navigation