Factor Preselection and Multiple Measures of Dependence

  • Conference paper

Abstract

Factor selection or factor reduction is carried out to reduce the complexity of a data analysis problem (classification, regression) or to improve the fit of a model (via parameter estimation). In data mining there is a particular need for a process by which relevant factors of influence are identified in order to achieve a balance between bias and noise. Insurance companies, for example, face data sets that contain hundreds of attributes or factors per object. With a large number of factors, the selection procedure requires a suitable process model. Such a process model becomes essential once data analysis is to be (semi-)automated.

We suggest an approach that proceeds in two phases. In the first, we cluster attributes that are highly correlated in order to identify factor combinations that are, statistically speaking, near duplicates. In the second, we choose from each cluster the factors that are highly associated with a target variable. The implementation requires some form of non-linear canonical correlation analysis. We define a correlation measure for two blocks of factors that is employed as a measure of similarity within the clustering process. Such measures, in turn, are based on multiple indices of dependence. Few such indices have been introduced in the literature; cf. Wolff (Stochastica 4(3):175–188, 1980). All of them, however, are hard to interpret if the number of dimensions considerably exceeds two. For that reason we propose signed measures that can be interpreted in the usual way.
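The two-phase idea can be illustrated with a minimal sketch. It is not the authors' implementation: in place of the block correlation measure and the signed multiple indices of dependence that the paper defines, it substitutes the absolute pairwise Spearman rank correlation, and it groups factors by a simple threshold-based single-linkage rule. It also keeps only one factor per cluster, which is a simplification. The function names (`cluster_factors`, `preselect`), the threshold of 0.9, and the toy data are assumptions made purely for illustration.

```python
from statistics import mean


def ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation; returns 0.0 if either input is constant."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / (sxx * syy) ** 0.5


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def cluster_factors(factors, threshold=0.9):
    """Phase 1 (stand-in): link factors whose |Spearman rho| >= threshold."""
    names = list(factors)
    parent = {name: name for name in names}

    def find(name):
        # Union-find lookup with path halving.
        while parent[name] != name:
            parent[name] = parent[parent[name]]
            name = parent[name]
        return name

    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if abs(spearman(factors[a], factors[b])) >= threshold:
                parent[find(a)] = find(b)

    clusters = {}
    for name in names:
        clusters.setdefault(find(name), []).append(name)
    return list(clusters.values())


def preselect(factors, target, threshold=0.9):
    """Phase 2 (stand-in): per cluster, keep the factor with the largest
    |Spearman rho| to the target variable."""
    return [
        max(cluster, key=lambda name: abs(spearman(factors[name], target)))
        for cluster in cluster_factors(factors, threshold)
    ]


# Hypothetical toy data: "age_months" is a near duplicate of "age".
toy_factors = {
    "age": [23, 35, 47, 51, 62, 70],
    "age_months": [276, 420, 564, 612, 744, 840],
    "mileage": [5, 1, 4, 2, 6, 3],
}
toy_target = [1.0, 2.1, 2.9, 4.2, 5.1, 5.8]

print(cluster_factors(toy_factors))
print(preselect(toy_factors, toy_target))
```

Pairwise rank correlation only detects monotone dependence between two single factors. The paper's point is that a measure between two blocks of factors, built on multiple indices of dependence, is needed once near duplicates involve combinations of factors rather than pairs.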


Notes

  1. We use Event-driven Process Chains (EPC) as a modeling language. Details can be found in Becker and Schütte (1996).

References

  • Becker, J. & Schütte, R. (1996). Handelsinformationssysteme. Verl. Moderne Industrie, Landsberg/Lech.

  • Hall, M. A. (1999). Correlation-based feature subset selection for machine learning. PhD thesis, Department of Computer Science, University of Waikato, Hamilton, New Zealand.

  • Kiesl, H. (2003). Ordinale Streuungsmasse: Theoretische Fundierung und statistische Anwendung. PhD thesis, Universität Bamberg.

  • Rényi, A. (1958). On measures of dependence. Acta Mathematica Hungarica, 9, 441–451.

  • Rüschendorf, L. (1976). Asymptotic distributions of multivariate rank order statistics. The Annals of Statistics, 4, 912–923.

  • Rüschendorf, L. (2009). On the distributional transform, Sklar’s theorem, and the empirical copula process. Journal of Statistical Planning and Inference, 139, 3921–3927.

  • Schmid, F., Blumentritt, T., Gaißer, S., Ruppert, M., & Schmidt, R. (2010). Copula-based measures of multivariate association. In F. Durante, W. Härdle, P. Jaworski, & T. Rychlik (Eds.), Workshop on copula theory and its applications, Warsaw. Berlin Heidelberg: Springer-Verlag.

  • Witting, H., & Müller-Funk, U. (1995). Mathematische Statistik II. Stuttgart: Teubner Verlag.

  • Wolff, E. F. (1980). N-dimensional measures of dependence. Stochastica, 4(3), 175–188.

Corresponding author

Correspondence to Nina Büchel.

Copyright information

© 2013 Springer International Publishing Switzerland

About this paper

Cite this paper

Büchel, N., Hildebrand, K.F., Müller-Funk, U. (2013). Factor Preselection and Multiple Measures of Dependence. In: Lausen, B., Van den Poel, D., Ultsch, A. (eds) Algorithms from and for Nature and Life. Studies in Classification, Data Analysis, and Knowledge Organization. Springer, Cham. https://doi.org/10.1007/978-3-319-00035-0_22
