Factor Preselection and Multiple Measures of Dependence

Büchel, Nina; Hildebrand, Kay F.; Müller-Funk, Ulrich

doi:10.1007/978-3-319-00035-0_22

Conference paper
First Online: 01 January 2013

Part of the book series: Studies in Classification, Data Analysis, and Knowledge Organization (STUDIES CLASS)

Abstract

Factor selection or factor reduction is carried out to reduce the complexity of a data analysis problem (classification, regression) or to improve the fit of a model (via parameter estimation). In data mining there is a particular need for a process by which relevant factors of influence are identified in order to achieve a balance between bias and noise. Insurance companies, for example, face data sets that contain hundreds of attributes or factors per object. With such a large number of factors, the selection procedure requires a suitable process model. Such a process becomes essential once data analysis is to be (semi-)automated.

We suggest an approach that proceeds in two phases. In the first, we cluster attributes that are highly correlated in order to identify factor combinations that, statistically speaking, are near duplicates. In the second, we choose from each cluster the factors that are highly associated with a target variable. The implementation requires some form of non-linear canonical correlation analysis. We define a correlation measure for two blocks of factors that is employed as a measure of similarity within the clustering process. Such measures, in turn, are based on multiple indices of dependence. Few such indices have been introduced in the literature, cf. Wolff (1980). All of them, however, are hard to interpret if the number of dimensions considerably exceeds two. For that reason we propose signed measures that can be interpreted in the usual way.
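
The two-phase procedure described in the abstract can be illustrated with a minimal sketch. It is not the authors' method: the block correlation measure based on multiple indices of dependence is replaced here by the pairwise absolute Spearman rank correlation, and the clustering by a simple single-linkage grouping above a threshold. The function names, the threshold of 0.9, and the toy insurance-style data (age, tenure, claims, premium) are all assumptions made for illustration only.

```python
from itertools import combinations
from statistics import mean


def ranks(values):
    """Return 1-based average ranks, so tied values share a rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            result[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return result


def spearman(x, y):
    """Spearman rank correlation, i.e. Pearson correlation of the ranks."""
    rx, ry = ranks(x), ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    return 0.0 if vx == 0 or vy == 0 else cov / (vx * vy) ** 0.5


def cluster_factors(factors, threshold):
    """Phase 1 (stand-in): group factors linked by |rho| >= threshold."""
    parent = {name: name for name in factors}

    def find(name):
        while parent[name] != name:
            parent[name] = parent[parent[name]]
            name = parent[name]
        return name

    for a, b in combinations(factors, 2):
        if abs(spearman(factors[a], factors[b])) >= threshold:
            parent[find(a)] = find(b)

    clusters = {}
    for name in factors:
        clusters.setdefault(find(name), []).append(name)
    return list(clusters.values())


def preselect(factors, target, threshold=0.9):
    """Phase 2 (stand-in): per cluster, keep the factor most associated with the target."""
    return [
        max(cluster, key=lambda name: abs(spearman(factors[name], target)))
        for cluster in cluster_factors(factors, threshold)
    ]


# Hypothetical toy data: "age" and "tenure" are near duplicates in rank terms.
factors = {
    "age": [23, 35, 47, 51, 62, 29],
    "tenure": [1, 8, 20, 15, 30, 4],
    "claims": [3, 1, 0, 2, 1, 4],
}
premium = [210.0, 340.0, 520.0, 410.0, 600.0, 280.0]  # hypothetical target

clusters = cluster_factors(factors, threshold=0.9)
selected = preselect(factors, premium, threshold=0.9)
```

In this sketch exactly one representative is kept per cluster, whereas the abstract allows several factors to be chosen from each cluster; a pairwise measure also cannot capture dependence between blocks of factors, which is the purpose of the measures proposed in the paper.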

Notes

1. We use Event-driven Process Chains (EPC) as a modeling language. Details can be found in Becker and Schütte (1996).

References

Becker, J., & Schütte, R. (1996). Handelsinformationssysteme. Landsberg/Lech: Verlag Moderne Industrie.
Hall, M. A. (1999). Correlation-based feature subset selection for machine learning. PhD thesis, Department of Computer Science, University of Waikato, Hamilton, New Zealand.
Kiesl, H. (2003). Ordinale Streuungsmasse: Theoretische Fundierung und statistische Anwendung. PhD thesis, Universität Bamberg.
Rényi, A. (1958). On measures of dependence. Acta Mathematica Hungarica, 9, 441–451.
Rüschendorf, L. (1976). Asymptotic distributions of multivariate rank order statistics. The Annals of Statistics, 4, 912–923.
Rüschendorf, L. (2009). On the distributional transform, Sklar's theorem, and the empirical copula process. Journal of Statistical Planning and Inference, 139, 3921–3927.
Schmid, F., Blumentritt, T., Gaißer, S., Ruppert, M., & Schmidt, R. (2010). Copula-based measures of multivariate association. In F. Durante, W. Härdle, P. Jaworski, & T. Rychlik (Eds.), Workshop on copula theory and its applications, Warsaw. Berlin Heidelberg: Springer-Verlag.
Witting, H., & Müller-Funk, U. (1995). Mathematische Statistik II. Stuttgart: Teubner Verlag.
Wolff, E. F. (1980). N-dimensional measures of dependence. Stochastica, 4(3), 175–188.

Author information

Authors and Affiliations

European Research Center for Information Systems (ERCIS), University of Münster, Münster, Germany
Nina Büchel, Kay F. Hildebrand & Ulrich Müller-Funk

Corresponding author

Correspondence to Nina Büchel.

Editor information

Editors and Affiliations

University of Essex Department of Mathematical Sciences, Colchester, United Kingdom
Berthold Lausen
Ghent University Department of Marketing, Ghent, Belgium
Dirk Van den Poel
University of Marburg Databionics, FB 12, Marburg, Germany
Alfred Ultsch

About this paper

Cite this paper

Büchel, N., Hildebrand, K.F., Müller-Funk, U. (2013). Factor Preselection and Multiple Measures of Dependence. In: Lausen, B., Van den Poel, D., Ultsch, A. (eds) Algorithms from and for Nature and Life. Studies in Classification, Data Analysis, and Knowledge Organization. Springer, Cham. https://doi.org/10.1007/978-3-319-00035-0_22

DOI: https://doi.org/10.1007/978-3-319-00035-0_22
Published: 16 July 2013
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-00034-3
Online ISBN: 978-3-319-00035-0
eBook Packages: Mathematics and Statistics (R0)
