Abstract
Factor selection or factor reduction is carried out to reduce the complexity of a data analysis problems (classification, regression) or to improve the fit of a model (via parameter estimation). In data mining there are special needs for a process by which relevant factors of influence are identified in order to achieve a balance between bias and noise. Insurance companies, for example, face data sets that contain hundreds of attributes or factors per object. With a large number of factors, the selection procedure requires a suitable process model. A process like that becomes compelling once data analysis is to be (semi) automated.We suggest an approach that proceeds in two phases: In the first one, we cluster attributes that are highly correlated in order to identify factor combinations that—statistically speaking—are near duplicates. In the second phase, we choose factors from each cluster that are highly associated with a target variable. The implementation requires some form of non-linear canonical correlation analysis. We define a correlation measure for two blocks of factors that will be employed as a measure of similarity within the clustering process. Such measures, in turn, are based on multiple indices of dependence. Few indices have been introduced cf. Wolff (Stochastica 4(3):175–188, 1980), ‘Few indices have been introduced in the literature’. All of them, however, are hard to interpret if the number of dimensions considerably exceeds two. For that reason we come up with signed measures that can be interpreted in the usual way.
This is a preview of subscription content, log in via an institution.
Buying options
Tax calculation will be finalised at checkout
Purchases are for personal use only
Learn about institutional subscriptionsNotes
- 1.
We use Event-driven Process Chains (EPC) as a modeling language. Details can be found in Becker and Schütte (1996).
References
Becker, J. & Schütte, R. (1996). Handelsinformationssysteme. Verl. Moderne Industrie, Landsberg/Lech.
Hall, M. A. (1999). Correlation-based feature subset selection for machine learning. PhD thesis, Department of Computer Science, University of Waikato, Hamilton, New Zealand.
Kiesl, H. (2003). Ordinale Streuungsmasse: Theoretische Fundierung und statistische Anwendung. PhD thesis, Universität Bamberg.
Renyi, A. (1958). On measures of dependence. Acta mathematica hungarica, 9, 441–451.
Rüschendorf, L. (1976). Asymptotic distributions of multivariate rank order statistics. The Annals of Statistics, 4, 912–923.
Rüschendorf, L. (2009). On the distributional transform, sklar’s theorem, and the empirical copula process. Journal of Statistical Planning and Inference, 139, 3921–3927.
Schmid, F., Blumentritt, T., Gaißer, S., Ruppert, M., & Schmidt, R. (2010). Copula-based measures of multivariate association. In F. Durante, W. Härdle, P. Jaworski, & T. Rychlik (Eds.), Workshop on copula theory and its applications, Warsaw. Berlin Heidelberg: Springer-Verlag.
Witting, H., & Müller-Funk, U. (1995). Mathematische Statistik II. Stuttgart: Teubner Verlag.
Wolff, E. F. (1980). N-dimensional measures of dependence. Stochastica, 4(3), 175–188.
Author information
Authors and Affiliations
Corresponding author
Editor information
Editors and Affiliations
Rights and permissions
Copyright information
© 2013 Springer International Publishing Switzerland
About this paper
Cite this paper
Büchel, N., Hildebrand, K.F., Müller-Funk, U. (2013). Factor Preselection and Multiple Measures of Dependence. In: Lausen, B., Van den Poel, D., Ultsch, A. (eds) Algorithms from and for Nature and Life. Studies in Classification, Data Analysis, and Knowledge Organization. Springer, Cham. https://doi.org/10.1007/978-3-319-00035-0_22
Download citation
DOI: https://doi.org/10.1007/978-3-319-00035-0_22
Published:
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-00034-3
Online ISBN: 978-3-319-00035-0
eBook Packages: Mathematics and StatisticsMathematics and Statistics (R0)