Abstract
In this article we consider objects for which a matrix of pairwise dissimilarities is available, and we study how these objects relate to covariates. We focus on state sequences, for which pairwise dissimilarities are given, for instance, by edit distances. The methods discussed apply, however, to any kind of object and any dissimilarity measure. We start with a generalization of the analysis of variance (ANOVA) that assesses the link between complex objects (e.g. sequences) and a given categorical variable. The key point is that the discrepancy among objects can be derived from the pairwise dissimilarities alone, which makes it possible to identify the factors that most reduce this discrepancy. We present a general statistical test and introduce an original way of rendering the results for state sequences. We then generalize the method to the case of more than one factor and discuss its advantages and limitations, especially regarding interpretation. Finally, we introduce a new tree method for analyzing the discrepancy of complex objects that uses the former test as its splitting criterion. We demonstrate the scope of the methods through a study of the factors that most discriminate Swiss occupational trajectories. All methods presented are freely accessible in our TraMineR package for the R statistical environment.
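To make the idea of deriving discrepancy from pairwise dissimilarities concrete, here is a minimal Python sketch. It is not the TraMineR implementation. It assumes that the dissimilarities play the role of squared Euclidean distances, so that the sum of squares of n objects equals the sum of their pairwise dissimilarities divided by n, and it uses a simple label-permutation test. The function names (`sum_of_squares`, `pseudo_f`, `permutation_test`) and the toy data are hypothetical, chosen only for illustration.

```python
import numpy as np


def sum_of_squares(d):
    """Discrepancy (sum of squares) from a symmetric dissimilarity matrix.

    Assumes d has a zero diagonal, so the sum over pairs i < j is half
    the sum of the full matrix.
    """
    n = d.shape[0]
    return d.sum() / (2.0 * n)


def pseudo_f(d, groups):
    """Pseudo-F and pseudo-R2 for one categorical factor."""
    groups = np.asarray(groups)
    n = d.shape[0]
    labels = np.unique(groups)
    m = len(labels)
    ss_total = sum_of_squares(d)
    # Within-group discrepancy: one sub-matrix per group.
    ss_within = 0.0
    for g in labels:
        idx = np.flatnonzero(groups == g)
        ss_within += sum_of_squares(d[np.ix_(idx, idx)])
    ss_between = ss_total - ss_within
    f = (ss_between / (m - 1)) / (ss_within / (n - m))
    r2 = ss_between / ss_total
    return f, r2


def permutation_test(d, groups, n_perm=999, seed=0):
    """Permutation p-value: shuffle the group labels, recompute pseudo-F."""
    rng = np.random.default_rng(seed)
    groups = np.asarray(groups)
    f_obs, r2 = pseudo_f(d, groups)
    count = 0
    for _ in range(n_perm):
        f_perm, _ = pseudo_f(d, rng.permutation(groups))
        if f_perm >= f_obs:
            count += 1
    p_value = (count + 1) / (n_perm + 1)
    return f_obs, r2, p_value


# Toy data (hypothetical): six points on a line, two groups.
x = np.array([1.0, 2.0, 3.0, 7.0, 8.0, 9.0])
groups = np.array(["a", "a", "a", "b", "b", "b"])
# Squared differences stand in for the dissimilarity matrix.
d = (x[:, None] - x[None, :]) ** 2

f_obs, r2, p_value = permutation_test(d, groups)
print(f_obs, r2, p_value)
```

With squared differences as dissimilarities, the pseudo-F coincides with the classical one-way ANOVA F statistic on `x`, which gives a quick sanity check. For sequences, `d` would instead hold edit distances or any other dissimilarity, and only the permutation p-value would be used for inference.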
Copyright information
© 2010 Springer-Verlag Berlin Heidelberg
About this chapter
Cite this chapter
Studer, M., Ritschard, G., Gabadinho, A., Müller, N.S. (2010). Discrepancy Analysis of Complex Objects Using Dissimilarities. In: Guillet, F., Ritschard, G., Zighed, D.A., Briand, H. (eds) Advances in Knowledge Discovery and Management. Studies in Computational Intelligence, vol 292. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-00580-0_1
DOI: https://doi.org/10.1007/978-3-642-00580-0_1
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-00579-4
Online ISBN: 978-3-642-00580-0
eBook Packages: Engineering (R0)