Abstract
In some situations, the distribution of the error terms of a multivariate linear regression model may depart from normality. This problem has been addressed, for example, by specifying a different parametric family for the error distribution, such as multivariate skewed and/or heavy-tailed distributions. A new solution is proposed, in which the error distribution is modelled as a finite mixture of multivariate Gaussian components. The multivariate linear regression model is studied under this assumption. Identifiability conditions are proved, and maximum likelihood estimation of the model parameters is performed using the EM algorithm. The number of mixture components is chosen through model selection criteria; when this number equals one, the proposal reduces to the classical approach. The performance of the proposed approach is evaluated through Monte Carlo experiments and compared with that of other approaches. Finally, the results obtained from the analysis of a real dataset are presented.
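To make the idea in the abstract concrete, the following is a minimal sketch, not the authors' implementation. It assumes a particular parametrization that the abstract does not spell out: each Gaussian component has its own mean vector (which absorbs the regression intercept) and its own covariance matrix, while the slope matrix B is shared by all components. Estimation uses an ECM variant of EM (the component parameters and the slopes are updated in two conditional steps), and BIC stands in for the unspecified "model selection criteria". The function names, the PCA-based starting partition, the small ridge term, and the simulated data are all hypothetical choices made for illustration.

```python
import numpy as np

def log_gauss(R, mean, cov):
    """Log density of N(mean, cov) at each row of R."""
    d = R.shape[1]
    L = np.linalg.cholesky(cov)
    z = np.linalg.solve(L, (R - mean).T)            # whitened deviations, shape (d, n)
    logdet = 2.0 * np.log(np.diag(L)).sum()
    return -0.5 * (d * np.log(2.0 * np.pi) + logdet + (z ** 2).sum(axis=0))

def fit_mixture_error_regression(X, Y, K, n_iter=500, tol=1e-8, ridge=1e-8):
    """ECM fit of Y = mu_k + X B + e, e ~ N(0, S_k) with probability pi_k.

    X must NOT contain a constant column: the intercept is absorbed by mu_k.
    """
    n, p = X.shape
    d = Y.shape[1]
    # Start from OLS slopes, then split the residuals along their first
    # principal direction into K equally sized groups (assumed initialisation).
    B = np.linalg.lstsq(X - X.mean(axis=0), Y - Y.mean(axis=0), rcond=None)[0]
    R = Y - X @ B
    Rc = R - R.mean(axis=0)
    score = Rc @ np.linalg.svd(Rc, full_matrices=False)[2][0]
    labels = (np.argsort(np.argsort(score)) * K) // n
    tau = np.eye(K)[labels]                          # hard initial responsibilities
    ll_prev = -np.inf
    for _ in range(n_iter):
        # CM-step 1: weights, component means and covariances, given B.
        R = Y - X @ B
        nk = np.maximum(tau.sum(axis=0), 1e-10)
        pi = nk / n
        mu = (tau.T @ R) / nk[:, None]
        S = []
        for k in range(K):
            diff = R - mu[k]
            S.append((tau[:, k, None] * diff).T @ diff / nk[k] + ridge * np.eye(d))
        # CM-step 2: shared slopes B, given the component parameters.
        # Solves sum_k (S_k^{-1} kron X'W_k X) vec(B) = vec(sum_k X'W_k (Y - mu_k) S_k^{-1}).
        A = np.zeros((p * d, p * d))
        b = np.zeros((p, d))
        for k in range(K):
            S_inv = np.linalg.inv(S[k])
            XtW = X.T * tau[:, k]
            A += np.kron(S_inv, XtW @ X)
            b += XtW @ (Y - mu[k]) @ S_inv
        B = np.linalg.solve(A, b.flatten(order="F")).reshape((p, d), order="F")
        # E-step: posterior component probabilities and log-likelihood.
        R = Y - X @ B
        logp = np.column_stack(
            [np.log(pi[k]) + log_gauss(R, mu[k], S[k]) for k in range(K)]
        )
        m = logp.max(axis=1)
        lse = m + np.log(np.exp(logp - m[:, None]).sum(axis=1))
        ll = lse.sum()
        tau = np.exp(logp - lse[:, None])
        if ll - ll_prev < tol:
            break
        ll_prev = ll
    # Free parameters: weights, component means, slopes, covariances.
    n_par = (K - 1) + K * d + p * d + K * d * (d + 1) // 2
    bic = -2.0 * ll + n_par * np.log(n)              # smaller is better here
    return {"B": B, "pi": pi, "mu": mu, "S": S, "loglik": ll, "bic": bic}

# Synthetic illustration: bimodal (hence non-normal) errors.
rng = np.random.default_rng(1)
n = 400
X = rng.normal(size=(n, 2))
B_true = np.array([[1.0, -1.0], [0.5, 2.0]])
shift = np.where(rng.random(n) < 0.5, 2.0, -2.0)[:, None]
E = shift + 0.5 * rng.normal(size=(n, 2))
Y = 1.0 + X @ B_true + E

fits = {K: fit_mixture_error_regression(X, Y, K) for K in (1, 2, 3)}
best_K = min(fits, key=lambda K: fits[K]["bic"])
```

With one component the fitted model is an ordinary multivariate normal regression, so comparing `fits[1]["bic"]` with the other entries is a direct check of whether a mixture error distribution is worth its extra parameters on a given dataset.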
Cite this article
Soffritti, G., Galimberti, G. Multivariate linear regression with non-normal errors: a solution based on mixture models. Stat Comput 21, 523–536 (2011). https://doi.org/10.1007/s11222-010-9190-3
DOI: https://doi.org/10.1007/s11222-010-9190-3