Abstract
This paper focuses on a privacy paradigm in which researchers are given access to remotely carry out analyses on sensitive data stored behind firewalls. We develop and demonstrate a method for accurate estimation of structural equation models (SEMs) for arbitrarily partitioned data. We show that, under a certain set of assumptions, our method for estimation across these partitions achieves results identical to those from estimation with the full data. We consider two situations: (i) a standard setting with a trusted central server and (ii) a round-robin setting in which none of the parties are fully trusted, and we extend them in two specific ways. First, we formulate our methods specifically for SEMs, which have become increasingly common models in psychology, human development, and the behavioral sciences. Second, our methods work for horizontal, vertical, and complex partitions without needing different routines. In application, this method will increase opportunities for research by allowing SEM estimation without transfer or combination of data. We demonstrate our methods with both simulated and real data examples.
Acknowledgements
This work was supported in part by NSF grants Big Data Social Sciences IGERT DGE-1144860 to Pennsylvania State University, and BCS-0941553 and SES-1534433 to the Department of Statistics, Pennsylvania State University. The work was also in part supported by the National Center for Advancing Translational Sciences, National Institutes of Health, through Grant UL1 TR000127. The content is solely the responsibility of the authors and does not necessarily represent the official views of the NIH.
A Appendix
A.1 RAM Algebra
We briefly present here the method we use to define SEMs and to transform the model parameters into model-implied mean and covariance matrices. These model-implied matrices are then used to calculate log-likelihoods iteratively given the data. Optimizing over these matrices is equivalent to optimizing over the model parameters, which gives us our estimates.
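As a minimal sketch of the likelihood step described above, the following computes the multivariate normal log-likelihood of a data matrix given a mean vector and covariance matrix. The function name `mvn_loglik` and the toy data are assumptions made for illustration; this is not the authors' implementation, and it ignores missing data and the partitioned setting.

```python
import numpy as np

def mvn_loglik(X, mu, sigma):
    """Multivariate normal log-likelihood of the rows of X given (mu, sigma).

    Hypothetical helper for illustration only.
    """
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    diff = X - mu  # center each row at the model-implied mean
    sign, logdet = np.linalg.slogdet(sigma)
    if sign <= 0:
        raise ValueError("sigma must be positive definite")
    # Sum over rows of d_i' sigma^{-1} d_i, without forming the inverse explicitly
    quad = np.sum(diff * np.linalg.solve(sigma, diff.T).T)
    return -0.5 * (n * p * np.log(2.0 * np.pi) + n * logdet + quad)

# Toy data (assumed): two observations on two standardized variables
X = np.array([[0.0, 0.0], [1.0, 1.0]])
ll = mvn_loglik(X, mu=np.zeros(2), sigma=np.eye(2))
```

An optimizer would call a function like this repeatedly, each time with the mean and covariance implied by the current values of the free parameters.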
The SEM path diagram has a one-to-one relationship with the multivariate normal mean and covariance matrices for the manifest variables. We construct this relationship using RAM matrix algebra. For this we define five matrices denoted A, S, F, M, and I. These matrices contain both fixed and free model parameters. The free parameters are to be estimated and change during optimization, while the fixed parameters do not change. In these matrices, free parameters are denoted by a Greek symbol and fixed parameters by a constant number.
Recall the path diagram shown in Fig. 4. For this example model, the RAM algebra proceeds as follows. The A (“asymmetric”) matrix defines all regression parameters, or one-headed arrows, in the path diagram. Its numbers of rows and columns both equal the combined number of latent and manifest variables, with the column designating the path origin and the row designating the destination.

The S (“symmetric”) matrix defines all variance parameters, or two-headed arrows, in the path diagram in the same way as the A matrix.

The F (“filter”) matrix acts as a filter for the manifest variables. Its number of columns equals the combined number of latent and manifest variables, but its number of rows equals only the number of manifest variables. For each manifest variable it has a one on the diagonal.

The M (“mean”) matrix defines the mean parameters, if any, for the latent and manifest variables. These are not always included in path diagrams.

Finally, an I (“identity”) matrix is included, with numbers of columns and rows equal to the combined number of latent and manifest variables.

Using these matrices, we obtain the corresponding model-implied mean vector (\(\mu \)) and covariance matrix (\(\varSigma \)) of the manifest variables based on the chosen parameters. The following equations give this crucial relationship.
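The equations themselves did not survive in this copy of the text. In the standard RAM formulation, the expectations are usually written as \(\varSigma = F(I-A)^{-1}S\,((I-A)^{-1})^{T}F^{T}\) and \(\mu = F(I-A)^{-1}M\), with M taken as a column vector; we assume, without being able to confirm it from the text here, that this is the relationship intended. The sketch below evaluates these expressions for a hypothetical one-factor model with three indicators. The model, its parameter values, and the function name `ram_expectations` are illustrative assumptions and are not the model of Fig. 4.

```python
import numpy as np

def ram_expectations(A, S, F, M):
    """Model-implied mean and covariance under the standard RAM formulas.

    Assumes M is a column-oriented vector of means for all variables.
    """
    I = np.eye(A.shape[0])
    B = np.linalg.inv(I - A)          # (I - A)^{-1}
    sigma = F @ B @ S @ B.T @ F.T     # covariance of the manifest variables
    mu = F @ B @ M                    # means of the manifest variables
    return mu, sigma

# Hypothetical model (assumed): variable order x1, x2, x3, eta (latent)
loadings = np.array([1.0, 0.8, 0.6])
A = np.zeros((4, 4))
A[0:3, 3] = loadings                  # one-headed arrows: eta -> x1, x2, x3
S = np.diag([0.5, 0.5, 0.5, 2.0])     # residual variances, then var(eta)
F = np.hstack([np.eye(3), np.zeros((3, 1))])  # keep only the manifest rows
M = np.array([1.0, 2.0, 3.0, 0.0])    # intercepts, then latent mean

mu, sigma = ram_expectations(A, S, F, M)
```

For this toy model the result reduces to the familiar factor-analytic form: the implied covariance equals the latent variance times the outer product of the loadings, plus the diagonal matrix of residual variances.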
Copyright information
© 2016 Springer International Publishing Switzerland
About this paper
Cite this paper
Snoke, J., Brick, T., Slavković, A. (2016). Accurate Estimation of Structural Equation Models with Remote Partitioned Data. In: Domingo-Ferrer, J., Pejić-Bach, M. (eds) Privacy in Statistical Databases. PSD 2016. Lecture Notes in Computer Science(), vol 9867. Springer, Cham. https://doi.org/10.1007/978-3-319-45381-1_15
DOI: https://doi.org/10.1007/978-3-319-45381-1_15
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-45380-4
Online ISBN: 978-3-319-45381-1
eBook Packages: Computer Science (R0)