Abstract
Testing for conditional independence (CI) is a fundamental task for causal discovery but is particularly challenging in mixed discrete-continuous data. In this context, inadequate assumptions or discretization of continuous variables reduce the CI test’s statistical power, which can yield incorrectly learned causal structures. In this work, we present a non-parametric CI test leveraging k-nearest neighbor (kNN) methods that are adaptive to mixed discrete-continuous data. In particular, a kNN-based conditional mutual information estimator serves as the test statistic, and the p-value is calculated using a kNN-based local permutation scheme. We prove the CI test’s statistical validity and power in mixed discrete-continuous data, which yields consistency when used in constraint-based causal discovery. An extensive evaluation on synthetic and real-world data shows that the proposed CI test outperforms state-of-the-art approaches in the accuracy of CI testing and causal discovery, particularly in settings with low sample sizes.
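The two ingredients named in the abstract, a kNN-based conditional mutual information (CMI) estimate as the test statistic and a kNN-based local permutation scheme for the p-value, can be illustrated with a minimal sketch. This is not the authors' implementation (see the repository linked in the Notes for that). It assumes a brute-force neighbor search under the max-norm, a tie rule for discrete values in the style of published mixed-data kNN estimators, and a simplified permutation step that draws replacement values with repetition. All function names (`cmi_knn`, `ci_test`, and the helpers) and the defaults for `k`, `k_perm`, and `n_perm` are hypothetical.

```python
import random

EULER_GAMMA = 0.5772156649015329


def digamma_int(n):
    """Digamma at a positive integer n: psi(n) = -gamma + sum_{j<n} 1/j."""
    return -EULER_GAMMA + sum(1.0 / j for j in range(1, n))


def max_norm(a, b):
    """Chebyshev (max-norm) distance between two equal-length tuples."""
    return max(abs(u - v) for u, v in zip(a, b))


def count_within(points, i, radius):
    """Number of points j != i whose distance to point i is <= radius."""
    return sum(1 for j, p in enumerate(points)
               if j != i and max_norm(points[i], p) <= radius)


def cmi_knn(x, y, z, k=5):
    """kNN estimate of I(X;Y|Z) in nats; x, y, z are lists of tuples, len > k."""
    n = len(x)
    xz = [a + c for a, c in zip(x, z)]
    yz = [b + c for b, c in zip(y, z)]
    xyz = [a + b + c for a, b, c in zip(x, y, z)]
    total = 0.0
    for i in range(n):
        dists = sorted(max_norm(xyz[i], xyz[j]) for j in range(n) if j != i)
        rho = dists[k - 1]  # distance to the k-th nearest neighbour
        # With ties at distance zero (discrete data), count all tied points.
        k_i = sum(1 for d in dists if d == 0) if rho == 0 else k
        total += (digamma_int(k_i)
                  - digamma_int(count_within(xz, i, rho))
                  - digamma_int(count_within(yz, i, rho))
                  + digamma_int(count_within(z, i, rho)))
    return max(0.0, total / n)  # CMI is non-negative, so clip at zero


def z_neighbourhoods(z, k_perm):
    """For each i, the k_perm indices closest to i in Z (i itself included)."""
    n = len(z)
    return [sorted(range(n), key=lambda j: max_norm(z[i], z[j]))[:k_perm]
            for i in range(n)]


def ci_test(x, y, z, k=5, k_perm=5, n_perm=100, seed=0):
    """Return (CMI estimate, permutation p-value) for H0: X indep. of Y given Z."""
    rng = random.Random(seed)
    observed = cmi_knn(x, y, z, k)
    hoods = z_neighbourhoods(z, k_perm)
    exceed = 0
    for _ in range(n_perm):
        # Simplified local permutation: replace x_i by the X value of a
        # randomly chosen Z-neighbour of i (drawn with repetition).
        x_perm = [x[rng.choice(h)] for h in hoods]
        if cmi_knn(x_perm, y, z, k) >= observed:
            exceed += 1
    return observed, (1 + exceed) / (1 + n_perm)


# Toy mixed data: Z continuous, X binary, Y continuous and driven by X.
rng = random.Random(1)
n = 100
z = [(rng.random(),) for _ in range(n)]
x = [(float(rng.randint(0, 1)),) for _ in range(n)]
y = [(xi[0] + rng.gauss(0.0, 0.1),) for xi in x]
stat, p = ci_test(x, y, z, n_perm=20)
print(f"CMI estimate: {stat:.3f}, p-value: {p:.3f}")
```

Drawing replacement values only from the Z-neighborhood of each sample keeps the dependence of X on Z roughly intact while breaking any remaining link between X and Y, which is what the null hypothesis requires. The brute-force search is quadratic in the sample size, so a spatial index would be the natural replacement for anything beyond toy data.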
Notes
1. Code and Appendix can be found at https://github.com/hpi-epic/mCMIkNN.
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Huegle, J., Hagedorn, C., Schlosser, R. (2023). A KNN-Based Non-Parametric Conditional Independence Test for Mixed Data and Application in Causal Discovery. In: Koutra, D., Plant, C., Gomez Rodriguez, M., Baralis, E., Bonchi, F. (eds) Machine Learning and Knowledge Discovery in Databases: Research Track. ECML PKDD 2023. Lecture Notes in Computer Science, vol 14169. Springer, Cham. https://doi.org/10.1007/978-3-031-43412-9_32
DOI: https://doi.org/10.1007/978-3-031-43412-9_32
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-43411-2
Online ISBN: 978-3-031-43412-9
eBook Packages: Computer Science, Computer Science (R0)