A KNN-Based Non-Parametric Conditional Independence Test for Mixed Data and Application in Causal Discovery

  • Conference paper
Machine Learning and Knowledge Discovery in Databases: Research Track (ECML PKDD 2023)

Part of the book series: Lecture Notes in Computer Science ((LNAI,volume 14169))

Abstract

Testing for Conditional Independence (CI) is a fundamental task for causal discovery but is particularly challenging in mixed discrete-continuous data. In this context, inadequate assumptions or discretization of continuous variables reduce the CI test’s statistical power, which leads to incorrectly learned causal structures. In this work, we present a non-parametric CI test leveraging k-nearest neighbor (kNN) methods that are adaptive to mixed discrete-continuous data. In particular, a kNN-based conditional mutual information estimator serves as the test statistic, and the p-value is calculated using a kNN-based local permutation scheme. We prove the CI test’s statistical validity and power in mixed discrete-continuous data, which yields consistency when used in constraint-based causal discovery. An extensive evaluation on synthetic and real-world data shows that the proposed CI test outperforms state-of-the-art approaches in the accuracy of CI testing and causal discovery, particularly in settings with low sample sizes.
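
To make the two ingredients named in the abstract concrete, the following is a minimal sketch of a kNN-based conditional mutual information (CMI) statistic combined with a kNN-based local permutation p-value. It is not the authors' implementation (their code is linked in the Notes) and may differ from it in detail. All function names and the defaults `k`, `k_perm`, and `n_perm` are assumptions made for illustration, discrete variables are assumed to be numerically encoded, and the brute-force distance computation is only practical for small samples.

```python
import math
import random

EULER_GAMMA = 0.5772156649015329


def digamma_int(n):
    """Digamma at a positive integer: psi(n) = -gamma + sum_{i=1}^{n-1} 1/i."""
    return -EULER_GAMMA + sum(1.0 / i for i in range(1, n))


def chebyshev(a, b):
    """Max-norm distance between two tuples (0.0 for empty tuples)."""
    return max((abs(u - v) for u, v in zip(a, b)), default=0.0)


def cmi_knn(x, y, z, k=5):
    """kNN estimate of I(X;Y|Z); x, y, z are equal-length lists of tuples."""
    n = len(x)
    if n <= k:
        raise ValueError("need more than k samples")
    total = 0.0
    for i in range(n):
        d_z = [chebyshev(z[i], z[j]) for j in range(n)]
        d_xz = [max(chebyshev(x[i], x[j]), d_z[j]) for j in range(n)]
        d_yz = [max(chebyshev(y[i], y[j]), d_z[j]) for j in range(n)]
        d_xyz = [max(a, b) for a, b in zip(d_xz, d_yz)]
        others = [j for j in range(n) if j != i]
        # Distance to the k-th nearest neighbour in the joint space.
        rho = sorted(d_xyz[j] for j in others)[k - 1]
        # Ties at distance zero (discrete points): count all tied points.
        k_i = sum(d_xyz[j] <= rho for j in others) if rho == 0 else k
        n_xz = sum(d_xz[j] <= rho for j in others)
        n_yz = sum(d_yz[j] <= rho for j in others)
        n_z = sum(d_z[j] <= rho for j in others)
        total += (digamma_int(k_i) - digamma_int(n_xz)
                  - digamma_int(n_yz) + digamma_int(n_z))
    return max(total / n, 0.0)  # CMI is non-negative


def z_neighbors(z, k_perm):
    """For each sample, indices of its k_perm nearest samples in Z (incl. itself)."""
    n = len(z)
    return [sorted(range(n), key=lambda j: chebyshev(z[i], z[j]))[:k_perm]
            for i in range(n)]


def local_permutation(x, neighbors, rng):
    """Replace each x_i by the x of a Z-neighbour, avoiding reuse where possible."""
    n = len(x)
    used = set()
    permuted = [None] * n
    for i in rng.sample(range(n), n):
        candidates = neighbors[i][:]
        rng.shuffle(candidates)
        choice = next((j for j in candidates if j not in used), candidates[-1])
        used.add(choice)
        permuted[i] = x[choice]
    return permuted


def ci_test(x, y, z, k=5, k_perm=5, n_perm=100, seed=0):
    """Return (CMI estimate, permutation p-value) for H0: X independent of Y given Z."""
    rng = random.Random(seed)
    observed = cmi_knn(x, y, z, k)
    neighbors = z_neighbors(z, k_perm)
    exceed = sum(
        cmi_knn(local_permutation(x, neighbors, rng), y, z, k) >= observed
        for _ in range(n_perm)
    )
    return observed, (1 + exceed) / (1 + n_perm)
```

Calling `ci_test(x, y, z)` with each sample given as a tuple of floats returns the estimated CMI and a p-value; the null hypothesis of conditional independence would be rejected when the p-value falls below the chosen significance level. Permuting X only among samples that are close in Z is what keeps the dependence between X and Z intact while breaking any remaining dependence between X and Y.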

Notes

  1. Code and Appendix can be found at https://github.com/hpi-epic/mCMIkNN.

Author information

Corresponding author

Correspondence to Johannes Huegle.

Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Huegle, J., Hagedorn, C., Schlosser, R. (2023). A KNN-Based Non-Parametric Conditional Independence Test for Mixed Data and Application in Causal Discovery. In: Koutra, D., Plant, C., Gomez Rodriguez, M., Baralis, E., Bonchi, F. (eds) Machine Learning and Knowledge Discovery in Databases: Research Track. ECML PKDD 2023. Lecture Notes in Computer Science, vol 14169. Springer, Cham. https://doi.org/10.1007/978-3-031-43412-9_32

  • DOI: https://doi.org/10.1007/978-3-031-43412-9_32

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-43411-2

  • Online ISBN: 978-3-031-43412-9

  • eBook Packages: Computer Science (R0)
