Abstract
Classification algorithms typically rely on correlation analysis to make decisions. However, these decisions, and the models behind them, are not easily understood by the typical user. Causal discovery is the field that studies how to find causal relationships in observational data. Although highly interpretable, causal discovery algorithms tend to perform poorly in classification problems. This paper proposes a hybrid decision tree approach (SC tree) that combines causal discovery with correlation analysis through a custom metric used to split the data during the tree's construction (semi-causal gain ratio). In the results, the proposed methodology obtained a significant performance improvement (11.26% mean error rate) over the causal baselines CDT-PS (23.67%) and CDT-SPS (25.14%), and closely matched the performance of J48 (10.20%), used as a correlation baseline, on ten binary data sets. In addition, when compared with PC on discrete data sets, the proposed approach obtained a substantial improvement (16.17% against 28.07% in terms of mean error rate).
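For orientation, the correlation side of such a split criterion can be illustrated with the standard gain ratio used by C4.5-style trees such as J48. The sketch below is only that standard measure (information gain divided by split information) for a single categorical feature; it is not the semi-causal gain ratio, whose definition is not given in the abstract. The function names `entropy` and `gain_ratio` and the toy data are illustrative assumptions.

```python
import math
from collections import Counter


def entropy(values):
    """Shannon entropy (in bits) of a sequence of categorical values."""
    total = len(values)
    return -sum((c / total) * math.log2(c / total) for c in Counter(values).values())


def gain_ratio(feature, labels):
    """Standard gain ratio of one categorical feature w.r.t. the class labels.

    Illustrative only: this is the plain C4.5-style measure, not the
    semi-causal variant proposed in the paper.
    """
    total = len(labels)

    # Group the class labels by the value the feature takes.
    partitions = {}
    for value, label in zip(feature, labels):
        partitions.setdefault(value, []).append(label)

    # Information gain: class entropy minus the weighted entropy after the split.
    conditional = sum(len(p) / total * entropy(p) for p in partitions.values())
    info_gain = entropy(labels) - conditional

    # Split information penalises features with many distinct values.
    split_info = entropy(feature)
    return info_gain / split_info if split_info > 0 else 0.0


# Toy data (hypothetical): the first feature separates the classes perfectly,
# the second carries no information about the class.
labels = [1, 1, 0, 0]
informative = ["a", "a", "b", "b"]
uninformative = ["a", "b", "a", "b"]

print(gain_ratio(informative, labels))
print(gain_ratio(uninformative, labels))
```

A tree builder would evaluate this score for every candidate feature at a node and split on the highest-scoring one; a hybrid criterion would additionally take a causal test into account at that step.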
Notes
“For each of the separate levels of the co-variable set h = 1, 2, ..., q, the response variable is distributed at random with respect to the sub-populations, i.e. the data in the respective rows of the hth table can be regarded as a successive set of simple random samples of sizes {Nhi.} from a fixed population corresponding to the marginal total distribution of the response variable {Nh.j}” [10].
We used the WEKA jar file kindly provided by the authors to compare against our methodology.
We used the WEKA implementation.
References
Agresti, A.: An introduction to categorical data analysis. Wiley, New York (2018)
Birch, M.: The detection of partial association, I: the \(2 \times 2\) case. J. Roy. Stat. Soc.: Ser. B (Methodol.) 26(2), 313–324 (1964)
Cochran, W.G.: Some methods for strengthening the common \(\chi^2\) tests. Biometrics 10(4), 417–451 (1954)
DeFries, R., Agarwala, M., Baquie, S., Choksi, P., Khanwilkar, S., Mondal, P., Nagendra, H., Uperlainen, J.: Improved household living standards can restore dry tropical forests. Biotropica (2021)
Domingos, P.M.: The role of Occam’s razor in knowledge discovery. Data Min. Knowl. Discov. 3(4), 409–425 (1999). https://doi.org/10.1023/A:1009868929893
Glymour, C., Zhang, K., Spirtes, P.: Review of causal discovery methods based on graphical models. Front. Genet. (2019). https://doi.org/10.3389/fgene.2019.00524
Guo, R., Cheng, L., Li, J., Hahn, P.R., Liu, H.: A survey of learning causality with data: problems and methods. ACM Comput. Surv. (2020). https://doi.org/10.1145/3397269
Jin, Z., Li, J., Liu, L., Le, T.D., Sun, B., Wang, R.: Discovery of causal rules using partial association. In: Proceedings IEEE International Conference on Data Mining, ICDM pp. 309–318 (2012). https://doi.org/10.1109/ICDM.2012.36
Kent, J.T.: Information gain and a general measure of correlation. Biometrika 70(1), 163–173 (1983). https://doi.org/10.1093/biomet/70.1.163
Landis, J.R., Heyman, E.R., Koch, G.G.: Average partial association in three-way contingency tables: a review and discussion of alternative tests. Int. Stat. Rev. 46(3), 237 (1978). https://doi.org/10.2307/1402373
Li, F., Gao, L., Ma, X., Yang, X.: Detection of driver pathways using mutated gene network in cancer. Mol. BioSyst. 12, 2135–2141 (2016). https://doi.org/10.1039/C6MB00084C
Li, J., Ma, S., Le, T., Liu, L., Liu, J.: Causal decision trees. IEEE Trans. Knowl. Data Eng. 29(2), 257–271 (2017). https://doi.org/10.1109/TKDE.2016.2619350
Luma-Osmani, S., Ismaili, F., Zenuni, X., Raufi, B.: A systematic literature review in causal association rules mining. In: 2020 11th IEEE Annual Information Technology, Electronics and Mobile Communication Conference (IEMCON), pp. 0048–0054 (2020). https://doi.org/10.1109/IEMCON51383.2020.9284908
Ma, S., Statnikov, A.: Methods for computational causal discovery in biomedicine. Behaviormetrika 44(1), 165–191 (2017). https://doi.org/10.1007/s41237-016-0013-5
Mantas, C.J., Abellán, J.: Credal-C4.5: Decision tree based on imprecise probabilities to classify noisy data. Expert Syst. Appl. 41(10), 4625–4637 (2014). https://doi.org/10.1016/j.eswa.2014.01.017. http://www.sciencedirect.com/science/article/pii/S0957417414000384
Marx, A., Vreeken, J.: Testing conditional independence on discrete data using stochastic complexity. arXiv preprint arXiv:1903.04829 (2019)
Mooij, J.M., Cremers, J., Others: An empirical study of one of the simplest causal prediction algorithms. In: UAI 2015 Workshop on Advances in Causal Inference, 1504, pp. 30–39 (2015)
Pearl, J., Verma, T.S.: A theory of inferred causation. In: Studies in Logic and the Foundations of Mathematics, vol. 134, pp. 789–811. Elsevier (1995)
Piltaver, R., Luštrek, M., Gams, M., Martinšić-Ipšić, S.: What makes classification trees comprehensible? Expert Syst. Appl. 62, 333–346 (2016)
Quinlan, J.R.: Induction of decision trees. Mach. Learn. 1(1), 81–106 (1986). https://doi.org/10.1007/BF00116251
Samothrakis, S., Perez, D., Lucas, S.: Training Gradient Boosting Machines Using Curve-Fitting and Information-Theoretic Features for Causal Direction Detection, pp. 331–338. Springer International Publishing, Cham (2019). https://doi.org/10.1007/978-3-030-21810-2_11
Spirtes, P., Glymour, C.N., Scheines, R., Heckerman, D.: Causation, prediction, and search. MIT Press (2000)
Tangirala, S.: Evaluating the impact of gini index and information gain on classification using decision tree classifier algorithm. Int. J. Adv. Comput. Sci. Appl. 11(2), 612–619 (2020)
Theil, H.: Statistical decomposition analysis; with applications in the social and administrative sciences. Tech. rep. (1972)
Verma, T.S., Pearl, J.: On the equivalence of causal models. arXiv preprint arXiv:1304.1108 (2013)
Yu, K., Li, J., Liu, L.: A Review on Algorithms for Constraint-based Causal Discovery (2016)
Zhang, W., Wang, S.L.: An integrated framework for identifying mutated driver pathway and cancer progression. IEEE/ACM Trans. Comput. Biol. Bioinf. 16(2), 455–464 (2019). https://doi.org/10.1109/TCBB.2017.2788016
Zhang, X., Baral, C., Kim, S.: An algorithm to learn causal relations between genes from steady state data: Simulation and its application to melanoma dataset. In: Miksch, S., Hunter, J., Keravnou, E.T. (eds.) Artificial Intelligence in Medicine, pp. 524–534. Springer, Berlin (2005)
Zhou, Q., Liao, F., Mou, C., Wang, P.: Measuring interpretability for different types of machine learning models. In: Ganji, M., Rashidi, L., Fung, B.C.M., Wang, C. (eds.) Trends and Applications in Knowledge Discovery and Data Mining - PAKDD 2018 Workshops, BDASC, BDM, ML4Cyber, PAISI, DaMEMO, Melbourne, VIC, Australia, June 3, 2018, Revised Selected Papers, Lecture Notes in Computer Science, vol. 11154, pp. 295–308. Springer (2018). https://doi.org/10.1007/978-3-030-04503-6_29
Acknowledgements
This research was carried out in the context of the project FailStopper (DSAIPA/DS/0086/2018) and supported by the Fundação para a Ciência e Tecnologia (FCT), Portugal, through the PhD Grant SFRH/BD/146197/2019.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
About this article
Cite this article
Nogueira, A.R., Ferreira, C.A. & Gama, J. Semi-causal decision trees. Prog Artif Intell 11, 105–119 (2022). https://doi.org/10.1007/s13748-021-00262-2
DOI: https://doi.org/10.1007/s13748-021-00262-2