Abstract
The size of data collected around the world is growing exponentially, and it has become popular as big data. The volume and velocity of big data are facilitating the transition of machine learning (ML), deep learning (DL) and artificial intelligence (AI) from research laboratories to real life. There are numerous other claims made about Big Data. Can we, however, rely on data blindly? What happens when a dataset used to train ML models has a hidden statistical paradox? Data, like fossil fuels, is valuable, but it must be refined carefully for accurate outcomes. Statistical paradoxes are hard to observe in classical data cleaning and analysis techniques. Still, they are required to be investigated separately in training datasets. In this paper, we discuss the impact of Simpson’s paradox on categorical data and demonstrate its effects on AI and ML application scenarios. Next, we provide an algorithm to automatically identify the confounding variable and detect Simpson’s paradox within categorical datasets. The algorithm experiments on datasets from two real-world case studies. The outcome of the algorithm uncovers the existence of the paradox and indicates that Simpson’s paradox is severely harmful in automatic data analysis, especially in AI, ML and DL.
Access this chapter
Tax calculation will be finalised at checkout
Purchases are for personal use only
Similar content being viewed by others
References
Agrawal, R., Srikant, R.: Fast algorithms for mining association rules in large databases. In: Proceedings of VLDB’1994 - the 20th International Conference on Very Large Data Bases, pp. 487–499. Morgan Kaufmann (1994)
Alipourfard, N., Fennell, P.G., Lerman, K.: Can you trust the trend? Discovering Simpson’s paradoxes in social data. In: Proceedings of the Eleventh ACM International Conference on Web Search and Data Mining, WSDM 2018, pp. 19–27. Association for Computing Machinery, New York (2018). https://doi.org/10.1145/3159652.3159684
Alipourfard, N., Fennell, P.G., Lerman, K.: Using Simpson’s paradox to discover interesting patterns in behavioral data. In: Proceedings of the Twelfth International AAAI Conference on Web and Social Media. AAAI Publications (2018)
Bickel, P.J., Hammel, E.A., O’Connell, J.W.: Sex bias in graduate admissions: data from Berkeley. Science 187(4175), 398–404 (1975). https://doi.org/10.1126/science.187.4175.398
Blyth, C.R.: On Simpson’s paradox and the sure-thing principle. J. Am. Stat. Assoc. 67(338), 364–366 (1972)
Cattell, R.B.: P-technique factorization and the determination of individual dynamic structure. J. Clin. Psychol. 8, 5–10 (1952)
Charig, C.R., Webb, D.R., Payne, S.R., Wickham, J.E.: Comparison of treatment of renal calculi by open surgery, percutaneous nephrolithotomy, and extracorporeal shockwave lithotripsy. BMJ 292(6524), 879–882 (1986). https://doi.org/10.1136/bmj.292.6524.879
Conger, A.J.: A revised definition for suppressor variables: a guide to their identification and interpretation. Educ. Psychol. Meas. 34(1), 35–46 (1974)
Dawid, A.P.: Conditional independence in statistical theory. J. Roy. Stat. Soc. Ser. B (Methodol.) 41(1), 1–15 (1979). https://doi.org/10.1111/j.2517-6161.1979.tb01052.x
Draheim, D.: DEXA’2019 keynote presentation: future perspectives of association rule mining based on partial conditionalization, Linz, Austria, August 2019. https://doi.org/10.13140/RG.2.2.17763.48163
Draheim, D.: Future perspectives of association rule mining based on partial conditionalization. In: Hartmann, S., Küng, J., Chakravarthy, S., Anderst-Kotsis, G., Tjoa, A.M., Khalil, I. (eds.) Proceedings of DEXA’2019 - the 30th International Conference on Database and Expert Systems Applications. LNCS, vol. 11706, p. xvi. Springer, Heidelberg (2019)
Fisher, R.A.: III. The influence of rainfall on the yield of wheat at Rothamsted. Philos. Trans. R. Soc. London Ser. B 213(402–410), 89–142 (1925). Containing Papers of a Biological Character
Freitas, A.A., McGarry, K.J., Correa, E.S.: Integrating Bayesian networks and Simpson’s paradox in data mining. In: Texts in Philosophy. College Publications (2007)
Kaushik, M., Sharma, R., Peious, S.A., Draheim, D.: Impact-driven discretization of numerical factors: case of two- and three-partitioning. In: Srirama, S.N., Lin, J.C.-W., Bhatnagar, R., Agarwal, S., Reddy, P.K. (eds.) BDA 2021. LNCS, vol. 13147, pp. 244–260. Springer, Cham (2021). https://doi.org/10.1007/978-3-030-93620-4_18
Kaushik, M., Sharma, R., Peious, S.A., Shahin, M., Ben Yahia, S., Draheim, D.: On the potential of numerical association rule mining. In: Dang, T.K., Küng, J., Takizawa, M., Chung, T.M. (eds.) FDSE 2020. CCIS, vol. 1306, pp. 3–20. Springer, Singapore (2020). https://doi.org/10.1007/978-981-33-4370-2_1
Kaushik, M., Sharma, R., Peious, S.A., Shahin, M., Yahia, S.B., Draheim, D.: A systematic assessment of numerical association rule mining methods. SN Comput. Sci. 2(5), 1–13 (2021). https://doi.org/10.1007/s42979-021-00725-2
Kievit, R., Frankenhuis, W., Waldorp, L., Borsboom, D.: Simpson’s paradox in psychological science: a practical guide. Front. Psychol. 4, 513 (2013). https://doi.org/10.3389/fpsyg.2013.00513
Kim, Y.: The 9 pitfalls of data science. Am. Stat. 74(3), 307 (2020). https://doi.org/10.1080/00031305.2020.1790216
King, G., Roberts, M.: EI: A(n R) program for ecological inference. Harvard University (2012)
Ma, H.Y., Lin, D.K.J.: Effect of Simpson’s paradox on market basket analysis. J. Chin. Stat. Assoc. 42(2), 209–221 (2004). https://doi.org/10.29973/JCSA.200406.0007
MacKinnon, D.P., Fairchild, A.J., Fritz, M.S.: Mediation analysis. Ann. Rev. Psychol. 58(1), 593–614 (2007). https://doi.org/10.1146/annurev.psych.58.110405.085542. pMID: 16968208
Pearl, J.: Causal inference without counterfactuals: comment. J. Am. Stat. Assoc. 95(450), 428–431 (2000)
Pearl, J.: Understanding Simpson’s paradox. SSRN Electron. J. 68 (2013). https://doi.org/10.2139/ssrn.2343788
Pearson Karl, L.A., Leslie, B.M.: Genetic (reproductive) selection: inheritance of fertility in man, and of fecundity in thoroughbred racehorses. Philos. Trans. R. Soc. Lond. Ser. A 192, 257–330 (1899)
Quinlan, J.: Combining instance-based and model-based learning. In: Machine Learning Proceedings 1993, pp. 236–243. Elsevier (1993). https://doi.org/10.1016/B978-1-55860-307-3.50037-X
Robinson, W.S.: Ecological correlations and the behavior of individuals. Am. Sociol. Rev. 15(3), 351–357 (1950)
Rosenbaum, P.R., Rubin, D.B.: The central role of the propensity score in observational studies for causal effects. Biometrika 70(1), 41–55 (1983)
Sharma, R., Peious, S.A.: Towards unification of decision support technologies: statistical reasoning. OLAP and Association Rule Mining. https://github.com/rahulgla/unification
Simpson, E.H.: The interpretation of interaction in contingency tables. J. Roy. Stat. Soc.: Ser. B (Methodol.) 13(2), 238–241 (1951)
Sprenger, J., Weinberger, N.: Simpson’s paradox. In: Zalta, E.N. (ed.) The Stanford Encyclopedia of Philosophy, Summer 2021 edn. Metaphysics Research Lab, Stanford University (2021)
Srikant, R., Agrawal, R.: Mining quantitative association rules in large relational tables. In: Proceedings of the 1996 ACM SIGMOD International Conference on Management of Data, pp. 1–12 (1996)
Tu, Y.K., Gunnell, D., Gilthorpe, M.S.: Simpson’s Paradox, Lord’s Paradox, and Suppression Effects are the same phenomenon-the reversal paradox. Emerg. Themes Epidemiol. 5(1), 1–9 (2008)
Von Kugelgen, J., Gresele, L., Scholkopf, B.: Simpson’s paradox in COVID-19 case fatality rates: a mediation analysis of age-related causal effects. IEEE Trans. Artif. Intell. 2(1), 18–27 (2021). https://doi.org/10.1109/tai.2021.3073088
Xu, C., Brown, S.M., Grant, C.: Detecting Simpson’s paradox. In: The Thirty-First International Flairs Conference (2018)
Yule, G.U.: Notes on the theory of association of attributes in statistics. Biometrika 2(2), 121–134 (1903)
Acknowledgements
This work has been partially conducted in the project “ICT programme” which was supported by the European Union through the European Social Fund.
Author information
Authors and Affiliations
Corresponding author
Editor information
Editors and Affiliations
Rights and permissions
Copyright information
© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Sharma, R., Garayev, H., Kaushik, M., Peious, S.A., Tiwari, P., Draheim, D. (2022). Detecting Simpson’s Paradox: A Machine Learning Perspective. In: Strauss, C., Cuzzocrea, A., Kotsis, G., Tjoa, A.M., Khalil, I. (eds) Database and Expert Systems Applications. DEXA 2022. Lecture Notes in Computer Science, vol 13426. Springer, Cham. https://doi.org/10.1007/978-3-031-12423-5_25
Download citation
DOI: https://doi.org/10.1007/978-3-031-12423-5_25
Published:
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-12422-8
Online ISBN: 978-3-031-12423-5
eBook Packages: Computer ScienceComputer Science (R0)