Detecting Simpson’s Paradox: A Machine Learning Perspective

  • Conference paper
Database and Expert Systems Applications (DEXA 2022)

Abstract

The volume of data collected around the world is growing exponentially, a phenomenon popularly known as big data. The volume and velocity of big data are facilitating the transition of machine learning (ML), deep learning (DL) and artificial intelligence (AI) from research laboratories to real life. Numerous other claims are made about big data. Can we, however, rely on data blindly? What happens when a dataset used to train ML models contains a hidden statistical paradox? Data, like fossil fuels, is valuable, but it must be refined carefully to yield accurate outcomes. Statistical paradoxes are hard to detect with classical data cleaning and analysis techniques, so training datasets need to be checked for them separately. In this paper, we discuss the impact of Simpson’s paradox on categorical data and demonstrate its effects in AI and ML application scenarios. Next, we provide an algorithm that automatically identifies the confounding variable and detects Simpson’s paradox in categorical datasets. We evaluate the algorithm on datasets from two real-world case studies. The results reveal the paradox in these datasets and indicate that Simpson’s paradox is severely harmful to automatic data analysis, especially in AI, ML and DL.
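To make the idea of detecting a reversal concrete, the following is a minimal sketch and not the algorithm from the paper, whose details are not given in the abstract. It assumes aggregated records with a binary treatment, success counts, and a list of candidate categorical variables; it flags a candidate when the pooled comparison between the two treatments has one sign and every stratum of that candidate has the opposite sign. The function names (`rates`, `find_reversals`) and the record layout are hypothetical. The counts are the widely reproduced kidney-stone treatment figures (Charig et al., 1986), chosen here only as an illustration; the abstract does not say which case studies the paper uses.

```python
from collections import defaultdict


def rates(records, treatment, by=None):
    """Success rate per treatment level, optionally stratified by column `by`.

    Returns {stratum: {treatment_level: rate}}; the pooled result uses stratum None.
    """
    totals = defaultdict(lambda: [0, 0])  # (stratum, level) -> [successes, trials]
    for r in records:
        key = (r[by] if by is not None else None, r[treatment])
        totals[key][0] += r["successes"]
        totals[key][1] += r["trials"]
    out = defaultdict(dict)
    for (stratum, level), (s, n) in totals.items():
        out[stratum][level] = s / n
    return dict(out)


def sign(x):
    return (x > 0) - (x < 0)


def find_reversals(records, treatment, levels, candidates):
    """Return the candidate columns whose strata all reverse the pooled trend."""
    a, b = levels
    pooled = rates(records, treatment)[None]
    pooled_sign = sign(pooled[a] - pooled[b])
    flagged = []
    for column in candidates:
        strata = rates(records, treatment, by=column)
        # Only compare strata in which both treatment levels were observed.
        signs = {sign(r[a] - r[b]) for r in strata.values() if a in r and b in r}
        if pooled_sign != 0 and signs == {-pooled_sign}:
            flagged.append(column)
    return flagged


# Illustrative data: treatment A vs. B, stratified by kidney-stone size.
records = [
    {"treatment": "A", "stone_size": "small", "successes": 81, "trials": 87},
    {"treatment": "A", "stone_size": "large", "successes": 192, "trials": 263},
    {"treatment": "B", "stone_size": "small", "successes": 234, "trials": 270},
    {"treatment": "B", "stone_size": "large", "successes": 55, "trials": 80},
]

print(find_reversals(records, "treatment", ("A", "B"), ["stone_size"]))
# prints ['stone_size']: B looks better pooled (289/350 vs. 273/350),
# yet A is better for both small and large stones.
```

A flagged column is only a candidate confounder: a sign reversal under stratification shows that the pooled trend is unreliable, but deciding which view is the right one still requires domain or causal reasoning.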

Acknowledgements

This work was partially conducted within the project “ICT programme”, which was supported by the European Union through the European Social Fund.

Corresponding author

Correspondence to Rahul Sharma.

Copyright information

© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Sharma, R., Garayev, H., Kaushik, M., Peious, S.A., Tiwari, P., Draheim, D. (2022). Detecting Simpson’s Paradox: A Machine Learning Perspective. In: Strauss, C., Cuzzocrea, A., Kotsis, G., Tjoa, A.M., Khalil, I. (eds) Database and Expert Systems Applications. DEXA 2022. Lecture Notes in Computer Science, vol 13426. Springer, Cham. https://doi.org/10.1007/978-3-031-12423-5_25

  • DOI: https://doi.org/10.1007/978-3-031-12423-5_25

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-12422-8

  • Online ISBN: 978-3-031-12423-5

  • eBook Packages: Computer Science, Computer Science (R0)
