Detecting Simpson’s Paradox: A Machine Learning Perspective

Sharma, Rahul; Garayev, Huseyn; Kaushik, Minakshi; Peious, Sijo Arakkal; Tiwari, Prayag; Draheim, Dirk

doi:10.1007/978-3-031-12423-5_25

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 13426)

Included in the following conference series:

International Conference on Database and Expert Systems Applications

Abstract

The volume of data collected around the world is growing exponentially, a phenomenon commonly referred to as big data. The volume and velocity of big data are facilitating the transition of machine learning (ML), deep learning (DL) and artificial intelligence (AI) from research laboratories to real life, and numerous other claims are made about big data. Can we, however, rely on data blindly? What happens when a dataset used to train ML models contains a hidden statistical paradox? Data, like fossil fuels, is valuable, but it must be refined carefully to yield accurate outcomes. Statistical paradoxes are hard to detect with classical data cleaning and analysis techniques, so training datasets need to be checked for them separately. In this paper, we discuss the impact of Simpson’s paradox on categorical data and demonstrate its effects in AI and ML application scenarios. We then provide an algorithm that automatically identifies the confounding variable and detects Simpson’s paradox in categorical datasets. We evaluate the algorithm on datasets from two real-world case studies. The results uncover the existence of the paradox in these datasets and indicate that Simpson’s paradox is severely harmful in automatic data analysis, especially in AI, ML and DL.
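
To make the idea of detecting a reversal in categorical data concrete, the following is a minimal Python sketch. It is not the algorithm from the paper, whose details are not given in this abstract; it is a simple brute-force check written under stated assumptions. The function names (`success_rates`, `detect_reversal`), the row layout with a `count` field, and the choice of example are all illustrative. The counts are the widely quoted kidney stone treatment figures from Charig et al. (BMJ, 1986), where treatment A is open surgery and treatment B is percutaneous nephrolithotomy.

```python
from collections import defaultdict


def success_rates(rows, group_key, outcome_key):
    """Return {group value: success rate} over weighted rows."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for row in rows:
        totals[row[group_key]] += row["count"]
        if row[outcome_key]:
            hits[row[group_key]] += row["count"]
    return {g: hits[g] / totals[g] for g in totals}


def sign(x):
    """Return -1, 0 or 1 depending on the sign of x."""
    return (x > 0) - (x < 0)


def detect_reversal(rows, treatment, outcome, candidates, a, b):
    """Return the candidate variables for which every stratum shows the
    opposite trend (a versus b) to the aggregated data."""
    overall = success_rates(rows, treatment, outcome)
    overall_sign = sign(overall[a] - overall[b])
    flagged = []
    for var in candidates:
        # Split the rows into strata by the candidate variable.
        strata = defaultdict(list)
        for row in rows:
            strata[row[var]].append(row)
        signs = []
        for subset in strata.values():
            rates = success_rates(subset, treatment, outcome)
            if a in rates and b in rates:
                signs.append(sign(rates[a] - rates[b]))
        # Flag the variable only if all strata reverse the aggregate trend.
        if overall_sign != 0 and signs and all(s == -overall_sign for s in signs):
            flagged.append(var)
    return flagged


# Kidney stone data as weighted rows (assumed layout, one row per cell).
rows = [
    {"treatment": "A", "stone": "small", "success": True, "count": 81},
    {"treatment": "A", "stone": "small", "success": False, "count": 6},
    {"treatment": "A", "stone": "large", "success": True, "count": 192},
    {"treatment": "A", "stone": "large", "success": False, "count": 71},
    {"treatment": "B", "stone": "small", "success": True, "count": 234},
    {"treatment": "B", "stone": "small", "success": False, "count": 36},
    {"treatment": "B", "stone": "large", "success": True, "count": 55},
    {"treatment": "B", "stone": "large", "success": False, "count": 25},
]

overall = success_rates(rows, "treatment", "success")
flagged = detect_reversal(rows, "treatment", "success", ["stone"], "A", "B")
print(overall, flagged)
```

In this example treatment B has the higher success rate in the aggregated data (289/350 against 273/350), while treatment A has the higher rate within both the small-stone and the large-stone stratum, so `stone` is flagged as a variable that reverses the trend. A sign reversal alone does not establish that the flagged variable is a confounder in the causal sense; that still requires domain knowledge.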

Acknowledgements

This work has been partially conducted in the project “ICT programme” which was supported by the European Union through the European Social Fund.

Author information

Authors and Affiliations

Information Systems Group, Tallinn University of Technology, Akadeemia tee 15a, 12618, Tallinn, Estonia
Rahul Sharma, Minakshi Kaushik, Sijo Arakkal Peious & Dirk Draheim
University of Tartu, Tartu, Estonia
Huseyn Garayev
Department of Computer Science, Aalto University, Espoo, Finland
Prayag Tiwari

Corresponding author

Correspondence to Rahul Sharma.

Editor information

Editors and Affiliations

University of Vienna, Vienna, Austria
Christine Strauss
University of Calabria, Rende, Italy
Alfredo Cuzzocrea
Johannes Kepler University of Linz, Linz, Austria
Gabriele Kotsis
Vienna University of Technology, Vienna, Austria
A Min Tjoa
Johannes Kepler University of Linz, Linz, Austria
Ismail Khalil

About this paper

Cite this paper

Sharma, R., Garayev, H., Kaushik, M., Peious, S.A., Tiwari, P., Draheim, D. (2022). Detecting Simpson’s Paradox: A Machine Learning Perspective. In: Strauss, C., Cuzzocrea, A., Kotsis, G., Tjoa, A.M., Khalil, I. (eds) Database and Expert Systems Applications. DEXA 2022. Lecture Notes in Computer Science, vol 13426. Springer, Cham. https://doi.org/10.1007/978-3-031-12423-5_25

DOI: https://doi.org/10.1007/978-3-031-12423-5_25
Published: 29 July 2022
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-12422-8
Online ISBN: 978-3-031-12423-5
eBook Packages: Computer Science, Computer Science (R0)
