Why Not to Trust Big Data: Discussing Statistical Paradoxes

  • Conference paper
Database Systems for Advanced Applications. DASFAA 2022 International Workshops (DASFAA 2022)

Abstract

Big data is said to drive the growth of businesses; data is called the new money and big data the fuel of the twenty-first century, and many similar claims are made about big data. But can we rely on big data blindly? What happens if the training data set of a machine learning module is flawed and contains a statistical paradox? Data, like fossil fuel, is valuable, but it must be refined carefully to give the best results. Statistical paradoxes are difficult to detect in datasets, yet it is important to analyse them in every dataset, small or big. In this paper, we discuss the role of statistical paradoxes in big data. In particular, we discuss the impact of Berkson’s paradox and Simpson’s paradox on different types of data and demonstrate how they affect big data. We show that statistical paradoxes are common across a variety of data and that they lead to wrong conclusions, potentially with harmful consequences. Experiments on two real-world datasets and a case study indicate that statistical paradoxes are severely harmful to big data and automatic data analysis techniques.
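To make the two paradoxes named in the abstract concrete, the following is a minimal sketch in Python. It is not taken from the paper: the counts, the toy population, and the helper names `simpson_reversal` and `berkson_gap` are all hypothetical and chosen only for illustration. The first part shows a trend that holds in every subgroup but reverses once the subgroups are pooled (Simpson’s paradox); the second shows two independent traits that appear negatively associated once the sample is restricted to individuals having at least one of them (Berkson’s paradox).

```python
from fractions import Fraction

# Hypothetical (successes, trials) counts, invented purely for illustration.
counts = {
    "small": {"A": (90, 100), "B": (800, 1000)},
    "large": {"A": (300, 1000), "B": (20, 100)},
}


def rate(successes, trials):
    """Exact success rate as a fraction."""
    return Fraction(successes, trials)


def simpson_reversal(counts):
    """Return True if A beats B in every stratum but loses after pooling."""
    a_wins_each = all(rate(*g["A"]) > rate(*g["B"]) for g in counts.values())
    pooled = {
        arm: rate(
            sum(g[arm][0] for g in counts.values()),
            sum(g[arm][1] for g in counts.values()),
        )
        for arm in ("A", "B")
    }
    return a_wins_each and pooled["A"] < pooled["B"]


def berkson_gap(population, selected):
    """P(y=1 | x=1) - P(y=1 | x=0) among rows that pass the `selected` filter."""
    rows = [r for r in population if selected(r)]

    def p_y_given(x):
        ys = [y for (xi, y) in rows if xi == x]
        return Fraction(sum(ys), len(ys))

    return p_y_given(1) - p_y_given(0)


# Toy population: 100 individuals per (x, y) combination,
# so x and y are independent by construction.
population = [(x, y) for x in (0, 1) for y in (0, 1) for _ in range(100)]

if __name__ == "__main__":
    # Simpson: A is better in both strata, yet B is better overall.
    print(simpson_reversal(counts))
    # Berkson: no association in the full population ...
    print(berkson_gap(population, lambda r: True))
    # ... but a negative one after selecting rows with x == 1 or y == 1.
    print(berkson_gap(population, lambda r: r[0] == 1 or r[1] == 1))
```

In this sketch the reversal arises because the two arms are distributed very unevenly across the strata, and the spurious association arises purely from the selection rule; neither effect requires noisy or erroneous records, which is why such patterns are hard to spot by inspecting the data alone.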

This work has been partially conducted within the project “ICT programme”, which was supported by the European Union through the European Social Fund.

Author information

Corresponding author

Correspondence to Rahul Sharma.

Copyright information

© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Sharma, R. et al. (2022). Why Not to Trust Big Data: Discussing Statistical Paradoxes. In: Rage, U.K., Goyal, V., Reddy, P.K. (eds) Database Systems for Advanced Applications. DASFAA 2022 International Workshops. DASFAA 2022. Lecture Notes in Computer Science, vol 13248. Springer, Cham. https://doi.org/10.1007/978-3-031-11217-1_4

  • DOI: https://doi.org/10.1007/978-3-031-11217-1_4

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-11216-4

  • Online ISBN: 978-3-031-11217-1

  • eBook Packages: Computer Science, Computer Science (R0)