Abstract
Big data is driving the growth of businesses; data is money; big data is the fuel of the twenty-first century: these are among the many claims made about big data. But can we rely on big data blindly? What happens if the training data set of a machine learning module is incorrect and contains a statistical paradox? Data, like fossil fuel, is valuable, but it must be refined carefully to give the best results. Statistical paradoxes are difficult to observe in datasets, yet they are important to analyse in every dataset, small or big. In this paper, we discuss the role of statistical paradoxes in big data. In particular, we discuss the impact of Berkson’s paradox and Simpson’s paradox on different types of data and demonstrate how they affect big data. We show that statistical paradoxes are common in a variety of data and that they lead to wrong conclusions, potentially with harmful consequences. Experiments on two real-world datasets and a case study indicate that statistical paradoxes are severely harmful to big data and to automatic data analysis techniques.
This work has been partially conducted in the project “ICT programme” which was supported by the European Union through the European Social Fund.
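To make the two paradoxes named in the abstract concrete, the following minimal Python sketch may help. It is an illustration under stated assumptions, not the method or data used in the paper: all counts and the score grid are invented for the example, and the helper names (`pooled_rates`, `is_reversal`, `pearson`) are hypothetical. The first part shows a Simpson-style reversal, where option A has the higher success rate in every stratum yet the lower rate once the strata are pooled. The second part shows a Berkson-style selection effect, where two attributes that are uncorrelated in the full population become negatively correlated once only records with a high combined score are kept.

```python
# Illustrative sketch only: every number below is invented for the example
# and is NOT taken from the paper's datasets.

# --- Simpson's paradox: (stratum, option) -> (successes, trials) ---
counts = {
    ("X", "A"): (18, 20),
    ("X", "B"): (64, 80),
    ("Y", "A"): (32, 80),
    ("Y", "B"): (6, 20),
}


def stratum_rates(table):
    """Success rate of each option inside each stratum."""
    return {key: s / n for key, (s, n) in table.items()}


def pooled_rates(table):
    """Success rate of each option after ignoring the stratum."""
    totals = {}
    for (_, option), (s, n) in table.items():
        prev_s, prev_n = totals.get(option, (0, 0))
        totals[option] = (prev_s + s, prev_n + n)
    return {option: s / n for option, (s, n) in totals.items()}


def is_reversal(table, a="A", b="B"):
    """True if `a` beats `b` in every stratum but loses after pooling."""
    per = stratum_rates(table)
    strata = {stratum for stratum, _ in table}
    a_wins_everywhere = all(per[(s, a)] > per[(s, b)] for s in strata)
    pooled = pooled_rates(table)
    return a_wins_everywhere and pooled[a] < pooled[b]


# --- Berkson's paradox: selection on a combined score ---
def pearson(pairs):
    """Pearson correlation coefficient of a list of (x, y) pairs."""
    n = len(pairs)
    mx = sum(x for x, _ in pairs) / n
    my = sum(y for _, y in pairs) / n
    cov = sum((x - mx) * (y - my) for x, y in pairs)
    vx = sum((x - mx) ** 2 for x, _ in pairs)
    vy = sum((y - my) ** 2 for _, y in pairs)
    return cov / (vx * vy) ** 0.5


# Full population: every combination of two scores in 0..9, so the two
# scores are uncorrelated by construction.
population = [(x, y) for x in range(10) for y in range(10)]
# Selected sample: only records whose combined score reaches a threshold
# (the threshold of 10 is an arbitrary choice for this example).
selected = [(x, y) for x, y in population if x + y >= 10]

by_stratum = stratum_rates(counts)
overall = pooled_rates(counts)
reversal = is_reversal(counts)
corr_population = pearson(population)
corr_selected = pearson(selected)

print("per-stratum rates:", by_stratum)
print("pooled rates:", overall)
print("Simpson reversal:", reversal)
print("correlation, full population:", corr_population)
print("correlation, selected sample:", corr_selected)
```

In this toy setting, an analysis that looks only at the pooled rates, or only at the selected sample, would draw the opposite conclusion from one that also inspects the strata or the full population. This is the kind of wrong conclusion the abstract warns about.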
Copyright information
© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Sharma, R. et al. (2022). Why Not to Trust Big Data: Discussing Statistical Paradoxes. In: Rage, U.K., Goyal, V., Reddy, P.K. (eds) Database Systems for Advanced Applications. DASFAA 2022 International Workshops. DASFAA 2022. Lecture Notes in Computer Science, vol 13248. Springer, Cham. https://doi.org/10.1007/978-3-031-11217-1_4
DOI: https://doi.org/10.1007/978-3-031-11217-1_4
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-11216-4
Online ISBN: 978-3-031-11217-1
eBook Packages: Computer Science, Computer Science (R0)