What can Venn diagrams teach us about doing data science better?

  • Regular Paper
International Journal of Data Science and Analytics

Abstract

Data science is about deriving insight, learning and understanding from data. This process may be automated through advanced algorithms or scaffolded cognitively through graphs. While much emphasis is currently placed on machine learning, there is still much to learn about the role of the data scientist, in particular the thinking process by which they reach conclusions. This thinking process needs scaffolding because the human brain is easily overwhelmed by many variables. Graphs are a form of data abstraction and an essential part of the data scientist’s toolkit; they are also a viable scaffold on which the data scientist may gain familiarity with data. But extracting insight from graphs is not always trivial or straightforward; it also requires interpretative logic. Generalizing from the example of a simple graph type, the Venn diagram, we discuss various logical fallacies that can be committed when interpreting it. Among the various considerations that dictate how a graph should be tackled, we explain why context is the most important and should form the first guiding principle during data analysis.
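The point about context can be made concrete with a small sketch. It is an illustration only and is not taken from the article: the set contents, the helper names `venn_regions` and `jaccard`, and the choice of the Jaccard index as a size-aware measure are all assumptions. It shows two hypothetical comparisons whose Venn diagrams have the same overlap count, yet the overlap means something quite different once the sizes of the sets are taken into account.

```python
def venn_regions(a, b):
    """Sizes of the three regions of a two-set Venn diagram (hypothetical helper)."""
    a, b = set(a), set(b)
    return {"only_a": len(a - b), "both": len(a & b), "only_b": len(b - a)}


def jaccard(a, b):
    """Overlap relative to the union of both sets; 0.0 if both sets are empty."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Made-up data: both pairs share exactly 10 items.
small_a, small_b = set(range(0, 15)), set(range(5, 20))        # 15 items each
large_a, large_b = set(range(0, 505)), set(range(495, 1000))   # 505 items each

small = venn_regions(small_a, small_b)
large = venn_regions(large_a, large_b)

# The intersection region carries the same number in both diagrams ...
print(small["both"], large["both"])

# ... but relative to the sets involved, the agreement differs greatly.
print(jaccard(small_a, small_b), jaccard(large_a, large_b))
```

Reading only the number in the intersection would treat the two cases as equivalent. Whether ten shared items is a lot or a little depends on what the sets are and how large they could have been, which is the kind of contextual information a diagram alone does not supply.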


Availability of data and material

Not applicable.

Abbreviations

DS: Data science

AI: Artificial intelligence

ML: Machine learning


Funding

WWBG and CCS gratefully acknowledge support from the Accelerating Creativity and Excellence (ACE) and EdeX grants from Nanyang Technological University, Singapore. WWBG also acknowledges an NRF-NSFC grant (Grant No. NRF2018NRF-NSFC003SB-006). LW acknowledges support from a Kwan Im Thong Hood Cho Temple Chair Professorship and the National Research Foundation Singapore under its AI Singapore Programme (Grant No. AISG-100E-2019-027 and AISG-100E-2019-028).

Author information

Contributions

HSY contributed to the initial drafting of the manuscript and the development of the figures. ST and CCS provided insight on the education and scientific logic implications. LW and WWBG initiated the project and supervised. All authors contributed towards writing.

Corresponding authors

Correspondence to Limsoon Wong or Wilson Wen Bin Goh.

Ethics declarations

Conflicts of interest

The authors have declared no conflicts of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Ho, S.Y., Tan, S., Sze, C.C. et al. What can Venn diagrams teach us about doing data science better? Int J Data Sci Anal 11, 1–10 (2021). https://doi.org/10.1007/s41060-020-00230-4

  • DOI: https://doi.org/10.1007/s41060-020-00230-4
