What can Venn diagrams teach us about doing data science better?

Ho, Sung Yang; Tan, Sophia; Sze, Chun Chau; Wong, Limsoon; Goh, Wilson Wen Bin

doi:10.1007/s41060-020-00230-4

Regular Paper
Published: 02 August 2020

Volume 11, pages 1–10, (2021)

International Journal of Data Science and Analytics

Abstract

Data science is about deriving insight, learning and understanding from data. This process may be automated through advanced algorithms or scaffolded cognitively through graphs. While much emphasis is currently placed on machine learning, there is still much to learn about the role of the data scientist, in particular the thinking process by which they reach conclusions. This thinking process needs to be scaffolded because the human brain is easily overwhelmed by many variables. Graphs are a form of data abstraction and constitute an essential part of the data scientist’s toolkit. They are also a viable scaffold on which the data scientist may gain familiarity with data. But extracting insight from graphs is not always straightforward; it requires interpretative logic as well. Generalizing from the example of a simple graph type, the Venn diagram, we discuss various logical fallacies that can be committed when interpreting a Venn diagram. Amidst the various considerations that dictate how a graph should be tackled, we explain why context is the most important and should form the first guiding principle during data analysis.

Availability of data and material

Not applicable.

Abbreviations

DS: Data science
AI: Artificial intelligence
ML: Machine learning

Funding

WWBG and CCS gratefully acknowledge support from the Accelerating Creativity and Excellence (ACE) and EdeX grants from Nanyang Technological University, Singapore. WWBG also acknowledges an NRF-NSFC grant (Grant No. NRF2018NRF-NSFC003SB-006). LW acknowledges support from a Kwan Im Thong Hood Cho Temple Chair Professorship and the National Research Foundation Singapore under its AI Singapore Programme (Grant No. AISG-100E-2019-027 and AISG-100E-2019-028).

Author information

Authors and Affiliations

School of Biological Sciences, Nanyang Technological University, Singapore, 637551, Singapore
Sung Yang Ho, Chun Chau Sze & Wilson Wen Bin Goh
Teaching Learning and Pedagogy Division, Nanyang Technological University, Singapore, 637551, Singapore
Sophia Tan
Department of Computer Science, National University of Singapore, Singapore, 117417, Singapore
Limsoon Wong

Contributions

HSY contributed to the initial drafting of the manuscript and the development of the figures. ST and CCS provided insight on the education and scientific logic implications. LW and WWBG initiated the project and supervised. All authors contributed towards writing.

Corresponding authors

Correspondence to Limsoon Wong or Wilson Wen Bin Goh.

Ethics declarations

Conflicts of interest

The authors have declared no conflicts of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Ho, S.Y., Tan, S., Sze, C.C. et al. What can Venn diagrams teach us about doing data science better? Int J Data Sci Anal 11, 1–10 (2021). https://doi.org/10.1007/s41060-020-00230-4

Received: 07 April 2020
Accepted: 16 July 2020
Published: 02 August 2020
Issue Date: January 2021
DOI: https://doi.org/10.1007/s41060-020-00230-4
