Abstract
Data science is about deriving insight, learning and understanding from data. This process may be automated through advanced algorithms or scaffolded cognitively through graphs. While much emphasis is currently placed on machine learning, there is still much to learn about the role of the data scientist, in particular the thinking process by which the data scientist reaches conclusions. This thinking process needs scaffolding because the human brain is easily overwhelmed by many variables. Graphs are a form of data abstraction and an essential part of the data scientist’s toolkit; they are also a viable scaffold on which the data scientist may gain familiarity with data. But extracting insight from graphs is not always straightforward; it also requires interpretative logic. Using a simple graph type, the Venn diagram, as an example from which to generalize, we discuss various logical fallacies that can be committed when interpreting it. Among the various considerations that dictate how a graph should be tackled, we explain why context is the most important and should form the first guiding principle during data analysis.
Availability of data and material
Not applicable.
Abbreviations
- DS: Data science
- AI: Artificial intelligence
- ML: Machine learning
Funding
WWBG and CCS gratefully acknowledge support from the Accelerating Creativity and Excellence (ACE) and EdeX grants from Nanyang Technological University, Singapore. WWBG also acknowledges an NRF-NSFC grant (Grant No. NRF2018NRF-NSFC003SB-006). LW acknowledges support from a Kwan Im Thong Hood Cho Temple Chair Professorship and from the National Research Foundation Singapore under its AI Singapore Programme (Grant Nos. AISG-100E-2019-027 and AISG-100E-2019-028).
Author information
Contributions
HSY contributed to the initial drafting of the manuscript and the development of the figures. ST and CCS provided insight on the education and scientific logic implications. LW and WWBG initiated the project and supervised. All authors contributed towards writing.
Ethics declarations
Conflicts of interest
The authors have declared no conflicts of interest.
About this article
Cite this article
Ho, S.Y., Tan, S., Sze, C.C. et al. What can Venn diagrams teach us about doing data science better? Int J Data Sci Anal 11, 1–10 (2021). https://doi.org/10.1007/s41060-020-00230-4
DOI: https://doi.org/10.1007/s41060-020-00230-4