Abstract
Datasets are crucial for data-driven decision-making in businesses and organizations, allowing for the optimization of processes and the identification of improvement opportunities. They are foundational for training machine learning models to recognize patterns and make predictions, and they span many forms of data, from simple text and numbers to complex images and graphs. Datasets are also vital for educating future data scientists and machine learning engineers, offering hands-on experience that enhances analytical skills and practical knowledge. However, university curricula show a gap in teaching dataset creation skills, which are essential for ensuring the quality of datasets and the robustness of the decision processes built on them. This paper proposes to address this educational gap by developing methodologies for teaching dataset creation within university settings, taking into account that students' skill levels differ from those of professionals.
Ethics declarations
Disclosure of Interests
The authors have no competing interests to declare that are relevant to the content of this article.
Copyright information
© 2024 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Fantozzi, P., Laura, L., Naldi, M. (2024). Teaching Dataset Creation in a Classroom Environment. In: Herodotou, C., et al. Methodologies and Intelligent Systems for Technology Enhanced Learning, 14th International Conference. MIS4TEL 2024. Lecture Notes in Networks and Systems, vol 1171. Springer, Cham. https://doi.org/10.1007/978-3-031-73538-7_19
DOI: https://doi.org/10.1007/978-3-031-73538-7_19
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-73537-0
Online ISBN: 978-3-031-73538-7
eBook Packages: Intelligent Technologies and Robotics (R0)