Abstract

Datasets are crucial for data-driven decision-making in businesses and organizations: they make it possible to optimize processes and identify opportunities for improvement. They are also the foundation for training machine learning models to recognize patterns and make predictions, and they span many forms of data, from simple text and numbers to complex images and graphs. Datasets are equally vital for educating future data scientists and machine learning engineers, since hands-on work with them builds analytical skills and practical knowledge. University curricula, however, show a gap in teaching dataset creation skills, which are essential for ensuring the quality of datasets and the robustness of the decision processes built on them. This paper proposes to address this educational gap by developing methodologies for teaching dataset creation in university settings, taking into account that the skill levels of students differ from those of professionals.

Notes

  1. https://github.com/HumanSignal/label-studio/

  2. https://en.wikipedia.org/wiki/Doctor_Cha

Author information

Correspondence to Luigi Laura.

Ethics declarations

Disclosure of Interests

The authors have no competing interests to declare that are relevant to the content of this article.

Copyright information

© 2024 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Fantozzi, P., Laura, L., Naldi, M. (2024). Teaching Dataset Creation in a Classroom Environment. In: Herodotou, C., et al. Methodologies and Intelligent Systems for Technology Enhanced Learning, 14th International Conference. MIS4TEL 2024. Lecture Notes in Networks and Systems, vol 1171. Springer, Cham. https://doi.org/10.1007/978-3-031-73538-7_19
