DOI: 10.1145/3478431.3499397
research-article

Identifying Common Errors in Open-Ended Machine Learning Projects

Published: 22 February 2022

Abstract

Machine learning (ML) is one of the fastest-growing subfields of computer science, and it is important to identify ways to improve ML education. One key way to do so is to understand the common errors that students make when writing ML programs, so that those errors can be addressed. Prior work investigating ML errors has focused on the instructor perspective and has not examined student programming artifacts, such as projects and code submissions, to understand how these errors occur and which are most common. To address this, we qualitatively coded over 2,500 cells of code from 19 final team projects (63 students) in an upper-division machine learning course. By isolating and codifying common errors and misconceptions across projects, we identify which ML errors students struggle with. We found that library usage, hyperparameter tuning, and misuse of test data were among the most common errors, and we give examples of how and when they occur. We then suggest why these misconceptions may occur and how instructors and software designers might mitigate these errors.
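The abstract names the error categories but does not show the students' code. As a hedged illustration only, the sketch below shows one widely recognized form of test-data misuse, choosing a hyperparameter by its test-set score, next to the usual remedy of tuning on a separate validation split. The synthetic data, the tiny k-nearest-neighbors classifier, and every name in it (make_data, knn_predict, accuracy, candidate_ks) are assumptions made for this example and are not taken from the paper.

```python
import random

def make_data(n, seed):
    """Synthetic 1-D binary data: label is 1 when x > 0, with 20% label noise."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        x = rng.uniform(-1.0, 1.0)
        y = int(x > 0)
        if rng.random() < 0.2:
            y = 1 - y  # flip the label to simulate noise
        data.append((x, y))
    return data

def knn_predict(train, x, k):
    """Majority vote among the k training points nearest to x (k is odd)."""
    nearest = sorted(train, key=lambda p: abs(p[0] - x))[:k]
    votes = sum(label for _, label in nearest)
    return int(2 * votes > k)

def accuracy(train, evaluation, k):
    """Fraction of evaluation points that k-NN, fit on train, labels correctly."""
    correct = sum(knn_predict(train, x, k) == y for x, y in evaluation)
    return correct / len(evaluation)

data = make_data(300, seed=0)
# Three disjoint splits: fit on train, tune on validation, report on test.
train, validation, test = data[:180], data[180:240], data[240:]
candidate_ks = [1, 3, 5, 9, 15]

# Misuse (described, not executed): selecting k with accuracy(train, test, k)
# lets the test set influence model choice, so the reported score is optimistic.

# Remedy: select k on the validation split only.
best_k = max(candidate_ks, key=lambda k: accuracy(train, validation, k))

# The test split is consulted exactly once, after k has been fixed.
test_accuracy = accuracy(train, test, best_k)
print(f"best k = {best_k}, test accuracy = {test_accuracy:.2f}")
```

The design point is that the test split appears in a single line, after all tuning decisions are final; any loop over candidate_ks that reads the test split would reintroduce the leak.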

Supplementary Material

MP4 File (SIGCSE22-V1fp545v.mp4)
Identifying Common Errors in Open-Ended Machine Learning Projects - Presentation

Cited By

  • (2025) PhysioML: A Web-Based Tool for Machine Learning Education with Real-Time Physiological Data. Proceedings of the 56th ACM Technical Symposium on Computer Science Education V. 1, 485-491. DOI: 10.1145/3641554.3701815. Online publication date: 12-Feb-2025
  • (2024) Contract-based Validation of Conceptual Design Bugs for Engineering Complex Machine Learning Software. Proceedings of the ACM/IEEE 27th International Conference on Model Driven Engineering Languages and Systems, 155-161. DOI: 10.1145/3652620.3688201. Online publication date: 22-Sep-2024
  • (2024) Assessing the Rigor of Machine Learning in Physiological Signal Processing Applications. SoutheastCon 2024, 1525-1533. DOI: 10.1109/SoutheastCon52093.2024.10500274. Online publication date: 15-Mar-2024
  • (2023) A Thorough Reproducibility Study on Sentiment Classification: Methodology, Experimental Setting, Results. Information 14:2 (76). DOI: 10.3390/info14020076. Online publication date: 28-Jan-2023

      Published In

      SIGCSE 2022: Proceedings of the 53rd ACM Technical Symposium on Computer Science Education - Volume 1
      February 2022
      1049 pages
      ISBN:9781450390705
      DOI:10.1145/3478431
      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Author Tags

      1. computer science education
      2. data science
      3. machine learning

      Qualifiers

      • Research-article

      Conference

      SIGCSE 2022
      Acceptance Rates

      Overall Acceptance Rate 1,787 of 5,146 submissions, 35%
