DOI: 10.1145/3568812.3603474

Evaluating ChatGPT and GPT-4 for Visual Programming

Published: 13 September 2023

Abstract

Generative AI has the potential to drastically improve the landscape of computing education by automatically generating personalized feedback and content. In particular, this potential lies in the advanced capabilities of state-of-the-art deep generative and large language models such as OpenAI’s Codex [7], ChatGPT [11], and GPT-4 [12]. In our work, we investigate the capabilities of these models in visual programming domains popularly used for K-8 programming education, such as Scratch [17], Hour of Code: Maze Challenge by Code.org [4, 5], and Karel [13].
Recent works have shown sparks of advanced capabilities of such models in various educational scenarios for introductory Python programming [2, 14, 18, 20]. In fact, a 2022 study ranked Codex in the top quartile relative to students in a large Python programming course [8]. However, all these works consider only text-based Python programming and leave open the question of how well these models perform on visual programming. The main research question is: Do state-of-the-art neural generative models show advanced capabilities for visual programming on par with their capabilities on text-based Python programming?
In our work, we evaluate these models for visual programming based on the following three settings designed to capture various generative and problem-solving capabilities:
We conduct our evaluation on 10 representative tasks from two visual programming domains: Hour of Code: Maze Challenge by Code.org [4, 5] and the Intro to Programming with Karel course by CodeHS.com [3, 13]. As illustrative examples, Figures 1, 2, and 3 show the output of GPT-4 in the three settings for the Maze18 task. We will provide a detailed analysis and the prompts used in a longer version of this poster. Our preliminary results for ChatGPT (based on GPT-3.5) and GPT-4 show that these models perform poorly and produce incorrect output the majority of the time. These results highlight that state-of-the-art neural generative models like GPT-4 still struggle to combine the spatial, logical, and programming skills crucial for visual programming. As a next step, it would be important to curate novel benchmarks that the research community can use to evaluate improvements in future versions of these models for visual programming.
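To make concrete what "incorrect output" can mean in a maze domain, the following minimal sketch checks a candidate program by executing it on a grid. It is only an illustration: the abstract does not describe how outputs were judged, and the grid encoding, the command names (`move`, `turnLeft`, `turnRight`), and the `solves` helper are assumptions loosely modeled on maze-style tasks, not the actual Code.org or CodeHS task formats.

```python
# Hypothetical grid encoding (an assumption, not the Code.org format):
# '#' is a wall, '.' is a free cell, 'G' is the goal.
# Headings are indexed clockwise: 0 = north, 1 = east, 2 = south, 3 = west.
DELTAS = [(-1, 0), (0, 1), (1, 0), (0, -1)]


def solves(grid, start, heading, program):
    """Return True if `program` moves the avatar from `start` onto the goal."""
    row, col = start
    for command in program:
        if command == "turnLeft":
            heading = (heading - 1) % 4
        elif command == "turnRight":
            heading = (heading + 1) % 4
        elif command == "move":
            d_row, d_col = DELTAS[heading]
            row, col = row + d_row, col + d_col
            inside = 0 <= row < len(grid) and 0 <= col < len(grid[row])
            if not inside or grid[row][col] == "#":
                return False  # crashed into a wall or left the grid
        else:
            raise ValueError(f"unknown command: {command}")
    # The program counts as a solution only if it ends on the goal cell.
    return grid[row][col] == "G"


GRID = [
    "####",
    "#.G#",
    "#.##",
    "####",
]

# The avatar starts at row 2, column 1, facing north.
print(solves(GRID, (2, 1), 0, ["move", "turnRight", "move"]))
print(solves(GRID, (2, 1), 0, ["move", "move"]))
```

A check of this kind covers only the question of whether a program reaches the goal. Judging a generated task or a step-by-step execution trace would need different criteria.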

References

[1]
Umair Z. Ahmed, Maria Christakis, Aleksandr Efremov, Nigel Fernandez, Ahana Ghosh, Abhik Roychoudhury, and Adish Singla. 2020. Synthesizing Tasks for Block-based Programming. In NeurIPS.
[2]
Sébastien Bubeck, Varun Chandrasekaran, Ronen Eldan, Johannes Gehrke, Eric Horvitz, Ece Kamar, Peter Lee, Yin Tat Lee, Yuanzhi Li, Scott M. Lundberg, Harsha Nori, Hamid Palangi, Marco Túlio Ribeiro, and Yi Zhang. 2023. Sparks of Artificial General Intelligence: Early Experiments with GPT-4. CoRR abs/2303.12712 (2023).
[3]
CodeHS. 2012. Intro to Programming with Karel the Dog. https://codehs.com/info/curriculum/introkarel.
[4]
Code.org. 2013. Code.org: Learn Computer Science. https://code.org/.
[5]
Code.org. 2013. Hour of Code: Classic Maze Challenge. https://studio.code.org/s/hourofcode.
[6]
Aleksandr Efremov, Ahana Ghosh, and Adish Singla. 2020. Zero-shot Learning of Hint Policy via Reinforcement Learning and Program Synthesis. In EDM.
[7]
Mark Chen et al. 2021. Evaluating Large Language Models Trained on Code. CoRR abs/2107.03374 (2021).
[8]
James Finnie-Ansley, Paul Denny, Brett A. Becker, Andrew Luxton-Reilly, and James Prather. 2022. The Robots Are Coming: Exploring the Implications of OpenAI Codex on Introductory Programming. In ACE.
[9]
Ahana Ghosh, Sebastian Tschiatschek, Sam Devlin, and Adish Singla. 2022. Adaptive Scaffolding in Block-Based Programming via Synthesizing New Tasks as Pop Quizzes. In AIED.
[10]
Samiha Marwan, Yang Shi, Ian Menezes, Min Chi, Tiffany Barnes, and Thomas W. Price. 2021. Just a Few Expert Constraints Can Help: Humanizing Data-Driven Subgoal Detection for Novice Programming. In EDM.
[11]
OpenAI. 2023. ChatGPT. https://openai.com/blog/chatgpt.
[12]
OpenAI. 2023. GPT-4 Technical Report. CoRR abs/2303.08774 (2023).
[13]
Richard E. Pattis, Jim Roberts, and Mark Stehlik. 1995. Karel the Robot: A Gentle Introduction to the Art of Programming. John Wiley & Sons, Inc.
[14]
Tung Phung, José Cambronero, Sumit Gulwani, Tobias Kohn, Rupak Majumdar, Adish Singla, and Gustavo Soares. 2023. Generating High-Precision Feedback for Programming Syntax Errors using Large Language Models. In EDM.
[15]
Chris Piech, Mehran Sahami, Jonathan Huang, and Leonidas J. Guibas. 2015. Autonomously Generating Hints by Inferring Problem Solving Policies. In L@S.
[16]
Thomas W. Price, Rui Zhi, and Tiffany Barnes. 2017. Hint Generation Under Uncertainty: The Effect of Hint Quality on Help-Seeking Behavior. In AIED.
[17]
Mitchel Resnick, John H. Maloney, Andrés Monroy-Hernández, Natalie Rusk, Evelyn Eastmond, Karen Brennan, Amon Millner, Eric Rosenbaum, Jay S. Silver, Brian Silverman, and Yasmin B. Kafai. 2009. Scratch: Programming for All. Communications of the ACM 52, 11 (2009), 60–67.
[18]
Sami Sarsa, Paul Denny, Arto Hellas, and Juho Leinonen. 2022. Automatic Generation of Programming Exercises and Code Explanations Using Large Language Models. In ICER.
[19]
Adish Singla and Nikitas Theodoropoulos. 2022. From Solution Synthesis to Student Attempt Synthesis for Block-Based Visual Programming Tasks. In EDM.
[20]
Jialu Zhang, José Cambronero, Sumit Gulwani, Vu Le, Ruzica Piskac, Gustavo Soares, and Gust Verbruggen. 2022. Repairing Bugs in Python Assignments Using Large Language Models. CoRR abs/2209.14876 (2022).

Published In

ICER '23: Proceedings of the 2023 ACM Conference on International Computing Education Research - Volume 2
August 2023
140 pages
ISBN: 9781450399753
DOI: 10.1145/3568812
Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the Owner/Author.

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. ChatGPT
  2. block-based visual programming
  3. generative AI
  4. introductory programming education
  5. large language models

Qualifiers

  • Abstract
  • Research
  • Refereed limited

Conference

ICER 2023

Acceptance Rates

Overall Acceptance Rate 189 of 803 submissions, 24%
