Evaluating ChatGPT and GPT-4 for Visual Programming

Author: Adish Singla

ICER '23: Proceedings of the 2023 ACM Conference on International Computing Education Research - Volume 2

Pages 14 - 15

https://doi.org/10.1145/3568812.3603474

Published: 13 September 2023

Abstract

Generative AI has the potential to drastically improve the landscape of computing education by automatically generating personalized feedback and content. In particular, this potential lies in the advanced capabilities of state-of-the-art deep generative and large language models such as OpenAI’s Codex [7], ChatGPT [11], and GPT-4 [12]. In our work, we seek to investigate the capabilities of these models in visual programming domains popularly used for K-8 programming education, including domains like Scratch [17], Hour of Code: Maze Challenge by Code.org [4, 5], and Karel [13].

Recent works have shown sparks of advanced capabilities of such models for various education scenarios in introductory Python programming [2, 14, 18, 20]. In fact, a study in 2022 ranked Codex in the top quartile relative to students in a large Python programming course [8]. However, all these works consider only text-based Python programming and leave open the question of how well these models would perform for visual programming. The main research question is: Do state-of-the-art neural generative models show advanced capabilities for visual programming on par with their capabilities on text-based Python programming?

In our work, we evaluate these models for visual programming in three settings designed to capture various generative and problem-solving capabilities.

We conduct our evaluation on 10 representative tasks from two visual programming domains: Hour of Code: Maze Challenge by Code.org [4, 5] and the Intro to Programming with Karel course by CodeHS.com [3, 13]. As illustrative examples, Figures 1, 2, and 3 show the output of GPT-4 in the three settings for the Maze18 task. We will provide the detailed analysis and the prompts used in a longer version of this poster. Our preliminary results for ChatGPT (based on GPT-3.5) and GPT-4 show that these models perform poorly and produce incorrect output the majority of the time. These results highlight that state-of-the-art neural generative models like GPT-4 still struggle to combine the spatial, logical, and programming skills crucial for visual programming. As a next step, it would be important to curate novel benchmarks that the research community can use to evaluate improvements in future versions of these models for visual programming.
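To make concrete why such tasks mix spatial, logical, and programming reasoning, the following is a minimal sketch of executing a block-style program on a grid. Everything in it is an assumption for illustration: the grid encoding, the command names (`move`, `turn_left`, `turn_right`), and the function `run_program` are hypothetical and are not the Maze18 task, the Code.org or CodeHS environments, or the evaluation setup used in this work.

```python
from typing import List, Tuple

# Headings as (row delta, col delta), listed clockwise: north, east, south, west.
HEADINGS = [(-1, 0), (0, 1), (1, 0), (0, -1)]


def run_program(grid: List[str], program: List[str]) -> Tuple[bool, Tuple[int, int]]:
    """Execute a flat command list on a hypothetical maze.

    grid: rows of characters; '#' is a wall, '.' is free, 'S' is the start
          (the avatar is assumed to face east), 'G' is the goal.
    program: commands among 'move', 'turn_left', 'turn_right'.
    Returns (reached_goal, final_position). Moving into a wall or off the
    grid is treated as a crash and stops execution at the current cell.
    """
    r, c = next(
        (i, j) for i, row in enumerate(grid) for j, ch in enumerate(row) if ch == "S"
    )
    heading = 1  # index into HEADINGS: east
    for cmd in program:
        if cmd == "turn_left":
            heading = (heading - 1) % 4
        elif cmd == "turn_right":
            heading = (heading + 1) % 4
        elif cmd == "move":
            dr, dc = HEADINGS[heading]
            nr, nc = r + dr, c + dc
            in_bounds = 0 <= nr < len(grid) and 0 <= nc < len(grid[nr])
            if not in_bounds or grid[nr][nc] == "#":
                return False, (r, c)  # crash: stop where we are
            r, c = nr, nc
        else:
            raise ValueError(f"unknown command: {cmd}")
    return grid[r][c] == "G", (r, c)


# A made-up 3x3 maze and one program that solves it.
GRID = [
    "S.#",
    "#.#",
    "#.G",
]
SOLUTION = ["move", "turn_right", "move", "move", "turn_left", "move"]

if __name__ == "__main__":
    print(run_program(GRID, SOLUTION))
    print(run_program(GRID, ["move", "move"]))
```

Even in this toy setting, writing or checking a solution requires tracking position and heading across steps, which is the kind of combined spatial and programming reasoning the results above concern.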

References

[1] Umair Z. Ahmed, Maria Christakis, Aleksandr Efremov, Nigel Fernandez, Ahana Ghosh, Abhik Roychoudhury, and Adish Singla. 2020. Synthesizing Tasks for Block-based Programming. In NeurIPS.
