ABSTRACT
Designing assessments in classroom contexts, or having them generated automatically, requires, among other things, knowledge about the difficulty of what is assessed. Estimates of difficulty can be derived empirically, usually by piloting items, or theoretically from models; empirical results, in turn, can inform theory and refine the models. In this article, we compare four methods of estimating item difficulty for a typical topic of introductory programming courses: control flow. For a set of items that have been tested empirically, we also collected expert ratings and applied measures of code complexity from both software engineering and computer science education research. The results show some overlap between empirical results and theoretical predictions. However, for the simple item format we have been using, all of the models fall short of explaining the observed variance in difficulty. Empirical difficulty, in turn, can serve as the basis for rules to be used for item generation in the future.
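As background for the comparison, the sketch below contrasts two of the estimate families named in the abstract: empirical difficulty in the classical test theory sense (the proportion of correct responses to an item) and a code complexity measure from software engineering (McCabe's cyclomatic complexity, here approximated by counting branch keywords). This is a minimal illustration only; the item names, response data, and the keyword-count approximation are assumptions for the example, not material from the study.

```python
# Illustrative sketch (assumed data, not from the study) contrasting two
# families of item-difficulty estimates discussed in the article.
import re

# Hypothetical response matrix: item name -> scored answers (1 = correct).
responses = {
    "single_if":   [1, 1, 1, 0, 1, 1, 1, 1],
    "nested_loop": [1, 0, 0, 1, 0, 0, 1, 0],
}

def empirical_difficulty(scores):
    """Classical item difficulty (p-value): proportion of correct responses.
    Higher values indicate easier items."""
    return sum(scores) / len(scores)

def cyclomatic_complexity(code):
    """Rough McCabe cyclomatic complexity for Python-like code: one plus
    the number of decision points (branch keywords and short-circuit
    operators). A crude stand-in for a real control-flow-graph analysis."""
    decisions = re.findall(r"\b(?:if|elif|for|while|and|or)\b", code)
    return 1 + len(decisions)

# A toy control-flow item of the kind the abstract describes.
item_code = """
total = 0
for i in range(3):
    if i % 2 == 0:
        total += i
"""

if __name__ == "__main__":
    for name, scores in responses.items():
        print(f"{name}: empirical p = {empirical_difficulty(scores):.2f}")
    print(f"cyclomatic complexity of item: {cyclomatic_complexity(item_code)}")
```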
Recommendations
Item Difficulty Analysis of English Vocabulary Questions
CSEDU 2016: Proceedings of the 8th International Conference on Computer Supported Education
This study investigates the relations between several factors of question items in English vocabulary tests and the corresponding item difficulty. Designing the item difficulty of a test impacts the quality of the test itself. Our goal is suggesting a ...
A Systematic Review of Data-Driven Approaches to Item Difficulty Prediction
Artificial Intelligence in Education
Assessment quality and validity is heavily reliant on the quality of items included in an assessment or test. Difficulty is an essential factor that can determine items and tests’ overall quality. Therefore, item difficulty prediction is extremely ...
Towards validity for a formative assessment for language-specific program tracing skills
Koli Calling '19: Proceedings of the 19th Koli Calling International Conference on Computing Education Research
Formative assessments can have positive effects on learning, but few exist for computing, even for basic skills such as program tracing. Instead, teachers often rely on overly broad test questions that lack the diagnostic granularity needed to measure ...