ABSTRACT
Reliable and valid exams are a crucial part of both sound research design and trustworthy assessment of student knowledge. Assessing and addressing item bias is an essential step in building a validity argument for any assessment instrument. Despite calls for valid assessment tools in CS, item bias is rarely investigated. What kinds of item bias might appear in conventional CS1 exams? To investigate this, we examined responses to a final exam in a large CS1 course, using differential item functioning (DIF) methods to investigate bias related to binary gender and year of study. Although not a published assessment instrument, the exam had a format similar to many exams used in higher education and in research: students were asked to trace code and write programs using paper and pencil. One item with significant DIF was detected on the exam, though its magnitude was negligible. This case study demonstrates how to detect DIF items so that future researchers and practitioners can conduct similar analyses.
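For readers who want to run such a screen themselves, the standard logistic-regression DIF procedure regresses each dichotomous item on a matching criterion (e.g., a total or rest score), the grouping variable, and their interaction; nested models are then compared with likelihood-ratio tests, and the change in pseudo-R² serves as an effect-size check on flagged items. The sketch below is a minimal Python illustration on synthetic data, assuming `statsmodels` and `scipy` are available; the variable names and simulated ability score are placeholders, not the paper's actual data or pipeline.

```python
# Minimal sketch of logistic-regression DIF screening for one dichotomous item.
# All data here are synthetic and the column names (`total`, `group`, `item`)
# are illustrative assumptions, not taken from the paper's dataset.
import numpy as np
import statsmodels.api as sm
from scipy.stats import chi2

rng = np.random.default_rng(0)
n = 400
total = rng.normal(0.0, 1.0, n)   # matching criterion (stand-in for total/rest score)
group = rng.integers(0, 2, n)     # 0 = reference group, 1 = focal group
# Simulate responses with a small uniform-DIF effect (the 0.4 * group term).
p = 1.0 / (1.0 + np.exp(-(0.2 + 1.2 * total + 0.4 * group)))
item = rng.binomial(1, p)

def fit_logit(X):
    """Fit a logistic regression of the item on the given predictors."""
    return sm.Logit(item, sm.add_constant(X)).fit(disp=0)

m_base    = fit_logit(np.column_stack([total]))                        # ability only
m_uniform = fit_logit(np.column_stack([total, group]))                 # + group
m_full    = fit_logit(np.column_stack([total, group, total * group]))  # + interaction

# Likelihood-ratio tests (1 df each): uniform DIF, then non-uniform DIF.
lr_uniform = 2 * (m_uniform.llf - m_base.llf)
lr_nonunif = 2 * (m_full.llf - m_uniform.llf)
# Magnitude: change in McFadden pseudo-R^2 across the nested models, one
# common effect-size screen for separating significant from negligible DIF.
delta_r2 = m_full.prsquared - m_base.prsquared

print(f"uniform DIF:     LR = {lr_uniform:.2f}, p = {chi2.sf(lr_uniform, 1):.4f}")
print(f"non-uniform DIF: LR = {lr_nonunif:.2f}, p = {chi2.sf(lr_nonunif, 1):.4f}")
print(f"effect size (pseudo-R^2 change) = {delta_r2:.4f}")
```

In practice this screen is run per item with a multiple-comparison correction, and an item is flagged only when the likelihood-ratio test is significant *and* the pseudo-R² change exceeds an effect-size threshold; that two-part criterion is what allows a result like the one reported here, where an item shows statistically significant DIF of negligible magnitude.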
Recommendations
Domain Experts' Interpretations of Assessment Bias in a Scaled, Online Computer Science Curriculum
L@S '21: Proceedings of the Eighth ACM Conference on Learning @ Scale
Understanding inequity at scale is necessary for designing equitable online learning experiences, but also difficult. Statistical techniques like differential item functioning (DIF) can help identify whether items/questions in an assessment exhibit ...
Replicating a Validated CS1 Assessment (Abstract Only)
SIGCSE '16: Proceedings of the 47th ACM Technical Symposium on Computing Science Education
Validated assessments are important for teachers and researchers. A validated assessment is carefully developed to make sure that it is measuring the right things. Computing education needs more and better validated assessments. Validated assessments ...
Intersectional Biases Within an Introductory Computing Assessment
SIGCSE 2024: Proceedings of the 55th ACM Technical Symposium on Computer Science Education V. 1
Assessments that can measure student understanding of concepts in a reliable and valid way are incredibly valuable in research. Unfortunately, assessments can be a source of bias, differentially impacting students along various demographic lines. ...