Research article
DOI: 10.1145/3408877.3432397

Investigating Item Bias in a CS1 Exam with Differential Item Functioning

Published: 05 March 2021

ABSTRACT

Reliable and valid exams are a crucial part of both sound research design and trustworthy assessment of student knowledge. Assessing and addressing item bias is an essential step in building a validity argument for any assessment instrument. Despite calls for valid assessment tools in CS, item bias is rarely investigated. What kinds of item bias might appear in conventional CS1 exams? To investigate this, we examined responses to a final exam in a large CS1 course. We used differential item functioning (DIF) methods and specifically investigated bias related to binary gender and year of study. Although not a published assessment instrument, the exam had a format similar to many exams in higher education and research: students were asked to trace code and write programs using paper and pencil. One item with significant DIF was detected on the exam, though the magnitude of the effect was negligible. This case study shows how to detect DIF items so that future researchers and practitioners can conduct these analyses themselves.
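The abstract summarizes the analysis but does not show how a DIF check is actually carried out. As a rough illustration only (not the authors' code), the sketch below applies the widely used logistic-regression DIF procedure to a single dichotomously scored item: a likelihood-ratio test compares a model that predicts item correctness from total score alone against one that adds group membership and a score-by-group interaction, with a pseudo-R-squared difference as the effect size. Every name in the sketch (logistic_dif, the column names, the 0/1 codings) is an illustrative assumption.

import numpy as np
import pandas as pd
import statsmodels.api as sm
from scipy import stats

def logistic_dif(item, total, group):
    """Logistic-regression DIF check for one dichotomously scored item.

    item  : 1/0 correctness on the item under study
    total : matching criterion, e.g. the total exam score
    group : 1/0 group membership (e.g. focal vs. reference group)
    Returns (LR statistic, p-value, change in Nagelkerke pseudo-R^2).
    """
    d = pd.DataFrame({"item": item, "total": total, "group": group})
    d["txg"] = d["total"] * d["group"]

    # Baseline: item correctness explained by the matching criterion only.
    m0 = sm.Logit(d["item"], sm.add_constant(d[["total"]])).fit(disp=0)
    # Augmented: adds group (uniform DIF) and the score-by-group
    # interaction (non-uniform DIF).
    m1 = sm.Logit(d["item"],
                  sm.add_constant(d[["total", "group", "txg"]])).fit(disp=0)

    lr = 2 * (m1.llf - m0.llf)   # likelihood-ratio statistic, 2 df
    p = stats.chi2.sf(lr, df=2)

    n = len(d)
    def nagelkerke(m):
        # Cox-Snell pseudo-R^2 rescaled to the 0-1 range.
        cox_snell = 1 - np.exp((2 / n) * (m.llnull - m.llf))
        return cox_snell / (1 - np.exp((2 / n) * m.llnull))

    return lr, p, nagelkerke(m1) - nagelkerke(m0)

In a full analysis one would loop this over every item on the exam, correct for multiple comparisons, and pair any statistically flagged item with an effect-size threshold, since, as the abstract notes, significant DIF can still be negligible in magnitude.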


Published in
SIGCSE '21: Proceedings of the 52nd ACM Technical Symposium on Computer Science Education
March 2021, 1454 pages
ISBN: 9781450380621
DOI: 10.1145/3408877
Copyright © 2021 ACM


Publisher: Association for Computing Machinery, New York, NY, United States
