Research article | DOI: 10.1145/3408877.3432397

Investigating Item Bias in a CS1 Exam with Differential Item Functioning

Published: 05 March 2021

Abstract

Reliable and valid exams are a crucial part of both sound research design and trustworthy assessment of student knowledge, and assessing and addressing item bias is an essential step in building a validity argument for any assessment instrument. Despite calls for valid assessment tools in CS, item bias is rarely investigated. What kinds of item bias might appear in conventional CS1 exams? To investigate this, we examined responses to a final exam in a large CS1 course, using differential item functioning (DIF) methods to look specifically for bias related to binary gender and year of study. Although not a published assessment instrument, the exam followed a format common to many exams in higher education and research: students trace code and write programs using paper and pencil. One item on the exam showed statistically significant DIF, though the magnitude of the effect was negligible. This case study demonstrates how to detect DIF items so that future researchers and practitioners can conduct these analyses.
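The abstract does not name the specific DIF procedure used, but a common screening approach for dichotomous (right/wrong) items is logistic-regression DIF: fit nested logistic models that predict an item response from a matching criterion (such as a total or rest score), the group variable, and their interaction, then test the group terms for uniform and nonuniform DIF and check an effect-size measure before flagging an item. The sketch below illustrates that general approach under assumptions not stated in the paper: a hypothetical table of 0/1 item responses named item1..itemN, a binary gender column, a rest-score matching criterion, and McFadden's pseudo-R-squared change as a rough effect-size check.

# Illustrative sketch (not the authors' code): logistic-regression DIF screening
# for dichotomous items. Assumes a CSV with 0/1 columns named item1..itemN and a
# binary 'gender' column; all file and column names here are hypothetical.
import pandas as pd
import statsmodels.formula.api as smf
from scipy.stats import chi2


def logistic_dif(df: pd.DataFrame, item: str, group: str = "gender") -> dict:
    item_cols = [c for c in df.columns if c.startswith("item")]
    data = df.copy()
    # Matching criterion: rest score (total correct, excluding the studied item).
    data["ability"] = data[item_cols].sum(axis=1) - data[item]
    data["resp"] = data[item]

    # Nested models: ability only; + group (uniform DIF); + interaction (nonuniform DIF).
    m0 = smf.logit("resp ~ ability", data=data).fit(disp=0)
    m1 = smf.logit(f"resp ~ ability + C({group})", data=data).fit(disp=0)
    m2 = smf.logit(f"resp ~ ability * C({group})", data=data).fit(disp=0)

    # Likelihood-ratio tests between nested models.
    lr_uniform = 2 * (m1.llf - m0.llf)      # group main effect
    lr_nonuniform = 2 * (m2.llf - m1.llf)   # group-by-ability interaction
    lr_overall = 2 * (m2.llf - m0.llf)      # both group terms at once

    return {
        "item": item,
        "p_uniform": chi2.sf(lr_uniform, df=1),
        "p_nonuniform": chi2.sf(lr_nonuniform, df=1),
        "p_overall": chi2.sf(lr_overall, df=2),
        # Rough effect size: change in McFadden pseudo-R^2 between m0 and m2.
        # (Published DIF cutoffs are usually stated for other pseudo-R^2 variants,
        # so treat this only as a relative indicator.)
        "delta_pseudo_r2": m2.prsquared - m0.prsquared,
    }


if __name__ == "__main__":
    df = pd.read_csv("exam_responses.csv")  # hypothetical input file
    flags = [logistic_dif(df, c) for c in df.columns if c.startswith("item")]
    # Items with small p-values are DIF candidates; a significant item can still
    # have a negligible effect size, as with the single flagged item in this study.
    print(pd.DataFrame(flags).sort_values("p_overall"))

In practice, one would also correct the per-item p-values for multiple comparisons and consider purifying the matching criterion (re-computing it after excluding flagged items) before interpreting any flag as evidence of item bias.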

    Published In

    SIGCSE '21: Proceedings of the 52nd ACM Technical Symposium on Computer Science Education
    March 2021
    1454 pages
    ISBN:9781450380621
    DOI:10.1145/3408877
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 05 March 2021

    Author Tags

    1. cs1
    2. differential item functioning
    3. equity
    4. psychometrics
    5. validity

    Qualifiers

    • Research-article

    Funding Sources

    • National Science Foundation

    Conference

    SIGCSE '21

    Acceptance Rates

    Overall Acceptance Rate 1,787 of 5,146 submissions, 35%
