DOI: 10.1145/3576050.3576098

Using Transformer Language Models to Validate Peer-Assigned Essay Scores in Massive Open Online Courses (MOOCs)

Published: 13 March 2023

ABSTRACT

Massive Open Online Courses (MOOCs), such as those offered by Coursera, are popular ways for adults to gain important skills, advance their careers, and pursue their interests. Within these courses, students are often required to compose, submit, and peer review written essays, providing a valuable pedagogical experience for the student and a wealth of natural language data for the educational researcher. However, the scores provided by peers do not always reflect the actual quality of the text, raising questions about the reliability and validity of those scores. This study evaluates methods to increase the reliability of MOOC peer-review ratings through a series of validation tests on peer-reviewed essays. Reviewer reliability was assessed based on correlations between text length and essay quality. Raters were pruned based on score variance and the lexical diversity observed in their comments to create subsets of raters. Each subset was then used as training data to fine-tune DistilBERT large language models to automatically score essay quality as a measure of validation. The accuracy of each language model was evaluated for each subset. We find that training language models on data subsets produced by more reliable raters, identified through a combination of score variance and lexical diversity, produces more accurate essay scoring models. The approach developed in this study should allow for enhanced reliability of peer-review scoring in MOOCs, affording greater credibility within these systems.
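The abstract describes the pipeline at a high level: peer raters are pruned by score variance and by the lexical diversity of their comments, and the retained ratings are used to fine-tune DistilBERT to predict essay scores. A minimal sketch of that kind of pipeline, using the Hugging Face transformers library, is shown below. The file name, column names, pruning thresholds, and the type-token-ratio measure of lexical diversity are illustrative assumptions, not the paper's actual data layout or criteria.

```python
# Hypothetical sketch of the pipeline outlined in the abstract: prune peer raters by
# score variance and comment lexical diversity, then fine-tune DistilBERT to predict
# essay scores on the retained subset. Column names and thresholds are illustrative.
import pandas as pd
import torch
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          TrainingArguments, Trainer)

def type_token_ratio(text: str) -> float:
    """Simple lexical-diversity proxy: unique tokens divided by total tokens."""
    tokens = text.lower().split()
    return len(set(tokens)) / len(tokens) if tokens else 0.0

# Assumed layout: one row per peer review with rater_id, essay_text, score, comment.
df = pd.read_csv("ratings.csv")

# Per-rater statistics: variance of assigned scores and mean lexical diversity of comments.
stats = df.groupby("rater_id").agg(
    score_var=("score", "var"),
    mean_ttr=("comment", lambda c: c.map(type_token_ratio).mean()),
)
# Illustrative pruning rule: keep raters whose scores vary and whose comments are diverse.
reliable = stats[(stats.score_var > 0.0) & (stats.mean_ttr > 0.5)].index
subset = df[df.rater_id.isin(reliable)]

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=1, problem_type="regression")

class EssayDataset(torch.utils.data.Dataset):
    """Wraps tokenized essays and their peer-assigned scores for regression training."""
    def __init__(self, texts, scores):
        self.enc = tokenizer(list(texts), truncation=True, padding=True, max_length=512)
        self.scores = list(scores)
    def __len__(self):
        return len(self.scores)
    def __getitem__(self, i):
        item = {k: torch.tensor(v[i]) for k, v in self.enc.items()}
        item["labels"] = torch.tensor(float(self.scores[i]))
        return item

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="essay-scorer", num_train_epochs=3,
                           per_device_train_batch_size=8),
    train_dataset=EssayDataset(subset.essay_text, subset.score),
)
trainer.train()
```

In practice, a model fine-tuned on each rater subset could then be scored against held-out essays (for example, by correlating predicted and peer-assigned scores) to compare which pruning strategy yields the most reliable training data.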


          • Published in

            LAK23: 13th International Learning Analytics and Knowledge Conference
            March 2023, 692 pages
            ISBN: 9781450398657
            DOI: 10.1145/3576050

            Copyright © 2023 ACM
            Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

            Publisher

            Association for Computing Machinery

            New York, NY, United States

            Publication History

            • Published: 13 March 2023


            Qualifiers

            • research-article
            • Research
            • Refereed limited

            Acceptance Rates

            Overall Acceptance Rate: 236 of 782 submissions, 30%
