DOI: 10.1145/3576050.3576098

Using Transformer Language Models to Validate Peer-Assigned Essay Scores in Massive Open Online Courses (MOOCs)

Published: 13 March 2023

ABSTRACT

Massive Open Online Courses (MOOCs), such as those offered by Coursera, are popular ways for adults to gain important skills, advance their careers, and pursue their interests. Within these courses, students are often required to compose, submit, and peer review written essays, providing a valuable pedagogical experience for the student and a wealth of natural language data for the educational researcher. However, the scores provided by peers do not always reflect the actual quality of the text, raising questions about the reliability and validity of those scores. This study evaluates methods to increase the reliability of MOOC peer-review ratings through a series of validation tests on peer-reviewed essays. Reviewer reliability was assessed based on correlations between text length and essay quality. Raters were pruned based on score variance and the lexical diversity observed in their comments to create subsets of raters. Each subset was then used as training data to fine-tune DistilBERT large language models to automatically score essay quality as a measure of validation. The accuracy of each language model was evaluated for each subset. We find that training language models on data subsets produced by more reliable raters, identified through a combination of score variance and lexical diversity, produces more accurate essay scoring models. The approach developed in this study should allow for enhanced reliability of peer-review scoring in MOOCs, affording greater credibility within these systems.
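The abstract describes the pipeline at a high level: peer raters are pruned by score variance and by the lexical diversity of their comments, and the retained ratings are used to fine-tune DistilBERT to predict essay scores. A minimal sketch of that kind of pipeline, using the Hugging Face transformers library, is shown below. The file name, column names, pruning thresholds, and the type-token-ratio measure of lexical diversity are illustrative assumptions, not the paper's actual data layout or criteria.

```python
# Hypothetical sketch of the pipeline outlined in the abstract: prune peer raters by
# score variance and comment lexical diversity, then fine-tune DistilBERT to predict
# essay scores on the retained subset. Column names and thresholds are illustrative.
import pandas as pd
import torch
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          TrainingArguments, Trainer)

def type_token_ratio(text: str) -> float:
    """Simple lexical-diversity proxy: unique tokens divided by total tokens."""
    tokens = text.lower().split()
    return len(set(tokens)) / len(tokens) if tokens else 0.0

# Assumed layout: one row per peer review with rater_id, essay_text, score, comment.
df = pd.read_csv("ratings.csv")

# Per-rater statistics: variance of assigned scores and mean lexical diversity of comments.
stats = df.groupby("rater_id").agg(
    score_var=("score", "var"),
    mean_ttr=("comment", lambda c: c.map(type_token_ratio).mean()),
)
# Illustrative pruning rule: keep raters whose scores vary and whose comments are diverse.
reliable = stats[(stats.score_var > 0.0) & (stats.mean_ttr > 0.5)].index
subset = df[df.rater_id.isin(reliable)]

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=1, problem_type="regression")

class EssayDataset(torch.utils.data.Dataset):
    """Wraps tokenized essays and their peer-assigned scores for regression training."""
    def __init__(self, texts, scores):
        self.enc = tokenizer(list(texts), truncation=True, padding=True, max_length=512)
        self.scores = list(scores)
    def __len__(self):
        return len(self.scores)
    def __getitem__(self, i):
        item = {k: torch.tensor(v[i]) for k, v in self.enc.items()}
        item["labels"] = torch.tensor(float(self.scores[i]))
        return item

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="essay-scorer", num_train_epochs=3,
                           per_device_train_batch_size=8),
    train_dataset=EssayDataset(subset.essay_text, subset.score),
)
trainer.train()
```

In practice, a model fine-tuned on each rater subset could then be scored against held-out essays (for example, by correlating predicted and peer-assigned scores) to compare which pruning strategy yields the most reliable training data.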


          • Published in

            LAK23: 13th International Learning Analytics and Knowledge Conference
            March 2023, 692 pages
            ISBN: 9781450398657
            DOI: 10.1145/3576050

            Copyright © 2023 ACM
            Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

            Publisher

            Association for Computing Machinery

            New York, NY, United States

            Publication History

            • Published: 13 March 2023


            Qualifiers

            • research-article
            • Research
            • Refereed limited

            Acceptance Rates

            Overall Acceptance Rate: 236 of 782 submissions, 30%
