ABSTRACT
Massive Open Online Courses (MOOCs) such as those offered by Coursera are a popular way for adults to gain important skills, advance their careers, and pursue their interests. Within these courses, students are often required to compose, submit, and peer review written essays, providing a valuable pedagogical experience for the student and a wealth of natural language data for the educational researcher. However, the scores assigned by peers do not always reflect the actual quality of the text, raising questions about their reliability and validity. This study evaluates methods to increase the reliability of MOOC peer-review ratings through a series of validation tests on peer-reviewed essays. Reviewer reliability was estimated from correlations between text length and essay quality. Raters were then pruned based on the variance of their scores and the lexical diversity observed in their comments, creating subsets of raters. Each subset was used as training data to fine-tune DistilBERT language models that automatically score essay quality, providing a measure of validation, and the accuracy of the model trained on each subset was evaluated. We find that models trained on data from raters judged more reliable by a combination of score variance and lexical diversity produce more accurate essay scoring models. The approach developed in this study should enhance the reliability of peer-assigned scores in MOOCs, affording greater credibility to these systems.
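The rater-pruning step described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the rater names, thresholds, and the use of a simple type-token ratio as the lexical-diversity measure are all assumptions made for the example.

```python
from statistics import pvariance

def type_token_ratio(text):
    """Lexical diversity of a rater's comments: unique words / total words."""
    words = text.lower().split()
    return len(set(words)) / len(words) if words else 0.0

def prune_raters(raters, min_variance=0.5, min_ttr=0.4):
    """Keep raters whose scores show variance (they are not assigning the
    same score to every essay) and whose comments show lexical diversity
    (suggesting substantive feedback rather than repeated boilerplate)."""
    kept = {}
    for name, data in raters.items():
        variance = pvariance(data["scores"])
        ttr = type_token_ratio(" ".join(data["comments"]))
        if variance >= min_variance and ttr >= min_ttr:
            kept[name] = data
    return kept

# Hypothetical raters: r1 varies their scores and writes varied comments;
# r2 gives every essay the same score and repeats the same word.
raters = {
    "r1": {"scores": [1, 3, 5, 2],
           "comments": ["Clear thesis but weak evidence in body paragraphs."]},
    "r2": {"scores": [5, 5, 5, 5],
           "comments": ["good good good good"]},
}
reliable = prune_raters(raters)  # only r1 survives the pruning
```

The surviving raters' scored essays would then form the training subset for a fine-tuned scoring model; the thresholds here are placeholders, and the paper's actual cutoffs and diversity index may differ.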
Index Terms
- Using Transformer Language Models to Validate Peer-Assigned Essay Scores in Massive Open Online Courses (MOOCs)