Rater-Effect IRT Model Integrating Supervised LDA for Accurate Measurement of Essay Writing Ability

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 11625)

Abstract

Essay-writing tests are widely used in various assessment contexts to measure higher-order abilities of learners. However, a persistent difficulty is that ability measurement accuracy strongly depends on rater characteristics. To resolve this problem, many item response theory (IRT) models have been proposed that can estimate learners' abilities while taking rater effects into account. Even so, measurement accuracy is reduced when few raters are assigned to each essay, a common situation in practical testing contexts. To address this problem, we propose a new rater-effect IRT model integrating a supervised topic model, which estimates abilities from both raters' scores and the textual content of written essays. By reflecting textual content features in IRT-based ability estimates, the model improves ability measurement accuracy when each essay has few raters. Furthermore, once the model parameters are known, learners' abilities can be estimated from essay text alone, without ratings. Finally, scores for unrated essays can be estimated from their textual content, so the model can also be used for automated essay scoring. We evaluate the effectiveness of the proposed model through experiments using actual data.
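To make the integration concrete, the following is a minimal sketch under assumed notation, not the paper's exact formulation. A rater-effect IRT component (here written in a generalized many-facet Rasch-style form) models the probability that rater r assigns score k to learner j's essay given ability \theta_j, and a supervised-LDA component ties the same \theta_j to the essay's mean topic assignments \bar{z}_j. The symbols \alpha_r, \beta_r, d_{rm}, \eta, and \sigma^2 are illustrative assumptions.

  P(x_{jr} = k \mid \theta_j)
    = \frac{\exp \sum_{m=1}^{k} \alpha_r (\theta_j - \beta_r - d_{rm})}
           {\sum_{l=1}^{K} \exp \sum_{m=1}^{l} \alpha_r (\theta_j - \beta_r - d_{rm})}
    \quad \text{(rater-effect IRT component)}

  \theta_j \mid \bar{z}_j \sim \mathcal{N}\!\left(\eta^{\top} \bar{z}_j,\ \sigma^2\right),
  \qquad \bar{z}_j = \frac{1}{N_j} \sum_{n=1}^{N_j} z_{jn}
    \quad \text{(supervised-LDA link, in the response form of Blei and McAuliffe's sLDA)}

Because \theta_j is shared between the two components, rater scores and word-level topic assignments jointly inform the ability estimate; for an essay with no ratings, the second line alone yields an ability estimate (and hence a predicted score) from text, which is how the automated-scoring use described in the abstract arises.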

Acknowledgment

This work was supported by JSPS KAKENHI Grant Numbers 17H04726 and 17K20024.

Author information

Corresponding author

Correspondence to Masaki Uto.

Copyright information

© 2019 Springer Nature Switzerland AG

About this paper

Cite this paper

Uto, M. (2019). Rater-Effect IRT Model Integrating Supervised LDA for Accurate Measurement of Essay Writing Ability. In: Isotani, S., Millán, E., Ogan, A., Hastings, P., McLaren, B., Luckin, R. (eds.) Artificial Intelligence in Education. AIED 2019. Lecture Notes in Computer Science (LNAI), vol. 11625. Springer, Cham. https://doi.org/10.1007/978-3-030-23204-7_41

  • DOI: https://doi.org/10.1007/978-3-030-23204-7_41

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-23203-0

  • Online ISBN: 978-3-030-23204-7
