Rater-Effect IRT Model Integrating Supervised LDA for Accurate Measurement of Essay Writing Ability

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 11625)

Abstract

Essay-writing tests are widely used in various assessment contexts to measure higher-order abilities of learners. However, a persistent difficulty is that ability measurement accuracy strongly depends on rater characteristics. To resolve this problem, many item response theory (IRT) models have been proposed that can estimate learners' abilities while taking rater effects into account. Even so, measurement accuracy is reduced when few raters are assigned to each essay, a common situation in practical testing contexts. To address this problem, we propose a new rater-effect IRT model integrating a supervised topic model, which estimates abilities from both raters' scores and the textual content of written essays. By reflecting textual content features in IRT-based ability estimates, the model improves ability measurement accuracy when each essay has few raters. Furthermore, once the model parameters are known, learners' abilities can be estimated from essay text alone, without ratings. Finally, scores for unrated essays can be estimated from their textual content, so the model can also be used for automated essay scoring. We evaluate the effectiveness of the proposed model through experiments using actual data.
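To make the integration concrete, the following is a minimal sketch under assumed notation, not the paper's exact formulation. A rater-effect IRT component (here written in a generalized many-facet Rasch-style form) models the probability that rater r assigns score k to learner j's essay given ability \theta_j, and a supervised-LDA component ties the same \theta_j to the essay's mean topic assignments \bar{z}_j. The symbols \alpha_r, \beta_r, d_{rm}, \eta, and \sigma^2 are illustrative assumptions.

  P(x_{jr} = k \mid \theta_j)
    = \frac{\exp \sum_{m=1}^{k} \alpha_r (\theta_j - \beta_r - d_{rm})}
           {\sum_{l=1}^{K} \exp \sum_{m=1}^{l} \alpha_r (\theta_j - \beta_r - d_{rm})}
    \quad \text{(rater-effect IRT component)}

  \theta_j \mid \bar{z}_j \sim \mathcal{N}\!\left(\eta^{\top} \bar{z}_j,\ \sigma^2\right),
  \qquad \bar{z}_j = \frac{1}{N_j} \sum_{n=1}^{N_j} z_{jn}
    \quad \text{(supervised-LDA link, in the response form of Blei and McAuliffe's sLDA)}

Because \theta_j is shared between the two components, rater scores and word-level topic assignments jointly inform the ability estimate; for an essay with no ratings, the second line alone yields an ability estimate (and hence a predicted score) from text, which is how the automated-scoring use described in the abstract arises.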

Acknowledgment

This work was supported by JSPS KAKENHI Grant Numbers 17H04726 and 17K20024.

Author information

Corresponding author

Correspondence to Masaki Uto.

Copyright information

© 2019 Springer Nature Switzerland AG

About this paper

Cite this paper

Uto, M. (2019). Rater-Effect IRT Model Integrating Supervised LDA for Accurate Measurement of Essay Writing Ability. In: Isotani, S., Millán, E., Ogan, A., Hastings, P., McLaren, B., Luckin, R. (eds.) Artificial Intelligence in Education. AIED 2019. Lecture Notes in Computer Science (LNAI), vol. 11625. Springer, Cham. https://doi.org/10.1007/978-3-030-23204-7_41

  • DOI: https://doi.org/10.1007/978-3-030-23204-7_41

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-23203-0

  • Online ISBN: 978-3-030-23204-7
