Automated estimation of item difficulty for multiple-choice tests: An application of word embedding techniques

https://doi.org/10.1016/j.ipm.2018.06.007

Abstract

Pretesting is the most commonly used method for estimating test item difficulty because it provides highly accurate results that can be applied to assessment development activities. However, pretesting is inefficient, and it can lead to item exposure. Hence, an increasing number of studies have invested considerable effort in researching the automated estimation of item difficulty. Language proficiency tests constitute the majority of researched test topics, while comparatively less research has focused on content subjects. This paper introduces a novel method for the automated estimation of item difficulty for social studies tests. In this study, we explore the difficulty of multiple-choice items, which consist of the following item elements: a question and alternative options. We use learning materials to construct a semantic space using word embedding techniques and project an item's texts into the semantic space to obtain corresponding vectors. Semantic features are obtained by calculating the cosine similarity between the vectors of item elements. Subsequently, these semantic features are sent to a classifier for training and testing. Based on the output of the classifier, an estimation model is created and item difficulty is estimated. Our findings suggest that the semantic similarity between a stem and the options has the strongest impact on item difficulty. Furthermore, the results indicate that the proposed estimation method outperforms pretesting, and therefore, we expect that the proposed approach will complement and partially replace pretesting in the future.

Introduction

This section is divided into two subsections. The first subsection introduces the background of the study. The second subsection describes the purpose and research questions of this study.

The widespread use of the Internet and the rapid development of information technology have exerted a strong influence on assessment development. Recent advancements in educational assessment and evaluation techniques reflect a movement away from conventional paper-based testing toward computer-based testing (CBT). Advances in psychometrics and the development of item response theory (IRT) (DeMars, 2010) have directed CBT's evolution into computer adaptive testing (CAT). CAT is advantageous in that it provides more efficient estimates of ability using fewer items than conventional testing. To ensure that CAT achieves its assessment targets, item analysis is indispensable for determining item quality, which is used to identify items that require modification or deletion before banking. Item difficulty is an index of primary importance in item analysis. Accordingly, accurately estimating item difficulty is of considerable importance in testing, as obtaining individual item difficulty values allows for the estimation of the difficulty of an entire test and the abilities of student test takers.

Item types are closely related to item difficulty, even though the implementation of item difficulty estimation varies across item types. In this study, we explore the difficulty of multiple-choice items (MCIs), which consist of the following item elements: a question (stem) and typically three to five alternatives (or options) from which students must select. The alternatives include a correct option (the answer) and a few plausible but incorrect options known as distractors. A practical example of item elements is shown in Fig. 1. This type of item requires students to integrate stem information with their background knowledge to select the correct option. Reliable tests using MCIs are straightforward to implement, eliminate the need for manual scoring, and are well suited for learning diagnosis and achievement evaluation. Thus, MCI-based tests are widely used in education, professional certification, and licensure. These attributes and applications have led MCI tests to be considered one of the most effective and successful forms of educational assessment (Gierl, Bulut, Guo, & Zhang, 2017).
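
For illustration, the item elements described above can be represented as a simple data structure. The following minimal Python sketch is ours, not the authors'; the field names and the sample geography item are illustrative assumptions.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class MultipleChoiceItem:
    """A multiple-choice item (MCI): a stem plus its alternatives."""
    stem: str           # the question text
    options: List[str]  # all alternatives presented to the student
    answer_index: int   # position of the correct option within `options`

    @property
    def answer(self) -> str:
        """The correct option."""
        return self.options[self.answer_index]

    @property
    def distractors(self) -> List[str]:
        """The plausible but incorrect options."""
        return [o for i, o in enumerate(self.options) if i != self.answer_index]


# A hypothetical four-option social studies item.
item = MultipleChoiceItem(
    stem="Which ocean lies off the east coast of Taiwan?",
    options=["The Pacific Ocean", "The Atlantic Ocean",
             "The Indian Ocean", "The Arctic Ocean"],
    answer_index=0,
)
```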

Before drafting a test, large-scale testing organizations typically set parameters, including item format, content range, and item difficulty configuration, according to test objectives. They then employ algorithms, such as linear programming or genetic algorithms, to select the combination of test items from an item bank that best meets the criteria established by these parameters. Item difficulty is presented as categorical information and is generally classified into the following five levels: very easy, easy, moderate, difficult, and very difficult. At present, the primary methods for estimating item difficulty are pretesting and subjective expert judgment (Attali, Saldivia, Jackson, Schuppan, & Wanamaker, 2014). Subjective expert judgment of item difficulty does not require students; instead, estimates are obtained from experts' experience and intuitive judgments of item difficulty. The stability of such results is difficult to evaluate owing to the subjectivity of expert judgment. Alternatively, item difficulty can be assessed empirically by pretesting items before employing them in an examination. Pretesting informs the item selection process based on item difficulty, which is obtained by analyzing the item responses collected from representative subjects randomly sampled from the exam population. Even though this process can achieve highly accurate estimates of item difficulty, it is relatively labor intensive and time consuming and must consider item exposure, particularly in the development of high-stakes tests (Loukina, Yoon, Sakano, Wei, & Sheehan, 2016). Thus, there is significant value in developing an automated procedure that can evaluate item difficulty and ensure sufficient psychometric quality of test items.
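
The blueprint-driven selection described above can be sketched in simplified form. The deliberately naive greedy routine below only illustrates filling a per-level difficulty quota; it stands in for, and does not reproduce, the linear-programming or genetic-algorithm methods mentioned in the text, and the blueprint values are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Tuple

DIFFICULTY_LEVELS = ["very easy", "easy", "moderate", "difficult", "very difficult"]


def assemble_test(bank: List[Tuple[str, str]], blueprint: Dict[str, int]) -> List[str]:
    """Greedily pick item IDs from (item_id, difficulty_level) pairs until the
    requested number of items per difficulty level is filled."""
    remaining = Counter(blueprint)
    selected = []
    for item_id, level in bank:
        if remaining[level] > 0:
            selected.append(item_id)
            remaining[level] -= 1
    if +remaining:  # unary + keeps only levels with unmet quotas
        raise ValueError(f"Item bank cannot satisfy the blueprint: {dict(+remaining)}")
    return selected


# Hypothetical blueprint for a 10-item test weighted toward moderate items.
blueprint = {"very easy": 1, "easy": 2, "moderate": 4, "difficult": 2, "very difficult": 1}
```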

Even though scholars have already identified numerous variables that affect item difficulty, much room for improvement remains in the automated estimation of item difficulty. This is largely attributable to the complex interactions among these variables and the complicated relations between item difficulty and item demands (El Masri et al., 2017; Pollitt et al., 2007). Ferrara, Svetina, Skucha, and Davidson (2011) define item demands as the knowledge, comprehension, and cognitive processes required for examinees to correctly answer an item. To date, research on the automated estimation of item difficulty has focused on language proficiency tests, while studies on content subjects, such as social sciences, natural sciences, medicine, and law, have received less attention. However, content subject tests are widely used in educational assessment, certification, and licensure examinations. Studies estimating the difficulty of items in language proficiency tests usually employ tools to automatically analyze linguistic features (Sheehan, 2017) or rely on external word lists and electronic lexical databases such as WordNet (Ronzano, Anke, & Saggion, 2016) to extract item difficulty characteristics that direct the automated estimation of item difficulty. In content subject tests, however, examinees must apply the knowledge and materials learned in class to a stem to select the correct answer from multiple choices. To estimate the difficulty of such items, a knowledge base must be constructed based on the expected domain knowledge of examinees. This study proposes the extraction of item difficulty attributes using automated semantic analysis techniques.
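
The knowledge base discussed above is, in the authors' approach, a semantic space built from learning materials with word embedding techniques. The sketch below shows one plausible way to construct such a space using gensim's Word2Vec; the file names, the skip-gram hyperparameters, and the assumption that the texts are already tokenized (e.g., Chinese word segmentation applied beforehand) are ours.

```python
# Build a domain "semantic space" from learning materials.
# Word2Vec here stands in for the word embedding technique in general.
from gensim.models import Word2Vec


def load_corpus(paths):
    """Yield tokenized sentences from pre-segmented, plain-text learning materials."""
    for path in paths:
        with open(path, encoding="utf-8") as f:
            for line in f:
                tokens = line.strip().split()
                if tokens:
                    yield tokens


# Hypothetical textbook files; one segmented sentence per line.
sentences = list(load_corpus(["textbook_grade7.txt", "textbook_grade8.txt"]))

# Skip-gram embeddings; dimensionality, window, and epochs are illustrative settings.
semantic_space = Word2Vec(
    sentences, vector_size=300, window=5, min_count=2, sg=1, epochs=10
)
semantic_space.save("social_studies_w2v.model")
```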

Several studies have explored the relationship between cognitive processing models and item difficulty (Embretson and Wetzel, 1987, Gorin and Embretson, 2006, Kirsch, 2001). Among the numerous variables cataloged in the previously discussed studies, we identified two variables as the key predictors of item difficulty to be examined in this study. The first variable is the type of information required by a stem to answer a question. The second variable is the semantic similarity between the answer and distractors. A more detailed explanation is provided in Section 3. Previous studies have demonstrated that the association between a stem and options affects item difficulty and quality (Abdulghani et al., 2014, Pho et al., 2015). If semantic similarity is used to measure this association, automatic tools can be employed to extract the attributes of item difficulty. Then, using the known difficulty of a training set, an estimation model can be designed based on the semantic features of item difficulty.
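
To make the feature-extraction and modeling step concrete, the following sketch projects the item elements into the semantic space (by averaging word vectors), computes cosine similarities between the stem, the answer, and the distractors, and feeds those similarities to a classifier. The averaging scheme, the exact feature set, and the use of scikit-learn's SVC are our assumptions; the reference list (Cortes & Vapnik, 1995; Chang & Lin, 2011) points to a support-vector classifier but not to this precise configuration. The sketch reuses `MultipleChoiceItem` and `semantic_space` from the earlier snippets.

```python
import numpy as np
from sklearn.svm import SVC


def text_vector(tokens, w2v):
    """Average the embedding vectors of the tokens present in the model."""
    vecs = [w2v.wv[t] for t in tokens if t in w2v.wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(w2v.vector_size)


def cosine(u, v):
    """Cosine similarity, with a zero-vector guard."""
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    return float(u @ v / denom) if denom else 0.0


def item_features(item, w2v):
    """Semantic-similarity features: stem-answer, mean stem-distractor,
    and mean answer-distractor cosine similarities."""
    stem = text_vector(item.stem.split(), w2v)
    ans = text_vector(item.answer.split(), w2v)
    dists = [text_vector(d.split(), w2v) for d in item.distractors]
    return [
        cosine(stem, ans),
        float(np.mean([cosine(stem, d) for d in dists])),
        float(np.mean([cosine(ans, d) for d in dists])),
    ]


# Training uses pretested items with known difficulty levels (X: feature rows,
# y: difficulty labels); new items are then classified from their features alone.
# clf = SVC(kernel="rbf").fit(X, y)
# level = clf.predict([item_features(new_item, semantic_space)])
```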

The purpose of this paper is to introduce a method for the automated estimation of MCI difficulty. This method addresses the inefficiency, subjective judgment, and security challenges that arise in test item development. We conducted an experiment on actual social studies test items and analyzed the collected data with respect to the following research questions:

  • (1)

    Does item difficulty correlate with the semantic similarity between item elements?

  • (2)

    Could the proposed approach complement and/or partially replace pretesting?

Our results demonstrate that item difficulty does correlate with the semantic similarity between item elements. In addition, they indicate that the predictive performance of the proposed method is superior to that of a pretest program. We hope that the proposed method will provide a viable solution to the problems faced by conventional approaches to predicting item difficulty. Additionally, our findings have enabled us to construct empirical descriptions of semantic similarity at different item difficulty levels. Because the semantic similarities for different levels of item difficulty would be the same across domains and languages, this method could be readily applied to other languages and domains.

The rest of the paper is organized as follows: Section 2 provides related research on item difficulty estimation and semantic similarity measurement. Section 3 details the method and system architecture used to conduct the study. Section 4 describes our experimental data and setting. Section 5 analyzes the experimental results and discusses the findings. Section 6 concludes the paper and suggests future research directions.

Section snippets

Related work

The following literature review is divided into two subsections to elaborate on the rationale for this study. The first subsection discusses related work on item difficulty estimation. The second subsection focuses on related work on semantic similarity measurement techniques.

Method

This section describes our proposed method for the automated estimation of MCI difficulty in social studies tests. As mentioned in Section 1.2, we identified two variables as the key predictors of item difficulty. The first variable identifies the types of information (provided below) that may be required by a stem to answer a question.

  • (1)

    Highly semantically related information: This is the easiest type of information to process. It includes the people, events, and

Experiments

This section is divided into three subsections. The first subsection describes the materials used in the study. The second subsection explains the experimental procedures. The third subsection introduces the evaluation method in detail.

Results and discussion

This section is divided into two subsections. The first subsection describes the results of two experiments in this study. The second subsection analyzes the experimental results and discusses the findings.

Conclusions and future research

The findings of our experiments are consistent with those of the empirical studies discussed previously, and they confirm prior evidence of the relationship between cognitive processing models and item difficulty. These findings lead us to believe that the proposed approach can be applied to not only social studies but also to other content subjects such as natural sciences and medicine and to other item types such as fill-in-the-blanks, matching, and short answers. Furthermore, it has been

Acknowledgments

The research was supported by the Ministry of Science and Technology, Taiwan, under grant MOST 105-2221-E-011-085-MY3, and was also financially supported by the “Institute for Research Excellence in Learning Sciences” of National Taiwan Normal University (NTNU) from The Featured Areas Research Center Program within the framework of the Higher Education Sprout Project by the Ministry of Education (MOE) in Taiwan.

References (54)

  • M. Baroni et al.

    Don't count, predict! A systematic comparison of context-counting vs. context-predicting semantic vectors

  • Y. Bengio et al.

    A neural probabilistic language model

    Journal of Machine Learning Research

    (2003)
  • L. Bickman et al.

    The SAGE handbook of applied social research methods

    (2009)
  • R.F. Boldt

    GRE analytical reasoning item statistics prediction study

    ETS Research Report Series

    (1998)
  • R.F. Boldt et al.

    Using a neural net to predict item difficulty

    ETS Research Report Series

    (1996)
  • T.G. Bond et al.

    Basic principles of the Rasch model

    Applying the Rasch model: fundamental measurement in the human sciences

    (2015)
  • C.C. Chang et al.

    LIBSVM: A library for support vector machines

    ACM Transactions on Intelligent Systems and Technology (TIST)

    (2011)
  • B.G. Chen et al.

    Age of acquisition effects in reading Chinese: Evidence in favour of the arbitrary mapping hypothesis

    British Journal of Psychology

    (2007)
  • Y.Y. Chen et al.

    An unsupervised automated essay scoring system

    IEEE Intelligent Systems

    (2010)
  • C.W. Hsu et al.

    A comparison of methods for multiclass support vector machines

    IEEE Transactions on Neural Networks

    (2002)
  • O. Cokluk et al.

    Examining differential item functions of different item ordered test forms according to item difficulty levels

    Educational Sciences: Theory & Practice

    (2016)
  • C. Cortes et al.

    Support-vector networks

    Machine Learning

    (1995)
  • C. "DeMars

    Item response theory

    (2010)
  • El Masri et al.

    Predicting item difficulty of science national curriculum tests: The case of key stage 2 assessments

    The Curriculum Journal

    (2017)
  • S.E. Embretson et al.

    Component latent trait models for paragraph comprehension tests

    Applied Psychological Measurement

    (1987)
  • S. Ferrara et al.

    Test development with performance standards and achievement growth in mind

    Educational Measurement: Issues and Practice

    (2011)
  • R. Freedle et al.

    The prediction of SAT reading comprehension item difficulty for expository prose passages (RR-91-29)

    Retrieved from Princeton

    (1991)