Automated estimation of item difficulty for multiple-choice tests: An application of word embedding techniques
Introduction
This section is divided into two subsections. The first subsection introduces the background of the study. The second subsection describes the purpose and research questions of this study.
The widespread use of the Internet and the rapid development of information technology have strongly influenced assessment development. Recent advancements in educational assessment and evaluation reflect a shift away from conventional paper-based testing towards computer-based testing (CBT). Advances in psychometrics and the development of item response theory (IRT) (DeMars, 2010) have directed CBT's evolution into computer adaptive testing (CAT). CAT is advantageous in that it provides more efficient ability estimates using fewer items than conventional testing. To ensure that CAT achieves its assessment targets, item analysis is indispensable for determining item quality and identifying items that require modification or deletion before banking. Item difficulty is an index of primary importance in item analysis. Accordingly, accurately estimating item difficulty is of considerable importance in testing: obtaining individual item difficulty values allows the difficulty of an entire test, and the abilities of student test takers, to be estimated.
Item types are closely related to item difficulty, and the implementation of item difficulty estimation varies across item types. In this study, we explore the difficulty of multiple-choice items (MCIs), which consist of the following item elements: a question (stem) and typically three to five alternatives (or options), from which students must select. The alternatives comprise a correct option (answer) and several plausible but incorrect options known as distractors. A practical example of these item elements is shown in Fig. 1. This item type requires students to integrate stem information with their background knowledge to select the correct option. Reliable tests using MCIs are straightforward to implement, eliminate the need for manual scoring, and are well suited to learning diagnosis and achievement evaluation. Thus, MCI-based tests are widely used in education, professional certification, and licensure. These attributes and applications have led MCI tests to be considered one of the most effective and successful forms of educational assessment (Gierl, Bulut, Guo, & Zhang, 2017).
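Concretely, the item elements described above map onto a simple record structure. The following sketch is illustrative only: the class, its field names, and the example question are hypothetical and not drawn from the study's item bank.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class MultipleChoiceItem:
    """One MCI: a stem plus options, of which one is the answer
    and the rest are distractors."""
    stem: str
    options: List[str]   # all alternatives shown to the examinee
    answer_index: int    # position of the correct option

    @property
    def answer(self) -> str:
        return self.options[self.answer_index]

    @property
    def distractors(self) -> List[str]:
        return [o for i, o in enumerate(self.options) if i != self.answer_index]

# Hypothetical example item (not from the study's data)
item = MultipleChoiceItem(
    stem="Which river flows through the city of Taipei?",
    options=["Tamsui River", "Love River", "Zhuoshui River", "Xindian River"],
    answer_index=0,
)
```

Separating the answer from the distractors in this way makes it easy to compute features, such as answer-distractor semantic similarity, over the item elements.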
Before drafting a test, large-scale testing organizations typically set parameters, including item format, content range, and item difficulty configuration, according to test objectives. They then employ algorithms, such as linear programming or genetic algorithms, to select from an item bank the combination of items that best meets these criteria. Item difficulty is treated as categorical information, generally classified into five levels: very easy, easy, moderate, difficult, and very difficult. At present, the primary methods for estimating item difficulty are pretesting and subjective expert judgment (Attali, Saldivia, Jackson, Schuppan, & Wanamaker, 2014). Expert judgment does not require student participation; instead, estimates are obtained from experts' experience and intuitive judgments of item difficulty. The subjectivity of such judgments, however, makes the stability of the results difficult to evaluate. Alternatively, item difficulty can be assessed empirically by pretesting items before they are used in an examination: item responses are collected from representative subjects randomly sampled from the exam population, and their analysis yields difficulty estimates that inform item selection. Although pretesting can achieve highly accurate estimates of item difficulty, it is labor intensive and time consuming, and it must account for item exposure, particularly in the development of high-stakes tests (Loukina, Yoon, Sakano, Wei, & Sheehan, 2016). Thus, there is significant value in developing an automated procedure that can estimate item difficulty and ensure sufficient psychometric quality of test items.
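As a toy illustration of this assembly step, the sketch below greedily draws items from a bank until a target difficulty configuration is met. Real testing programs use linear programming or genetic algorithms with many more constraints, as noted above; every name and number here is hypothetical.

```python
import random
from collections import Counter

DIFFICULTY_LEVELS = ["very easy", "easy", "moderate", "difficult", "very difficult"]

def assemble_test(bank, target_counts, seed=0):
    """Greedily draw items from `bank` (a list of (item_id, level) pairs)
    until each difficulty level reaches its target count.
    Raises ValueError if the bank cannot satisfy the configuration."""
    rng = random.Random(seed)
    pool = bank[:]
    rng.shuffle(pool)                     # avoid always picking the same items
    selected, counts = [], Counter()
    for item_id, level in pool:
        if counts[level] < target_counts.get(level, 0):
            selected.append(item_id)
            counts[level] += 1
    if counts != Counter({k: v for k, v in target_counts.items() if v}):
        raise ValueError("item bank cannot satisfy the difficulty configuration")
    return selected

# Hypothetical bank: 50 items evenly spread over the five levels
bank = [(i, DIFFICULTY_LEVELS[i % 5]) for i in range(50)]
test = assemble_test(bank, {"easy": 3, "moderate": 4, "difficult": 3})
```

A production system would add further constraints (content range, item format, exposure control), which is why integer programming or genetic search is preferred over simple greedy selection.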
Although scholars have identified numerous variables that affect item difficulty, much room remains for improvement in its automated estimation, largely because of the complex interactions among these variables and the complicated relations between item difficulty and item demands (El Masri et al., 2017; Pollitt et al., 2007). Ferrara, Svetina, Skucha, and Davidson (2011) define item demands as the knowledge, comprehension, and cognitive processes required for examinees to answer an item correctly. To date, research on the automated estimation of item difficulty has focused on language proficiency tests, whereas content subjects, such as the social sciences, natural sciences, medicine, and law, have received less attention. However, content subject tests are widely used in educational assessment, certification, and licensure examinations. Studies estimating item difficulty in language proficiency tests usually employ tools that automatically analyze linguistic features (Sheehan, 2017) or rely on external word lists and electronic lexical databases such as WordNet (Ronzano, Anke, & Saggion, 2016) to extract difficulty-related characteristics. In content subject tests, however, examinees must apply the knowledge and materials learned in class to the stem in order to select the correct answer from the options. To estimate the difficulty of such items, a knowledge base must be constructed from the domain knowledge expected of examinees. This study therefore proposes extracting item difficulty attributes using automated semantic analysis techniques.
Several studies have explored the relationship between cognitive processing models and item difficulty (Embretson and Wetzel, 1987, Gorin and Embretson, 2006, Kirsch, 2001). Among the numerous variables cataloged in the previously discussed studies, we identified two variables as the key predictors of item difficulty to be examined in this study. The first variable is the type of information required by a stem to answer a question. The second variable is the semantic similarity between the answer and distractors. A more detailed explanation is provided in Section 3. Previous studies have demonstrated that the association between a stem and options affects item difficulty and quality (Abdulghani et al., 2014, Pho et al., 2015). If semantic similarity is used to measure this association, automatic tools can be employed to extract the attributes of item difficulty. Then, using the known difficulty of a training set, an estimation model can be designed based on the semantic features of item difficulty.
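As a minimal sketch of the second variable, the code below measures cosine similarity between an answer and each distractor using averaged word vectors. The four-dimensional embeddings are toy values for illustration only; in practice they would come from a word embedding model (e.g., word2vec) trained on a corpus covering the expected domain knowledge.

```python
import numpy as np

# Toy 4-dimensional "embeddings" (hypothetical values); in practice these
# would be looked up in a trained word embedding model.
EMBEDDINGS = {
    "river":    np.array([0.9, 0.1, 0.0, 0.2]),
    "tamsui":   np.array([0.8, 0.2, 0.1, 0.3]),
    "love":     np.array([0.1, 0.9, 0.2, 0.0]),
    "mountain": np.array([0.0, 0.1, 0.9, 0.4]),
}

def phrase_vector(phrase):
    """Average the vectors of the in-vocabulary words of a phrase."""
    vecs = [EMBEDDINGS[w] for w in phrase.lower().split() if w in EMBEDDINGS]
    return np.mean(vecs, axis=0)

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

answer = "tamsui river"
distractors = ["love river", "mountain"]
sims = [cosine(phrase_vector(answer), phrase_vector(d)) for d in distractors]
# Higher answer-distractor similarity indicates more plausible distractors,
# which prior studies associate with higher item difficulty.
```

Given the known difficulty labels of a training set, such similarity scores can then serve as input features to an estimation model.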
The purpose of this paper is to introduce a method for the automated estimation of MCI difficulty. This method addresses the issues of inefficiency, subjective judgment, and security challenges in test item development. We conducted an experiment on actual social studies test items and analyzed the collected data with respect to the following research questions:
- (1) Does item difficulty correlate with the semantic similarity between item elements?
- (2) Could the proposed approach complement and/or partially replace pretesting?
Our results demonstrate that item difficulty does correlate with the semantic similarity between item elements. They also indicate that the predictive performance of the proposed method is superior to that of a pretest program. We hope the proposed method provides a viable solution to the problems conventional approaches face in predicting item difficulty. Additionally, our findings enabled us to construct empirical descriptions of semantic similarity at different item difficulty levels. To the extent that these semantic similarity profiles hold across domains and languages, the method could be readily applied to both.
The rest of the paper is organized as follows: Section 2 provides related research on item difficulty estimation and semantic similarity measurement. Section 3 details the method and system architecture used to conduct the study. Section 4 describes our experimental data and setting. Section 5 analyzes the experimental results and discusses the findings. Section 6 concludes the paper and suggests future research directions.
Related work
The following literature review is divided into two subsections to elaborate on the rationale for this study. The first subsection discusses related work on item difficulty estimation. The second subsection focuses on related work on semantic similarity measurement techniques.
Method
This section describes our proposed method for the automated estimation of MCI difficulty in social studies tests. As mentioned in Section 1.2, we identified two variables as the key predictors of item difficulty. The first variable identifies the types of information (listed below) that may be required by a stem to answer a question.
- (1) Highly semantically related information: This is the easiest type of information to process. It includes the people, events, and
Experiments
This section is divided into three subsections. The first subsection describes the materials used in the study. The second subsection explains the experimental procedures. The third subsection introduces the evaluation method in detail.
Results and discussion
This section is divided into two subsections. The first subsection describes the results of two experiments in this study. The second subsection analyzes the experimental results and discusses the findings.
Conclusions and future research
The findings of our experiments are consistent with those of the empirical studies discussed previously, and they confirm prior evidence of the relationship between cognitive processing models and item difficulty. These findings lead us to believe that the proposed approach can be applied to not only social studies but also to other content subjects such as natural sciences and medicine and to other item types such as fill-in-the-blanks, matching, and short answers. Furthermore, it has been
Acknowledgments
The research was supported by the Ministry of Science and Technology, Taiwan, under grant MOST 105-2221-E-011-085-MY3, and was also financially supported by the "Institute for Research Excellence in Learning Sciences" of National Taiwan Normal University (NTNU) from The Featured Areas Research Center Program within the framework of the Higher Education Sprout Project by the Ministry of Education (MOE) in Taiwan.
References (54)
- Paraphrase identification and semantic text similarity analysis in Arabic news tweets using lexical, syntactic, and semantic features. Information Processing & Management (2017).
- A prospect-guided global query expansion strategy using word embeddings. Information Processing & Management (2018).
- Continuous space models for CLIR. Information Processing & Management (2017).
- Wikipedia-based information content and semantic similarity computation. Information Processing & Management (2017).
- Distance measures in author profiling. Information Processing & Management (2017).
- Reexamining the relationship between test anxiety and learning achievement: An individual-differences perspective. Contemporary Educational Psychology, 46 (2016).
- Disambiguating context-dependent polarity of words: An information retrieval approach. Information Processing & Management (2017).
- Association of catechol-O-methyltransferase (COMT) polymorphism and academic achievement in a Chinese cohort. Brain and Cognition (2009).
- The relationship between non-functioning distractors and item difficulty of multiple choice questions: A descriptive analysis. Journal of Health Specialties (2014).
- Estimating item difficulty with comparative judgments. ETS Research Report Series (2014).
- Don't count, predict! A systematic comparison of context-counting vs. context-predicting semantic vectors.
- A neural probabilistic language model. Journal of Machine Learning Research.
- The SAGE handbook of applied social research methods.
- GRE analytical reasoning item statistics prediction study. ETS Research Report Series.
- Using a neural net to predict item difficulty. ETS Research Report Series.
- Basic principles of the Rasch model.
- Applying the Rasch model: Fundamental measurement in the human sciences.
- LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology (TIST).
- Age of acquisition effects in reading Chinese: Evidence in favour of the arbitrary mapping hypothesis. British Journal of Psychology.
- An unsupervised automated essay scoring system. IEEE Intelligent Systems.
- A comparison of methods for multiclass support vector machines. IEEE Transactions on Neural Networks.
- Examining differential item functions of different item ordered test forms according to item difficulty levels. Educational Sciences: Theory & Practice.
- Support-vector networks. Machine Learning.
- Item response theory.
- Predicting item difficulty of science national curriculum tests: The case of key stage 2 assessments. The Curriculum Journal.
- Component latent trait models for paragraph comprehension tests. Applied Psychological Measurement.
- Test development with performance standards and achievement growth in mind. Educational Measurement: Issues and Practice.
- The prediction of SAT reading comprehension item difficulty for expository prose passages (RR-91-29). Retrieved from Princeton.