Automated estimation of item difficulty for multiple-choice tests: An application of word embedding techniques

https://doi.org/10.1016/j.ipm.2018.06.007

Abstract

Pretesting is the most commonly used method for estimating test item difficulty because it provides highly accurate results that can be applied to assessment development activities. However, pretesting is inefficient, and it can lead to item exposure. Hence, an increasing number of studies have invested considerable effort in researching the automated estimation of item difficulty. Language proficiency tests constitute the majority of researched test topics, while comparatively less research has focused on content subjects. This paper introduces a novel method for the automated estimation of item difficulty for social studies tests. In this study, we explore the difficulty of multiple-choice items, which consist of the following item elements: a question and alternative options. We use learning materials to construct a semantic space using word embedding techniques and project an item's texts into the semantic space to obtain corresponding vectors. Semantic features are obtained by calculating the cosine similarity between the vectors of item elements. Subsequently, these semantic features are sent to a classifier for training and testing. Based on the output of the classifier, an estimation model is created and item difficulty is estimated. Our findings suggest that the semantic similarity between a stem and the options has the strongest impact on item difficulty. Furthermore, the results indicate that the proposed estimation method outperforms pretesting, and therefore, we expect that the proposed approach will complement and partially replace pretesting in the future.

Introduction

This section is divided into two subsections. The first subsection introduces the background of the study. The second subsection describes the purpose and research questions of this study.

The widespread use of the Internet and the rapid development of information technology have exerted a strong influence on assessment development. Recent advancements in educational assessment and evaluation techniques reflect a movement away from conventional paper-based testing toward computer-based testing (CBT). Advances in psychometrics and the development of item response theory (IRT) (DeMars, 2010) have directed CBT's evolution into computer adaptive testing (CAT). CAT is advantageous in that it provides more efficient estimates of ability using fewer items than conventional testing. To ensure that CAT achieves its assessment targets, item analysis is indispensable for determining item quality, which is used to identify items that require modification or deletion before banking. Item difficulty is an index of primary importance in item analysis. Accordingly, accurately estimating item difficulty is of considerable importance in testing, as obtaining individual item difficulty values allows for the estimation of the difficulty of an entire test and the abilities of student test takers.

Item types are closely related to item difficulty, even though the implementation of item difficulty estimation varies across item types. In this study, we explore the difficulty of multiple-choice items (MCIs), which consist of the following item elements: a question (stem) and typically three to five alternatives (or options) from which students must select. The alternatives include a correct option (the answer) and a few plausible but incorrect options known as distractors. A practical example of item elements is shown in Fig. 1. This type of item requires students to integrate stem information with their background knowledge to select the correct option. Reliable tests using MCIs are straightforward to implement, eliminate the need for manual scoring, and are well suited for learning diagnosis and achievement evaluation. Thus, MCI-based tests are widely used in education, professional certification, and licensure. These attributes and applications have led MCI tests to be considered one of the most effective and successful forms of educational assessment (Gierl, Bulut, Guo, & Zhang, 2017).
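
For illustration, the item elements described above can be represented as a simple data structure. The following minimal Python sketch is ours, not the authors'; the field names and the sample geography item are illustrative assumptions.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class MultipleChoiceItem:
    """A multiple-choice item (MCI): a stem plus its alternatives."""
    stem: str           # the question text
    options: List[str]  # all alternatives presented to the student
    answer_index: int   # position of the correct option within `options`

    @property
    def answer(self) -> str:
        """The correct option."""
        return self.options[self.answer_index]

    @property
    def distractors(self) -> List[str]:
        """The plausible but incorrect options."""
        return [o for i, o in enumerate(self.options) if i != self.answer_index]


# A hypothetical four-option social studies item.
item = MultipleChoiceItem(
    stem="Which ocean lies off the east coast of Taiwan?",
    options=["The Pacific Ocean", "The Atlantic Ocean",
             "The Indian Ocean", "The Arctic Ocean"],
    answer_index=0,
)
```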

Before drafting a test, large-scale testing organizations typically set parameters, including item format, content range, and item difficulty configuration, according to test objectives. They then employ algorithms, such as linear programming or genetic algorithms, to select the combination of test items from an item bank that best meets the criteria established by these parameters. Item difficulty is presented as categorical information and is generally classified into the following five levels: very easy, easy, moderate, difficult, and very difficult. At present, the primary methods for estimating item difficulty are pretesting and subjective expert judgment (Attali, Saldivia, Jackson, Schuppan, & Wanamaker, 2014). Subjective expert judgment of item difficulty does not require students; instead, estimates are obtained from experts' experience and intuitive judgments of item difficulty. The stability of such results is difficult to evaluate owing to the subjectivity of expert judgment. Alternatively, item difficulty can be assessed empirically by pretesting items before employing them in an examination. Pretesting informs the item selection process based on item difficulty, which is obtained by analyzing the item responses collected from representative subjects randomly sampled from the exam population. Even though this process can achieve highly accurate estimates of item difficulty, it is relatively labor intensive and time consuming and must consider item exposure, particularly in the development of high-stakes tests (Loukina, Yoon, Sakano, Wei, & Sheehan, 2016). Thus, there is significant value in developing an automated procedure that can evaluate item difficulty and ensure sufficient psychometric quality of test items.
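
The blueprint-driven selection described above can be sketched in simplified form. The deliberately naive greedy routine below only illustrates filling a per-level difficulty quota; it stands in for, and does not reproduce, the linear-programming or genetic-algorithm methods mentioned in the text, and the blueprint values are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Tuple

DIFFICULTY_LEVELS = ["very easy", "easy", "moderate", "difficult", "very difficult"]


def assemble_test(bank: List[Tuple[str, str]], blueprint: Dict[str, int]) -> List[str]:
    """Greedily pick item IDs from (item_id, difficulty_level) pairs until the
    requested number of items per difficulty level is filled."""
    remaining = Counter(blueprint)
    selected = []
    for item_id, level in bank:
        if remaining[level] > 0:
            selected.append(item_id)
            remaining[level] -= 1
    if +remaining:  # unary + keeps only levels with unmet quotas
        raise ValueError(f"Item bank cannot satisfy the blueprint: {dict(+remaining)}")
    return selected


# Hypothetical blueprint for a 10-item test weighted toward moderate items.
blueprint = {"very easy": 1, "easy": 2, "moderate": 4, "difficult": 2, "very difficult": 1}
```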

Even though scholars have already identified numerous variables that affect item difficulty, much room for improvement remains in the automated estimation of item difficulty. This is largely attributable to the complex interactions among these variables and the complicated relations between item difficulty and item demands (El Masri et al., 2017; Pollitt et al., 2007). Ferrara, Svetina, Skucha, and Davidson (2011) define item demands as the knowledge, comprehension, and cognitive processes required for examinees to correctly answer an item. To date, research on the automated estimation of item difficulty has focused on language proficiency tests, while studies on content subjects, such as social sciences, natural sciences, medicine, and law, have received less attention. However, content subject tests are widely used in educational assessment, certification, and licensure examinations. Studies estimating the difficulty of items in language proficiency tests usually employ tools to automatically analyze linguistic features (Sheehan, 2017) or rely on external word lists and electronic lexical databases such as WordNet (Ronzano, Anke, & Saggion, 2016) to extract item difficulty characteristics that direct the automated estimation of item difficulty. In content subject tests, however, examinees must apply the knowledge and materials learned in class to a stem to select the correct answer from multiple choices. To estimate the difficulty of such items, a knowledge base must be constructed based on the expected domain knowledge of examinees. This study proposes the extraction of item difficulty attributes using automated semantic analysis techniques.
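
The knowledge base discussed above is, in the authors' approach, a semantic space built from learning materials with word embedding techniques. The sketch below shows one plausible way to construct such a space using gensim's Word2Vec; the file names, the skip-gram hyperparameters, and the assumption that the texts are already tokenized (e.g., Chinese word segmentation applied beforehand) are ours.

```python
# Build a domain "semantic space" from learning materials.
# Word2Vec here stands in for the word embedding technique in general.
from gensim.models import Word2Vec


def load_corpus(paths):
    """Yield tokenized sentences from pre-segmented, plain-text learning materials."""
    for path in paths:
        with open(path, encoding="utf-8") as f:
            for line in f:
                tokens = line.strip().split()
                if tokens:
                    yield tokens


# Hypothetical textbook files; one segmented sentence per line.
sentences = list(load_corpus(["textbook_grade7.txt", "textbook_grade8.txt"]))

# Skip-gram embeddings; dimensionality, window, and epochs are illustrative settings.
semantic_space = Word2Vec(
    sentences, vector_size=300, window=5, min_count=2, sg=1, epochs=10
)
semantic_space.save("social_studies_w2v.model")
```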

Several studies have explored the relationship between cognitive processing models and item difficulty (Embretson and Wetzel, 1987, Gorin and Embretson, 2006, Kirsch, 2001). Among the numerous variables cataloged in the previously discussed studies, we identified two variables as the key predictors of item difficulty to be examined in this study. The first variable is the type of information required by a stem to answer a question. The second variable is the semantic similarity between the answer and distractors. A more detailed explanation is provided in Section 3. Previous studies have demonstrated that the association between a stem and options affects item difficulty and quality (Abdulghani et al., 2014, Pho et al., 2015). If semantic similarity is used to measure this association, automatic tools can be employed to extract the attributes of item difficulty. Then, using the known difficulty of a training set, an estimation model can be designed based on the semantic features of item difficulty.
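
To make the feature-extraction and modeling step concrete, the following sketch projects the item elements into the semantic space (by averaging word vectors), computes cosine similarities between the stem, the answer, and the distractors, and feeds those similarities to a classifier. The averaging scheme, the exact feature set, and the use of scikit-learn's SVC are our assumptions; the reference list (Cortes & Vapnik, 1995; Chang & Lin, 2011) points to a support-vector classifier but not to this precise configuration. The sketch reuses `MultipleChoiceItem` and `semantic_space` from the earlier snippets.

```python
import numpy as np
from sklearn.svm import SVC


def text_vector(tokens, w2v):
    """Average the embedding vectors of the tokens present in the model."""
    vecs = [w2v.wv[t] for t in tokens if t in w2v.wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(w2v.vector_size)


def cosine(u, v):
    """Cosine similarity, with a zero-vector guard."""
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    return float(u @ v / denom) if denom else 0.0


def item_features(item, w2v):
    """Semantic-similarity features: stem-answer, mean stem-distractor,
    and mean answer-distractor cosine similarities."""
    stem = text_vector(item.stem.split(), w2v)
    ans = text_vector(item.answer.split(), w2v)
    dists = [text_vector(d.split(), w2v) for d in item.distractors]
    return [
        cosine(stem, ans),
        float(np.mean([cosine(stem, d) for d in dists])),
        float(np.mean([cosine(ans, d) for d in dists])),
    ]


# Training uses pretested items with known difficulty levels (X: feature rows,
# y: difficulty labels); new items are then classified from their features alone.
# clf = SVC(kernel="rbf").fit(X, y)
# level = clf.predict([item_features(new_item, semantic_space)])
```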

The purpose of this paper is to introduce a method for the automated estimation of MCI difficulty. This method addresses the inefficiency, subjective judgment, and security challenges that arise in test item development. We conducted an experiment on actual social studies test items and analyzed the collected data with respect to the following research questions:

  • (1)

    Does item difficulty correlate with the semantic similarity between item elements?

  • (2)

    Could the proposed approach complement and/or partially replace pretesting?

Our results demonstrate that item difficulty does correlate with the semantic similarity between item elements. In addition, they indicate that the predictive performance of the proposed method is superior to that of a pretest program. We hope that the proposed method will provide a viable solution to the problems faced by conventional approaches to predicting item difficulty. Additionally, our findings have enabled us to construct empirical descriptions of semantic similarity at different item difficulty levels. Because the semantic similarities for different levels of item difficulty would be the same across domains and languages, this method could be readily applied to other languages and domains.

The rest of the paper is organized as follows: Section 2 provides related research on item difficulty estimation and semantic similarity measurement. Section 3 details the method and system architecture used to conduct the study. Section 4 describes our experimental data and setting. Section 5 analyzes the experimental results and discusses the findings. Section 6 concludes the paper and suggests future research directions.

Section snippets

Related work

The following literature review is divided into two subsections to elaborate on the rationale for this study. The first subsection discusses related work on item difficulty estimation. The second subsection focuses on related work on semantic similarity measurement techniques.

Method

This section describes our proposed method for the automated estimation of MCI difficulty in social studies tests. As mentioned in Section 1.2, we identified two variables as the key predictors of item difficulty. The first variable identifies the types of information (provided below) that may be required by a stem to answer a question.

  • (1)

    Highly semantically related information: This is the easiest type of information to process. It includes the people, events, and

Experiments

This section is divided into three subsections. The first subsection describes the materials used in the study. The second subsection explains the experimental procedures. The third subsection introduces the evaluation method in detail.

Results and discussion

This section is divided into two subsections. The first subsection describes the results of two experiments in this study. The second subsection analyzes the experimental results and discusses the findings.

Conclusions and future research

The findings of our experiments are consistent with those of the empirical studies discussed previously, and they confirm prior evidence of the relationship between cognitive processing models and item difficulty. These findings lead us to believe that the proposed approach can be applied to not only social studies but also to other content subjects such as natural sciences and medicine and to other item types such as fill-in-the-blanks, matching, and short answers. Furthermore, it has been

Acknowledgments

The research was supported by the Ministry of Science and Technology, Taiwan, under grant MOST 105-2221-E-011-085-MY3, and was also financially supported by the “Institute for Research Excellence in Learning Sciences” of National Taiwan Normal University (NTNU) from The Featured Areas Research Center Program within the framework of the Higher Education Sprout Project by the Ministry of Education (MOE) in Taiwan.

References (54)

  • M. Baroni et al.

    Don't count, predict! A systematic comparison of context-counting vs. context-predicting semantic vectors

  • Y. Bengio et al.

    A neural probabilistic language model

    Journal of Machine Learning Research

    (2003)
  • L. Bickman et al.

    The SAGE handbook of applied social research methods

    (2009)
  • R.F. Boldt

    GRE analytical reasoning item statistics prediction study

    ETS Research Report Series

    (1998)
  • R.F. Boldt et al.

    Using a neural net to predict item difficulty

    ETS Research Report Series

    (1996)
  • T.G. Bond et al.

    Basic principles of the Rasch model

    Applying the Rasch model: fundamental measurement in the human sciences

    (2015)
  • C.C. Chang et al.

    LIBSVM: A library for support vector machines

    ACM Transactions on Intelligent Systems and Technology (TIST)

    (2011)
  • B.G. Chen et al.

    Age of acquisition effects in reading Chinese: Evidence in favour of the arbitrary mapping hypothesis

    British Journal of Psychology

    (2007)
  • Y.Y. Chen et al.

    An unsupervised automated essay scoring system

    IEEE Intelligent Systems

    (2010)
  • C.W. Hsu et al.

    A comparison of methods for multiclass support vector machines

    IEEE Transactions on Neural Networks

    (2002)
  • O. Cokluk et al.

    Examining differential item functions of different item ordered test forms according to item difficulty levels

    Educational Sciences: Theory & Practice

    (2016)
  • C. Cortes et al.

    Support-vector networks

    Machine Learning

    (1995)
  • C. "DeMars

    Item response theory

    (2010)
  • El Masri et al.

    Predicting item difficulty of science national curriculum tests: The case of key stage 2 assessments

    The Curriculum Journal

    (2017)
  • S.E. Embretson et al.

    Component latent trait models for paragraph comprehension tests

    Applied Psychological Measurement

    (1987)
  • S. Ferrara et al.

    Test development with performance standards and achievement growth in mind

    Educational Measurement: Issues and Practice

    (2011)
  • R. Freedle et al.

    The prediction of SAT reading comprehension item difficulty for expository prose passages (RR-91-29)

    Retrieved from Princeton

    (1991)