BY-NC-ND 3.0 license Open Access Published by De Gruyter May 11, 2016

Design and Development of a Framework for an Automatic Answer Evaluation System Based on Similarity Measures

  • Madhumitha Ramamurthy and Ilango Krishnamurthi

Abstract

The assessment of answers is an important process that requires great effort from evaluators. This assessment process requires high concentration without any fluctuations in mood, which substantiates the need to automate answer script evaluation. In text answer evaluation, sentence similarity measures have been widely used to compare student-written answers with reference texts. In this paper, we propose an automated answer evaluation system that uses our proposed cosine-based sentence similarity measures to evaluate the answers. Cosine measures have proved effective in comparing free-text student answers with reference texts. We propose a set of novel cosine-based sentence similarity measures with varied approaches to creating the document vector space. In addition, we propose a novel synset-based word similarity measure for the computation of document vectors, coupled with varied approaches to dimensionality reduction. In total, we propose 21 cosine-based sentence similarity measures and measure their performance using the MSR paraphrase corpus and Li’s benchmark data set. We also use these measures in the automatic answer evaluation system and compare their performance using the Kaggle short answer and essay data sets. The system-generated scores are compared with human scores using Pearson correlation. The results show that the system and human scores correlate with each other.

1 Introduction

The evaluation of text answers is a challenging process that requires great effort from evaluators, especially when the number of answers to evaluate is high. An automatic answer script evaluation system is needed because human evaluation demands sustained concentration and might be biased, whereas an automatic answer evaluation system is free of these limitations.

An area related to the automated evaluation of free text such as short answers and essays is text processing, which is used in domains such as information retrieval, data mining, and web search [14, 17, 34]. The bag-of-words approach [13, 16, 31], which is commonly used in text processing, represents documents as vectors, where each component represents a corresponding feature of the document. Common measures such as term frequency/inverse document frequency (TF/IDF) [8] follow this approach and are useful in determining a word’s importance to a document. Similarity measures also play a prominent role in many tasks such as natural language processing, information retrieval, and web search; they quantify the similarity between two sentences, two words, or two vectors. A few approaches, such as that of Lee [20], use similarity measures to compute similarity scores between sentences.

We propose an automatic answer script evaluation system that uses a sentence similarity measure to evaluate students’ answers against a reference text. The proposed similarity measures form document vectors with noun–verb and noun–verb–adjective–adverb models. We propose an improved word similarity measure that performs better than the standard TF/IDF model or the model proposed by Lee [20]. Additionally, we compute cosine similarity with the full matrix, in original or reordered form, or with average-based dimensionality reduction instead of the maximum-based dimensionality reduction proposed by Lee [20].

Some earlier work on automatic answer script evaluation is described below. One of the earliest works was by Page [27] on essay grading; his system, named PEG (Project Essay Grader), checked only the style of the essay. E-rater, proposed by Burstein et al. [3], also checks the writing style of the essay. C-rater, by Burstein et al. [4], checks the syntactic and semantic features of essays. Intelligent Essay Assessor, by Hearst [9] and Jerrams-Smith et al. [12], uses the latent semantic analysis (LSA) technique, and Kanejiya et al. [15] use LSA with consideration of the preceding word in evaluating essays. Pérez et al. [29] propose a system called RARE, which uses the BLEU algorithm [28] in evaluating answers. Siddiqi and Harrison [36] use pattern-matching techniques in evaluating answers. Hu and Xia [11] use automatic segmentation techniques along with subject ontology in evaluating answers. Saxena and Gupta [32] use hybrid techniques such as information extraction, natural language processing (NLP), and pattern matching to evaluate students’ answers. Lajis and Aziz [19] use a node link analysis technique to evaluate students’ short answers. Lee [20] proposed a method where cosine similarity is computed on maximum-reduced document similarity vectors created with a noun–verb pair semantic space and a hypernym-based word similarity calculation. Our proposed automated system uses our synset-based word similarity measure over WordNet to evaluate the answers. In addition, our system considers only one reference answer instead of multiple pre-graded essays for evaluation.

The rest of the paper is organized as follows. Section 2 outlines related work. Section 3 describes the proposed automatic answer script evaluation framework. Section 4 details the proposed sentence similarity measures. Section 5 extends them to essay scoring. Section 6 gives the experimental results, and Section 7 concludes the paper.

2 Related Works

This section describes the survey of similarity measures, automated assessment systems, dimensionality reduction methods, and vector similarity measures.

2.1 Similarity Measures and Vector Similarity Measures

To compute the similarity between two vectors, many measures have been suggested. The Kullback–Leibler divergence [18] is a non-symmetric measure of the difference between the probability distributions associated with the two vectors. Euclidean distance [33] is a well-known similarity metric taken from Euclidean geometry. Manhattan distance [33], similar to Euclidean distance, is the well-known taxicab metric. The Canberra distance metric [33] is employed in situations where the elements in a vector are non-negative. Cosine similarity computes the cosine of the angle between two vectors. The Bray–Curtis similarity measure [25] is a city-block metric that is sensitive to outlying values. The Jaccard coefficient [6] is a statistic employed for comparing the similarity of two sample sets. The Hamming distance [6, 7] between two vectors is the number of positions at which the corresponding symbols differ. The extended Jaccard coefficient and the Dice coefficient [38, 39] retain the sparsity property of the cosine similarity measure while permitting discrimination of collinear vectors. An information-theoretic measure named IT-Sim [1, 24] is used for measuring document similarity. A phrase-based measure was suggested by Chim and Deng [5] to calculate similarity based on the suffix tree document model.

2.2 Sentence Similarity Measures

Many similarity measures were proposed for computing sentence similarity.

Lee [20] proposed a sentence similarity technique in which the cosine similarities for the noun and verb vectors are calculated separately and then joined using a weighted formula. Our proposed sentence similarity measure does not use this weighted formula, since the value of its weighting coefficient is not defined; instead, we propose an equivalent weighted formula with synset-level similarity computation for finding the similarity between sentences. Additionally, we compute cosine similarity with the full matrix, or with reduced dimensions obtained by computing the average instead of the maximum-based dimensionality reduction. The depth calculation used in hypernym-based word similarity measures works well only if the two words are similar; the possibility of finding the depth factor diminishes as the dissimilarity between the words increases.

Ho et al. [10] proposed an approach wherein sentence similarity is computed by remodeling a corpus-based measure into a knowledge-based measure, so that particular meanings are compared rather than nearest meanings. First, the string similarity between two words is computed; then the word similarity based on a knowledge-based measure [41] is computed, along with word sense disambiguation (WSD) to find the sense of each word. Finally, sentence similarity is calculated by combining the string similarity and the word similarity based on the knowledge-based measure and WSD integration. For word pairs that are not disambiguated properly, the similarity score is assigned based on nearest meanings; for disambiguated word pairs, the similarity score is assigned based on actual meanings. Our proposed measure also uses a knowledge-based synset word similarity measure, which finds several levels of synonyms from WordNet, to find the similarity between sentences.

Li et al. [23] proposed an approach that considers the “objects” (nouns), “properties” (adjectives and adverbs), and “behaviors” (verbs) within the sentences to calculate sentence similarity. Nouns, verbs, adverbs, and adjectives are extracted to compute their similarity, which is done by forming vectors for the extracted objects, properties, and behaviors. To calculate the object similarity, an object vector is created and every noun from the sentence is mapped into the object similarity vector. The vector is filled by computing the word similarity between a word and every noun of a sentence with the assistance of WordNet, using the least common subsumer (LCS) and path length. The word similarity scores of the nouns are then taken, and the similarity between the objects of two sentences is computed by cosine similarity. Property similarity and behavior similarity are computed in the same manner, and the overall sentence similarity is computed by combining object, property, and behavior similarity. Our proposed similarity measure also uses object, property, and behavioral similarity, as well as synset-based word similarity, to find the similarity between sentences.

Shan et al. [35] proposed an approach in which the similarity between sentences relies on both syntactic and semantic similarity. Semantic similarity computation relies on extracting the events from every sentence by considering the subject, predicate, accusative, time, and location. The similarity between the event parts is then measured using word similarity, and the similarity between two events is measured using the event-part similarity; the final semantic similarity for the sentence is calculated by measuring the similarity between the two events. The syntactic similarity computation is done after pre-processing, stop word removal, and stemming of the sentences, after which the content words such as nouns, verbs, and adjectives are considered. Syntactic similarity is computed based on the number of co-occurring content words and by measuring the LCS. Lastly, sentence similarity is calculated by combining semantic and syntactic similarity. Our proposed similarity measure also extracts the events from the sentence but uses synset-based word similarity instead of the LCS to compute the similarity.

Li et al. [22] proposed an approach that uses semantic and word order information to compute sentence similarity. This method takes two sentences and finds all distinct words within them to form a joint word set. A raw semantic vector for each sentence is formed by computing the word similarity between each sentence and the words in the joint word set with the help of WordNet, using the path length and depth between two words. The semantic vector for the two sentences is derived from the raw semantic vector and the information content derived from the corpus. Word order and semantic similarity are then combined to find the overall sentence similarity. Our proposed similarity measure also forms the semantic vector but uses the synset-based word similarity (SWS) measure to fill the vector.

O’Shea et al. [26] proposed a new sentence similarity measure based on Li et al. [22], which is used for constructing conversational agents. This methodology also finds the similarity between words based on the path length and depth of the words in WordNet. Weights are assigned to each word based on its importance, using the information content from the corpus. The information content of every word and the similarity between words based on path length and depth are combined into the semantic vectors. Thus, the semantic vector is formed and the semantic similarity is calculated. The overall sentence similarity is then computed by combining the semantic and word order similarity using the cosine formula $s_1 \cdot s_2 / (\|s_1\|\,\|s_2\|)$. Our proposed similarity measure uses synset-based word similarity to form the semantic vector instead of path length and depth.

2.3 Dimensionality Reduction Methods

Dimensionality reduction is performed to reduce the number of features or attributes. There are seven common reduction techniques (Silipo et al. [37], www.knime.org). The first technique is “missing values”: a data column that contains too many missing values, as determined by a threshold, is removed. The second technique is “variance”: the variance of each data column is measured, and columns whose variance falls below a threshold are removed. The third technique is “correlation”: whether a feature is dependent on another feature is checked, i.e. a feature that depends on another feature and produces the same information may be removed without affecting future tasks; this is done by measuring the correlation between two data columns or features. The fourth technique is “principal component analysis,” which transforms the original coordinates of the data set into a new set of coordinates called principal components, sorted by variance. Based on this variance, the data set is reduced by removing the lower-variance components and retaining those with useful information. The fifth technique is “decision tree”: sets of decision trees are generated, each trained on a certain number of attributes; the feature selected to split the tree in most of the trees is considered important and is retained. The sixth technique is “backward feature elimination”: a classification algorithm is iteratively trained on n features, then n−1 features, and so on until one feature is left; at each step, the feature whose removal least increases the error rate is eliminated. The seventh technique, “forward feature construction,” generates several classifiers starting from one feature and adding features one by one, selecting the most informative feature at each step.

The singular value decomposition [2] is also a dimensionality reduction technique that is used to break a matrix into useful features. An m×n matrix M admits a factorization of the form UΣV*, where U is an m×r real or complex unitary matrix, Σ is an r×r diagonal matrix with non-negative real numbers on the diagonal, V is an n×r real or complex unitary matrix, and r is equal to the rank of the matrix M. The diagonal entries Σ_ii of Σ are known as the singular values of M. The columns of U and the columns of V are called the left-singular vectors and right-singular vectors of M, respectively. SVD reduces the dimensionality of the data by projecting the data onto the space spanned by the left-singular vectors corresponding to the k largest singular values. Lee [20] used maximum-based dimensionality reduction, whereas our proposed similarity measure uses strategies such as average-based dimensionality reduction and different variations of no-dimensionality-reduction strategies to compute sentence similarity.

3 Design of Automatic Answer Evaluation System Based on Similarity Measures

In this section, we describe our proposed framework for automated student answer script evaluation system, as depicted in Figure 1.

Figure 1: Sentence Similarity-Based Student Answer Script Evaluation Framework.

We also propose a set of sentence similarity measures and apply them to develop the automated student answer evaluation system.

3.1 NLP-Based Pre-Processing

This subsystem processes the input sentences and builds the POS-based semantic space. It includes two components.

3.1.1 Sentence Formalization

The sentence pairs considered as input are formalized before actual processing is done. In this step, we perform tokenizing, lowercasing, and stop word removal.

3.1.2 Parts of Speech Tagging

The input sentence pairs are tagged with parts-of-speech information using the Apache OpenNLP POS tagger, and various sets such as nouns, verbs, adjectives, adverbs, and all-words groups are formed.
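
As a concrete illustration of this step, the following minimal Python sketch groups words by part of speech. NLTK is used here purely as a stand-in for the Apache OpenNLP tagger, and all function and variable names are our own assumptions rather than the paper's implementation:

```python
# Sketch of Section 3.1 pre-processing; NLTK stands in for Apache OpenNLP.
# Requires: nltk.download("punkt"), nltk.download("stopwords"),
#           nltk.download("averaged_perceptron_tagger")
import nltk
from nltk.corpus import stopwords

def preprocess_and_tag(sentence):
    # Sentence formalization: tokenize, lowercase, remove stop words.
    stops = set(stopwords.words("english"))
    tokens = [t.lower() for t in nltk.word_tokenize(sentence)]
    tokens = [t for t in tokens if t.isalpha() and t not in stops]
    # POS tagging (Penn Treebank tag set) and grouping by coarse POS.
    groups = {"noun": [], "verb": [], "adj": [], "adv": []}
    for word, tag in nltk.pos_tag(tokens):
        if tag.startswith("NN"):
            groups["noun"].append(word)
        elif tag.startswith("VB"):
            groups["verb"].append(word)
        elif tag.startswith("JJ"):
            groups["adj"].append(word)
        elif tag.startswith("RB"):
            groups["adv"].append(word)
    return groups
```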

3.2 Semantic Space Creation

As our sentence similarity models require different forms of base space, this module performs the function of creating multiple base spaces containing noun–verb, noun–verb–adjective–adverb, noun–verb synsets, noun–verb–adjective–adverb synsets, and all words.

3.3 Document Vector Creation

This module creates multiple document vectors. The vectors are created based on TF/IDF or using our proposed synset-based word similarity measure, which is detailed in Section 4.2.3.

3.4 Dimension Reduction Methods

This module performs the task of reducing the dimension of a document vector by computing the maximum or the average of each column in the document vector; it also supports no reduction with matrix reordering, and no reduction with matrix reordering and zeroed non-diagonals.

3.5 Cosine Similarity Computation

This module computes the cosine similarity between the two sentence vectors, given in reduced or full form, and converts the cosine similarity value into a percentage score by multiplying it by 100.

4 Sentence Similarity Evaluation Algorithm

A similarity measure represents the similarity between two objects, sentences, etc. Many similarity measures help us evaluate the similarity between sentences, which can be used in an automatic answer evaluation system.

This section describes our approach for design and evaluation of sentence similarity measures based on several parameters such as document vector formation methods, word similarity, and use of dimensionality reduction.

4.1 Base Set Formation

The objective here is to form the set of words with which the actual comparison is done. The methods under study either consider all words in the document vector space, as adopted in measures like TF/IDF, or form a noun–verb vector space as in Lee [20]. Additionally, we propose to consider nouns, verbs, adjectives, and adverbs, because all these parts of speech contribute to expressing the semantics of a sentence. Our experimental results show that the proposed noun–verb–adjective–adverb model performs better than all other models.

4.1.1 Full Sentence Semantic Space

This approach computes similarity by considering all the words in the sentence pairs.

Initially, the word sets of the two input sentences are formed by combining all the words in each sentence.

Definition 1: The word sets of two input sentences are formed as follows:

$$WS_{sent\_1} = \{\text{all words}_{sent\_1}\}, \quad WS_{sent\_2} = \{\text{all words}_{sent\_2}\}.$$

After the formation of the word sets, the base set (BS) for all the words is formed. Its definition is given below:

$$BS = WS_{sent\_1} \cup WS_{sent\_2}$$

Here, BS is the union of all words in sentences 1 and 2.

4.1.2 Noun-Verb Semantic Space

The proposed NV similarity measure approach computes the similarity by considering the nouns and verbs in the sentences. This approach extracts the parts of speech of the sentence pairs using the Stanford parser. Initially, the word sets of the two input sentences are formed by combining the nouns and verbs of each sentence, i.e. forming the union of the nouns and verbs in a sentence. Whereas the Lee [20] models create two base spaces separately for nouns and verbs, our proposed method creates a single base space with nouns and verbs together.

Definition 2: The word sets of two input sentences are formed as follows:

$$WS_{sent\_1} = Nouns_{sent\_1} \cup Verbs_{sent\_1}, \quad WS_{sent\_2} = Nouns_{sent\_2} \cup Verbs_{sent\_2}.$$

Here, WS_sent_1 and WS_sent_2 are the sets of words in sentence 1 and sentence 2, respectively. Nouns_sent_1 and Verbs_sent_1 correspond to the nouns and verbs, respectively, in sentence 1.

Definition 3: The definitions for forming the NV BS for sentences 1 and 2 are as given below:

$$NV\_BS_1 = Nouns_{sent\_1} \cup Verbs_{sent\_1}, \quad NV\_BS_2 = Nouns_{sent\_2} \cup Verbs_{sent\_2}$$

Finally, we form the NV BS as the union of NV_BS1 and NV_BS2:

$$NV\_BS = NV\_BS_1 \cup NV\_BS_2 = WS_{sent\_1} \cup WS_{sent\_2}$$
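
A minimal sketch of Definitions 2 and 3, assuming the POS groups produced by the pre-processing sketch in Section 3.1 (all names are illustrative):

```python
def nv_base_set(groups1, groups2):
    # WS_sent_1 and WS_sent_2: union of nouns and verbs per sentence.
    ws1 = set(groups1["noun"]) | set(groups1["verb"])
    ws2 = set(groups2["noun"]) | set(groups2["verb"])
    # NV_BS = NV_BS1 U NV_BS2.
    return ws1 | ws2
```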

4.1.3 Noun-Verb Synset Semantic Space

Here we redefine the BS defined in the previous section by adding its corresponding synsets as well.

Definition 4: The definitions for forming the NV BS (Ramamurthy and Krishnamurthi [30]) for sentences 1 and 2 are as given below:

$$NV\_BS_1 = Nouns_{sent\_1} \cup Verbs_{sent\_1} \cup Synset(Nouns_{sent\_1} \cup Verbs_{sent\_1})$$
$$NV\_BS_2 = Nouns_{sent\_2} \cup Verbs_{sent\_2} \cup Synset(Nouns_{sent\_2} \cup Verbs_{sent\_2})$$

Finally, we form the NV BS as the union of NV_BS1 and NV_BS2.

$$NV\_BS = NV\_BS_1 \cup NV\_BS_2$$
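
The synset expansion can be sketched with NLTK's WordNet interface; a single level of expansion is shown here, assuming the `nv_base_set` helper above (a hedged sketch, not the paper's exact code):

```python
from nltk.corpus import wordnet as wn

def synset_words(word):
    # All lemma names appearing in any WordNet synset of the word.
    return {lemma for s in wn.synsets(word) for lemma in s.lemma_names()}

def nv_synset_base_set(nv_bs1, nv_bs2):
    # Definition 4: add each word's synonyms to the base sets, then union.
    expanded1 = nv_bs1 | {w for word in nv_bs1 for w in synset_words(word)}
    expanded2 = nv_bs2 | {w for word in nv_bs2 for w in synset_words(word)}
    return expanded1 | expanded2
```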

4.1.4 Noun–Verb–Adjective–Adverb Semantic Space

This proposed method is an extension of the noun–verb semantic space model proposed by Lee [20], considering the adjectives and adverbs in the sentence for computing sentence similarity. Here we redefine our definitions for forming base spaces.

Definition 5: The word sets for the two input sentences are formed as follows:

$$WS_{sent\_1} = Nouns_{sent\_1} \cup Verbs_{sent\_1} \cup Adj_{sent\_1} \cup Adv_{sent\_1},$$
$$WS_{sent\_2} = Nouns_{sent\_2} \cup Verbs_{sent\_2} \cup Adj_{sent\_2} \cup Adv_{sent\_2}.$$

WS_sent_1 and WS_sent_2 are the sets of words in sentences 1 and 2, respectively. Nouns_sent_1, Verbs_sent_1, Adj_sent_1, and Adv_sent_1 correspond to the nouns, verbs, adjectives, and adverbs, respectively, in sentence 1. Nouns_sent_2, Verbs_sent_2, Adj_sent_2, and Adv_sent_2 are the nouns, verbs, adjectives, and adverbs, respectively, in sentence 2.

Definition 6: The definitions for forming the NVAA_BS of sentence 1 and 2 are as given below:

$$N\_BS = Nouns_{sent\_1} \cup Nouns_{sent\_2}, \quad V\_BS = Verbs_{sent\_1} \cup Verbs_{sent\_2},$$
$$Adj\_BS = Adj_{sent\_1} \cup Adj_{sent\_2}, \quad Adv\_BS = Adv_{sent\_1} \cup Adv_{sent\_2}$$

N_BS is the union of nouns in sentences 1 and 2, V_BS is the union of verbs in sentences 1 and 2, Adj_BS is the union of adjectives of sentences 1 and 2, and Adv_BS is the union of adverbs in sentences 1 and 2.

Finally, we form the NVAA_BS as the union of all the mentioned BSs.

$$NVAA\_BS = N\_BS \cup V\_BS \cup Adj\_BS \cup Adv\_BS = WS_{sent\_1} \cup WS_{sent\_2}$$

4.1.5 Noun–Verb–Adjective–Adverb Synset Semantic Space

This proposed method is an extension of the semantic space formation method adopted in the previous section. Here we add the synsets of the previous BS into the BS, considering the adjectives and adverbs in the sentence for computing sentence similarity. We redefine our definition for forming base spaces (Ramamurthy and Krishnamurthi [30]).

Definition 7: The definitions for forming the NVAA_BS of sentences 1 and 2 are as given below:

$$N\_BS = Nouns_{sent\_1} \cup Nouns_{sent\_2} \cup Synset(Nouns_{sent\_1} \cup Nouns_{sent\_2})$$
$$V\_BS = Verbs_{sent\_1} \cup Verbs_{sent\_2} \cup Synset(Verbs_{sent\_1} \cup Verbs_{sent\_2})$$
$$Adj\_BS = Adj_{sent\_1} \cup Adj_{sent\_2} \cup Synset(Adj_{sent\_1} \cup Adj_{sent\_2})$$
$$Adv\_BS = Adv_{sent\_1} \cup Adv_{sent\_2} \cup Synset(Adv_{sent\_1} \cup Adv_{sent\_2})$$

N_BS is the union of nouns in sentences 1 and 2, V_BS is the union of verbs in sentences 1 and 2, Adj_BS is the union of adjectives of sentences 1 and 2, and Adv_BS is the union of adverbs in sentences 1 and 2.

Finally, we form the NVAA_BS as the union of all the mentioned BSs:

$$NVAA\_BS = N\_BS \cup V\_BS \cup Adj\_BS \cup Adv\_BS$$

4.2 Document Vector Matrix Formation

In the previous section, we framed strategies to create multiple BSs; the vectors built over them are compared, resulting in a document similarity matrix. In this section, we employ three strategies to compute the values of the document vector. Initially, we describe the classic TF/IDF method, then the hypernym-based word similarity method of Wu and Palmer [40], and finally our proposed synset-based method of computing the similarity between two words. Our experimental results show that our proposed method gives better results than the existing approaches.

4.2.1 Term Frequency–Inverse Document Frequency

The TF/IDF is a classical method of computing the document vector matrix used for checking similarity between two documents. The term frequency is calculated by counting the number of times a term or word appears in the document. As sentences 1 and 2 may be of different sizes, normalization is performed for each sentence. The normalization is performed by dividing each term frequency by the total number of terms in a sentence. The inverse document frequency is calculated for each term using the formula

$$IDF(term) = 1 + \log_e\!\left(\frac{\text{total number of documents}}{\text{number of documents containing the term}}\right).$$

Then the normalized term frequency vector is multiplied by the inverse document frequency vector, forming a vector for each sentence.
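
The following sketch shows one way to realize this construction over a base set, treating the two sentences as the document collection (function and variable names are our assumptions):

```python
import math

def tfidf_vectors(base_set, sent1_tokens, sent2_tokens):
    docs = [sent1_tokens, sent2_tokens]
    vectors = []
    for doc in docs:
        vec = []
        for term in sorted(base_set):
            tf = doc.count(term) / len(doc)          # normalized term frequency
            df = sum(1 for d in docs if term in d)   # document frequency
            idf = (1 + math.log(len(docs) / df)) if df else 0.0
            vec.append(tf * idf)
        vectors.append(vec)
    return vectors
```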

4.2.2 Wu–Palmer Word Similarity Score

Wu and Palmer [40] proposed a word similarity measurement to determine the similarity between two nouns or between two verbs. This is given as

$$Similarity(word_A, word_B) = \frac{2 \cdot Depth(H_1)}{DPathLength(word_A, H_1) + DPathLength(word_B, H_1) + 2 \cdot Depth(H_1)} \quad (1)$$

Here, H_1 is the lowest shared hypernym of word_A and word_B. Depth(H_1) is the level of H_1 in the WordNet semantic tree. DPathLength(word_A, H_1) is the semantic distance (number of hops) from H_1 to word_A, and DPathLength(word_B, H_1) is the semantic distance (number of hops) from H_1 to word_B. Each word is compared with the base space to obtain the value of each field via Formula (1).
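
NLTK exposes Wu–Palmer similarity directly, which allows a short illustration of Formula (1); taking only the first synset of each word is a simplification on our part:

```python
from nltk.corpus import wordnet as wn

def wu_palmer(word_a, word_b):
    syns_a, syns_b = wn.synsets(word_a), wn.synsets(word_b)
    if not syns_a or not syns_b:
        return 0.0
    # wup_similarity returns None when no common hypernym path exists.
    return syns_a[0].wup_similarity(syns_b[0]) or 0.0
```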

Our experimental results show that the depth calculation works well only if the two words are similar. The possibility of finding the depth factor diminishes as the dissimilarity between the words increases.

Similarity calculation is shown in Table 1.

Table 1:

Similarity Calculation.

                      N_BS_w1                        N_BS_w2
Nouns_sent_1 W1       Sim(Nouns_sent_1 W1, N_BS_w1)  Sim(Nouns_sent_1 W1, N_BS_w2)
Nouns_sent_1 W2       Sim(Nouns_sent_1 W2, N_BS_w1)  Sim(Nouns_sent_1 W2, N_BS_w2)

4.2.3 Proposed Word Similarity Measure

Contrary to the hypernym-based word similarity measure proposed by Wu and Palmer [40], we propose a synset-based word similarity approach for calculating the similarity between two words. This approach is given in Eq. (2).

$$Sim(w_1, w_2) = \begin{cases} 1, & \text{if } w_1 = w_2 \text{ or } synset(w_1) \cap synset(w_2) \neq \emptyset \text{ at some level } 1 \le i \le 10, \\ 0, & \text{if } w_1 \neq w_2 \text{ and no synset overlap is found up to level } 10, \end{cases} \quad (2)$$

where the word sets are expanded level by level as $w_1 = w_1 \cup synset_i(w_1)$ and $w_2 = w_2 \cup synset_i(w_2)$ for $i = 1, \dots, 10$.

Our experimental results show that our proposed synset-based word similarity calculation outperforms the classic TF/IDF and Wu–Palmer methods when used for calculating sentence similarity based on document similarity vectors.
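
A sketch of Eq. (2) using NLTK's WordNet interface. The ten-level expansion is implemented literally here; in practice the expanded sets grow quickly, so a real implementation would likely prune or cache them:

```python
from nltk.corpus import wordnet as wn

def synset_neighbors(words):
    # One synset level: all lemma names of all synsets of the given words.
    return {l for w in words for s in wn.synsets(w) for l in s.lemma_names()}

def sws(w1, w2, max_levels=10):
    set1, set2 = {w1}, {w2}
    for _ in range(max_levels):
        if set1 & set2:
            return 1.0                     # match or synset overlap found
        set1 |= synset_neighbors(set1)     # w1 = w1 U synset_i(w1)
        set2 |= synset_neighbors(set2)     # w2 = w2 U synset_i(w2)
    return 1.0 if set1 & set2 else 0.0
```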

4.3 Dimensionality Reduction

Generally, dimensionality reduction is performed to reduce the dimensions or features of a vector or space.

4.3.1 No Reduction

A few of our approaches, given in the following subsections, do not reduce the dimensions of the document similarity matrix before computing similarity scores. We observed in our experiments that reducing the dimensions of the document similarity matrix introduces an error factor, resulting in lower correlation with the human score for similarity measures computed on reduced matrices.

4.3.2 No Reduction with Matrix Reordering

In our experiments, the cosine similarity value was observed to improve when the rows of the similarity matrix are reordered such that the maximum value of each column appears as the diagonal element of the corresponding row. This can be represented by the formula

$$\text{for } i = 1, \dots, rows: \quad interchangeRows(similarityMatrix,\ i,\ MAX\_COL\_INDEX(i, rows))$$
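
One possible realization of this reordering in NumPy, written greedily column by column (our own naming, not necessarily the exact procedure used in the experiments):

```python
import numpy as np

def reorder_rows(sim_matrix):
    m = np.array(sim_matrix, dtype=float)
    for i in range(min(m.shape)):
        # Row (at or below i) holding the maximum of column i.
        j = i + int(np.argmax(m[i:, i]))
        m[[i, j]] = m[[j, i]]  # interchangeRows(similarityMatrix, i, j)
    return m
```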

4.3.3 No Reduction with Matrix Reordering and Zeroed Non-Diagonals

It was observed in our experiments that the similarity value improves when irrelevant non-diagonal elements are suppressed to 0. Thus, we set to 0 all non-diagonal elements, which represent synset relationships over two levels.

4.3.4 Lee [20] Approach

Lee [20] proposed maximum-based computation for reducing the dimensions of the document similarity matrix before computing the sentence similarity score. From the vector formed using Table 1, the maximum value is chosen as the final value of each field in the vector. The formulas are listed as follows:

$$NV_{sent1} = \max_{k=1,\dots,|N\_BS|} Sim(w_1, N\_BS_k), \quad VV_{sent1} = \max_{k=1,\dots,|V\_BS|} Sim(w_1, V\_BS_k)$$

We observed in our experimental results that, in some cases, the use of the maximum to reduce dimensions introduces a greater dimensionality reduction error; thus, the correlation between the human score and the score computed with the reduced document similarity matrix weakened.

4.3.5 Our Proposed Average-Based Dimensionality Reduction

Instead of the maximum-based dimensionality reduction proposed by Lee [20], we propose an average-based dimensionality reduction method for document similarity matrix. Accordingly, the equations given in the previous section are modified as

$$NV_{sent1} = \underset{k=1,\dots,|NV\_BS|}{\mathrm{avg}}\, Sim(w_1, NV\_BS_k), \quad NVAA_{sent1} = \underset{k=1,\dots,|NVAA\_BS|}{\mathrm{avg}}\, Sim(w_1, NVAA\_BS_k)$$

Through our experimental results, we have observed that similarity computations with average-reduced document similarity vectors inject less dimensionality reduction error than computations based on maximum-reduced document vectors. This may be because errors due to accidental word similarity matches are smoothed or normalized by the average computation, whereas they are highlighted in maximum-based calculations.
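
The two reduction strategies differ only in the aggregation applied to each row of the document similarity matrix, as this NumPy sketch shows (rows index the words of one sentence, columns index the base set):

```python
import numpy as np

def reduce_max(sim_matrix):
    # Lee [20]: keep only the best base-set match per word.
    return np.asarray(sim_matrix).max(axis=1)

def reduce_avg(sim_matrix):
    # Proposed: averaging smooths accidental word similarity matches.
    return np.asarray(sim_matrix).mean(axis=1)
```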

4.4 Cosine Similarity

The cosine similarity measures the similarity between vectors A and B. The vectors formed as in Section 4.3 are passed to the cosine similarity measure to obtain the similarity value. This measure is given by

$$similarity = \cos(\theta) = \frac{A \cdot B}{\|A\|\,\|B\|} = \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^2}\ \sqrt{\sum_{i=1}^{n} B_i^2}},$$

where $A_i$ and $B_i$ are the components of vectors A and B, respectively.
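
A direct NumPy rendering of the formula; multiplying the result by 100 yields the percentage score produced by the module of Section 3.5:

```python
import numpy as np

def cosine_score(a, b):
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom else 0.0
```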

4.5 Weighted Computation

It can generally be observed that the importance of a word in a given sentence is related to its part of speech. Words that belong to parts of speech such as prepositions, conjunctions, and interjections constitute the stop word list and are considered to be of zero priority. Therefore, we propose an approach with varied weights for the different parts of speech under consideration.

4.5.1 NV-synset-Weighted Computation for Overall Similarity

The NV-synset-weighted formula computes the similarity between sentences using only nouns and verbs. This formula (Ramamurthy and Krishnamurthi [30]) takes two parameter values for nouns and verbs: α=0.65 and β=0.35. The parameter values were set by considering that nouns are more important when evaluating a sentence; hence, a higher weight of 65% is given to the noun component (α=0.65) and the remaining 35% to the verb component (β=0.35).

$$\text{Overall Sentence Similarity} = \alpha \cdot NounSim + \beta \cdot VerbSim$$

Here, NounSim and VerbSim are the similarity values obtained from the cosine similarity computation of Section 4.4.

4.5.2 NVAA-Synset-Weighted Computation for Overall Similarity

The NVAA-synset-weighted formula computes the similarity between sentences using the noun, verb, adjective, and adverb POS, i.e. NounSim, VerbSim, AdjSim, and AdvSim. This formula (Ramamurthy and Krishnamurthi [30]) takes four parameter values for nouns, verbs, adjectives, and adverbs: α=0.4, β=0.3, γ=0.1, and δ=0.2. The parameter values give the noun similarity the highest weight of 40% (α=0.4), the verb 30% (β=0.3), the adjective 10% (γ=0.1), and the adverb 20% (δ=0.2).

$$\text{Overall Sentence Similarity} = \alpha \cdot NounSim + \beta \cdot VerbSim + \gamma \cdot AdjSim + \delta \cdot AdvSim$$

Here, NounSim, VerbSim, AdjSim, and AdvSim are the similarity values obtained from the cosine similarity computation of Section 4.4.
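
Both weighted combinations reduce to simple linear blends of the per-POS cosine similarities; a sketch with the weights given above:

```python
def nv_weighted(noun_sim, verb_sim, alpha=0.65, beta=0.35):
    # NV-synset-weighted overall similarity.
    return alpha * noun_sim + beta * verb_sim

def nvaa_weighted(noun_sim, verb_sim, adj_sim, adv_sim,
                  alpha=0.4, beta=0.3, gamma=0.1, delta=0.2):
    # NVAA-synset-weighted overall similarity.
    return (alpha * noun_sim + beta * verb_sim
            + gamma * adj_sim + delta * adv_sim)
```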

5 Essay Scoring Computation

In this phase, we extend the sentence similarity measures described in the previous sections to the evaluation of short answers or essays. As in the sentence similarity computation, we define a semantic space, create a similarity matrix using our proposed SWS word similarity measure, optionally apply dimensionality reduction, and finally compute the cosine similarity between the two vectors.

Here, one vector corresponds to the reference text and the other to the answer that has to be evaluated. However, in the case of scoring, the BS should contain words solely from the reference text and cannot contain words from the student answer. Therefore, we redefine our base space as follows.

Definition 8: The BS for the full sentence semantic space is formed as follows:

$$BS = WS_{sent\_1} = \{\text{all words}_{sent\_1}\},$$ where sentence 1 denotes the reference text.

Definition 9: The definition for forming the NV BS is as given below:

$$NV\_BS = NV\_BS_1 = WS_{sent\_1}$$

Definition 10: The definition for forming the NVAA BS is as given below:

$$N\_BS = Nouns_{sent\_1}, \quad V\_BS = Verbs_{sent\_1}, \quad Adj\_BS = Adj_{sent\_1}, \quad Adv\_BS = Adv_{sent\_1},$$
$$NVAA\_BS = N\_BS \cup V\_BS \cup Adj\_BS \cup Adv\_BS$$
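
A minimal sketch of Definitions 8–10: for scoring, the base set is built from the reference answer alone, reusing the illustrative POS groups from the earlier sketches:

```python
def scoring_base_set(reference_groups):
    # NVAA_BS built only from the reference text (Definition 10).
    return (set(reference_groups["noun"]) | set(reference_groups["verb"])
            | set(reference_groups["adj"]) | set(reference_groups["adv"]))
```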

6 Experimental Results

We conducted experiments to evaluate the performance of our proposed sentence similarity measures and essay scoring method. To evaluate the proposed 21 sentence similarity measures, we obtained results on the MSR paraphrase corpus, and for comparison with the results given in Lee et al. [21], we used the same set of sentences from Li's benchmark data set. Similarly, the proposed essay scoring was evaluated on the entire contents of the first essay set from Kaggle, which consists of 1787 essays.

6.1 Evaluation over the MSR Paraphrase Corpus

The experiments on our proposed sentence similarity measures were conducted on the MSR paraphrase corpus, a well-known data set used by SEMEVAL (http://alt.qcri.org/) workshops. The MSR paraphrase corpus consists of 739 sentence pairs, and our proposed measures were evaluated with various models. We selected a set of 21 sentence similarity methods consisting of a cross section of the approaches described for calculating sentence similarity. The measures were applied on the 739 pairs of sentences in the data set and compared with the gold score value given in the data set for computing the standard and average deviation. Our measures reported a lowest overall standard deviation of 0.11 and an average deviation of 0.18, as depicted in Figure 2. In our class-wise evaluation with a human score interval of 1, a lowest overall standard deviation of 0.18 and average deviation of 0.06 were found. The reader is directed to Figures 3–7 for further evaluation of class-wise performance. Table 2 shows the legend for Figures 2–7.

Figure 2: Overall Results for Proposed 21 Measures with MSR Paraphrase.

Figure 3: Standard/Average Deviation Results for Proposed 21 Measures on Score Classes 0–1 with MSR Paraphrase.

Figure 4: Standard/Average Deviation Results for Proposed 21 Measures on Score Classes 1–2 with MSR Paraphrase.

Figure 5: Standard/Average Deviation Results for Proposed 21 Measures on Score Classes 2–3 with MSR Paraphrase.

Figure 6: Standard/Average Deviation Results for Proposed 21 Measures on Score Classes 3–4 with MSR Paraphrase.

Figure 7: Standard/Average Deviation Results for Proposed 21 Measures on Score Classes 4–5 with MSR Paraphrase.

Table 2:

Legend for the 21 Measures Shown in Figures 2–7.

1. NVAA-SWS-Avg
2. NVAA-SWS-Max
3. NVAA-SWS-No Red
4. NVAA-SWS-No Red/Reordered
5. NVAA-SWS-No Red/Reordered/Zeroed
6. NVAA-SWS-Weighted
7. NVAA-SWS-Weighted Avg
8. NVAA-SWS-Weighted Max
9. NVAA-synset-TF/IDF
10. NVAA-TF/IDF
11. NV-SWS-Avg
12. NV-SWS-Max
13. NV-SWS-No Red
14. NV-SWS-No Red/Reordered
15. NV-SWS-No Red/Reordered/Zeroed
16. NV-SWS-Weighted
17. NV-SWS-Weighted Avg
18. NV-SWS-Weighted Max
19. NV-synset-TF/IDF
20. NV-TF/IDF
21. Sentence-TF/IDF

6.2 Evaluation with Li's Benchmark and Comparison with Lee et al. [21]

For the purpose of comparing our proposed measures with earlier measures, we computed the correlation, standard deviation, and average deviation over the 29 sentence pairs taken from Li's benchmark, as given in Lee et al. [21]. We also classified the sentence pairs by score value with an interval of 1 and computed the same measures. We obtained a maximum correlation of 0.99 over classes 3–4, with a standard deviation of 0.27 and an average deviation of 0.191. Where Lee et al. [21] claim an average correlation of 0.2 over score classes 0–1 and 0.208 over score classes 1–3, we obtain 0.03 and 0.11, respectively, as depicted in Figures 8–12. Table 3 shows the legend for Figures 8–12.

Figure 8: Lee et al. [21] Data Set – Result Analysis for Classes 0–1 Scores.

Figure 9: Lee et al. [21] Data Set – Result Analysis for Classes 1–2 Scores.

Figure 10: Lee et al. [21] Data Set – Result Analysis for Classes 2–3 Scores.

Figure 11: Lee et al. [21] Data Set – Result Analysis for Classes 3–4 Scores.

Figure 12: Sentence Similarity Measures Overall Comparison with Values from Lee et al. [21].

Table 3:

Legend for the 25 Measures Shown in Figures 8–12.

1. LG
2. Li-McLean
3. LSA
4. NVAA-SWS-Avg
5. NVAA-SWS-Max
6. NVAA-SWS-No Red
7. NVAA-SWS-No Red/Reordered
8. NVAA-SWS-No Red/Reordered/Zeroed
9. NVAA-SWS-Weighted
10. NVAA-SWS-Weighted-Avg
11. NVAA-SWS-Weighted-Max
12. NVAA-synset-TF/IDF
13. NVAA-TF/IDF
14. NV-SWS-Avg
15. NV-SWS-Max
16. NV-SWS-No Red
17. NV-SWS-No Red/Reordered
18. NV-SWS-No Red/Reordered/Zeroed
19. NV-SWS-Weighted
20. NV-SWS-Weighted-Avg
21. NV-SWS-Weighted-Max
22. NV-synset-TF/IDF
23. NV-TF/IDF
24. Sentence-TF/IDF
25. SyMSS

6.3 Performance Evaluation for Automated Scoring of Kaggle Short Answer Data Set

Figures 13–18 show the performance comparison of the proposed 21 measures for the automated assessment of brief answers. The measures are checked using the Kaggle data set. Figures 13–18 show the Pearson correlation, standard deviation, and average deviation between the gold score and the calculated score for the 21 measures while evaluating the answers. The performance results are classified with respect to the gold score, i.e. 0–1, 1–2, 2–3, 3–4, and 4–5. Finally, Figure 18 shows the overall assessment results, i.e. the comparison made on the overall Kaggle data set.

Figure 13: Results for Kaggle Brief Question Answer Evaluation Data Set for Score Class (0–1).

Figure 14: Results for Kaggle Brief Question Answer Evaluation Data Set for Score Class (1–2).

Figure 15: Results for Kaggle Brief Question Answer Evaluation Data Set for Score Class (2–3).

Figure 16: Results for Kaggle Brief Question Answer Evaluation Data Set for Score Class (3–4).

Figure 17: Results for Kaggle Brief Question Answer Evaluation Data Set for Score Class (4–5).

Figure 18: Overall Results for Kaggle Brief Question Answer Evaluation Data Set.

An overall analysis shows that our proposed NVAA measure with no dimensionality reduction (or its variations) and NVAA with average-based dimensionality reduction outperform the other models analyzed and the standard TF/IDF method.

6.3.1 Performance Evaluation for Automated Scoring of the Kaggle Essay Data Set

We evaluated 1376 essays from the Kaggle essay data set and computed the average deviation compared with the average human score. Our proposed measures showed a minimum average deviation of 0.091 and a correlation of 0.689. The full results are given in Figure 19.

Figure 19: Overall Results for Kaggle Essay Scoring.

7 Conclusion

This paper presented an approach for the design of an automatic answer script evaluation system based on our proposed noun–verb–adjective–adverb and noun–verb similarity measures. The two proposed similarity measures were evaluated based on several parameters, such as the BS/document vector formation method, word similarity, and use of dimensionality reduction. The evaluation was performed on the proposed and existing models using different data sets: the MSR paraphrase corpus, Li's benchmark corpus, and the Kaggle short answer/essay data set. The performance was measured using Pearson correlation, which shows that, among the base-set formation methods, the noun–verb–adjective–adverb vector is better, and that, among the dimensionality optimizations, the full matrix with reordering and average-based dimensionality reduction outweigh other existing models. Thus, the proposed automated evaluation system used the proposed similarity measures, and its scores were compared with human scores. The performance results show that the system scores correlate with the human scores.

Bibliography

[1] J. A. Aslam and M. Frost, An information-theoretic measure for document similarity, in: Proceedings of the 26th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 449–450, ACM, New York, NY, USA, 2003. doi:10.1145/860435.860545.

[2] E. Bingham and H. Mannila, Random projection in dimensionality reduction: applications to image and text data, in: Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 245–250, ACM, 2001. doi:10.1145/502512.502546.

[3] J. Burstein, K. Kukich, S. Wolff, L. Chi and M. Chodorow, Enriching automated essay scoring using discourse marking, in: Proceedings of the Workshop on Discourse Relations and Discourse Marking, Annual Meeting of the Association of Computational Linguistics, Montreal, Canada, 1998.

[4] J. Burstein, C. Leacock and R. Swartz, Automated evaluation of essays and short answers, in: Proceedings of the 6th International Computer Assisted Assessment Conference, edited by M. Danson, Loughborough, UK, 2001.

[5] H. Chim and X. Deng, Efficient phrase-based document similarity for clustering, IEEE Trans. Knowledge Data Eng. 20 (2008), 1217–1229. doi:10.1109/TKDE.2008.50.

[6] C. G. González, W. Bonventi Jr. and A. V. Rodrigues, Density of closed balls in real-valued and autometrized boolean spaces for clustering applications, in: Advances in Artificial Intelligence – SBIA 2008, pp. 8–22, Springer, Berlin, Heidelberg, 2008. doi:10.1007/978-3-540-88190-2_7.

[7] R. W. Hamming, Error detecting and error correcting codes, Bell Syst. Tech. J. 29 (1950), 147–160. doi:10.1002/j.1538-7305.1950.tb00463.x.

[8] J. Han, M. Kamber and J. Pei, Data Mining: Concepts and Techniques, Morgan Kaufmann Publishers, Burlington, MA, USA, 2011.

[9] M. A. Hearst, The debate on automated essay grading, IEEE Intell. Syst. 15 (2000), 22–37. doi:10.1109/5254.889104.

[10] C. Ho, M. A. A. Murad, R. A. Kadir and S. C. Doraisamy, Word sense disambiguation-based sentence similarity, in: Proceedings of the 23rd International Conference on Computational Linguistics: Posters, pp. 418–426, Association for Computational Linguistics, Stroudsburg, PA, USA, 2010.

[11] X. Hu and H. Xia, Automated assessment system for subjective questions based on LSI, in: Intelligent Information Technology and Security Informatics (IITSI), 2010 Third International Symposium on, pp. 250–254, IEEE, 2010. doi:10.1109/IITSI.2010.76.

[12] J. Jerrams-Smith, V. Soh and D. Callear, Bridging gaps in computerized assessment of texts, in: Proceedings of the International Conference on Advanced Learning Technologies, pp. 139–140, IEEE, 2001.

[13] T. Joachims, A Probabilistic Analysis of the Rocchio Algorithm with TFIDF for Text Categorization, No. CMU-CS-96-118, Department of Computer Science, Carnegie Mellon University, Pittsburgh, PA, 1996.

[14] T. Joachims and F. Sebastiani, Guest editors' introduction to the special issue on automated text categorization, J. Intell. Inf. Syst. 18 (2002), 103–105. doi:10.1023/A:1013652626023.

[15] D. Kanejiya, A. Kumar and S. Prasad, Automatic evaluation of students' answers using syntactically enhanced LSA, in: Proceedings of the HLT-NAACL 03 Workshop on Building Educational Applications Using Natural Language Processing, vol. 2, pp. 53–60, Association for Computational Linguistics, 2003. doi:10.3115/1118894.1118902.

[16] H. Kim, P. Howland and H. Park, Dimension reduction in text classification with support vector machines, J. Machine Learn. Res. 6 (2005), 37–53.

[17] K. Knight, Mining online text, Commun. ACM 42 (1999), 58–61. doi:10.1145/319382.319394.

[18] S. Kullback and R. A. Leibler, On information and sufficiency, Ann. Math. Stat. 22 (1951), 79–86. doi:10.1214/aoms/1177729694.

[19] A. B. Lajis and N. A. Aziz, Part-of-speech in a node-link scoring techniques for assessing learners' understanding, Procedia Soc. Behav. Sci. 27 (2011), 131–139. doi:10.1016/j.sbspro.2011.10.591.

[20] M. C. Lee, A novel sentence similarity measure for semantic-based expert systems, Expert Syst. Appl. 38 (2011), 6392–6399. doi:10.1016/j.eswa.2010.10.043.

[21] M. C. Lee, J. W. Chang and T. C. Hsieh, A grammar-based semantic similarity algorithm for natural language sentences, Sci. World J. 2014 (2014), 437162. doi:10.1155/2014/437162.

[22] Y. Li, D. McLean, Z. A. Bandar, J. D. O'Shea and K. Crockett, Sentence similarity based on semantic nets and corpus statistics, IEEE Trans. Knowledge Data Eng. 18 (2006), 1138–1150. doi:10.1109/TKDE.2006.130.

[23] L. Li, X. Hu, X. Hu, J. Wang and Y. M. Zhou, Measuring sentence similarity from different aspects, in: Machine Learning and Cybernetics, 2009 International Conference on, vol. 4, pp. 2244–2249, IEEE, 2009.

[24] D. Lin, An information-theoretic definition of similarity, in: Proceedings of ICML 1998, pp. 296–304, 1998.

[25] M. G. Michie, Use of the Bray-Curtis similarity measure in cluster analysis of foraminiferal data, J. Int. Assoc. Math. Geol. 14 (1982), 661–667. doi:10.1007/BF01033886.

[26] K. O'Shea, Z. Bandar and K. Crockett, A novel approach for constructing conversational agents using sentence similarity measures, in: Proceedings of the World Congress on Engineering, vol. 1, 2008.

[27] E. B. Page, The imminence of grading essays by computer, Phi Delta Kappan 47 (1966), 238–243.

[28] K. Papineni, S. Roukos, T. Ward and W. J. Zhu, BLEU: a method for automatic evaluation of machine translation, in: Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, pp. 311–318, Association for Computational Linguistics, 2002. doi:10.3115/1073083.1073135.

[29] D. Pérez, O. Postolache, E. Alfonseca, D. Cristea and P. Rodriguez, About the effects of using anaphora resolution in assessing free-text student answers, in: Proceedings of RANLP-2005, pp. 380–386, 2005.

[30] M. Ramamurthy and I. Krishnamurthi, Parts of speech based sentence similarity computation measures, Int. J. App. Eng. Res. 10 (2015), 20176–20184.

[31] G. Salton and M. J. McGill, Introduction to Modern Information Retrieval, McGraw Hill Book Co., New York, 1983.

[32] S. Saxena and P. R. Gupta, Automatic assessment of short text answers from computer science domain through pattern based information extraction, in: Proceeding of ASCNT, pp. 109–118, 2009.

[33] T. W. Schoenharl and G. Madey, Evaluation of measurement techniques for the validation of agent-based simulations against streaming data, in: Computational Science, ICCS 2008, pp. 6–15, Springer, Berlin, Heidelberg, 2008. doi:10.1007/978-3-540-69389-5_3.

[34] F. Sebastiani, Machine learning in automated text categorization, ACM Comput. Surv. (CSUR) 34 (2002), 1–47. doi:10.1145/505282.505283.

[35] J. Shan, Z. Liu and W. Zhou, Sentence similarity measure based on events and content words, in: Fuzzy Systems and Knowledge Discovery (FSKD 2009), vol. 7, pp. 623–627, IEEE, 2009. doi:10.1109/FSKD.2009.926.

[36] R. Siddiqi and C. J. Harrison, On the automated assessment of short free-text responses, in: IAEA Conference Paper, 2008.

[37] R. Silipo, I. Adae, A. Hart and M. Berthold, Seven Techniques for Dimensionality Reduction, www.knime.org/files/knime_seventechniquesdatadimreduction.pdf, 2014.

[38] A. Strehl and J. Ghosh, Value-based customer grouping from large retail data sets, in: AeroSense 2000, pp. 33–42, International Society for Optics and Photonics, Bellingham, WA, 2000. doi:10.1117/12.381756.

[39] P. N. Tan, M. Steinbach and V. Kumar, Introduction to Data Mining, Vol. 1, Pearson Addison Wesley, Boston, 2006.

[40] Z. Wu and M. Palmer, Verbs semantics and lexical selection, in: Proceedings of the 32nd Annual Meeting on Association for Computational Linguistics, pp. 133–138, Association for Computational Linguistics, Stroudsburg, PA, USA, 1994. doi:10.3115/981732.981751.

[41] D. Yang and D. M. Powers, Measuring semantic similarity in the taxonomy of WordNet, in: Proceedings of the Twenty-eighth Australasian Conference on Computer Science, vol. 38, pp. 315–322, Australian Computer Society, Inc., 2005.

Received: 2015-4-8
Published Online: 2016-5-11
Published in Print: 2017-4-1

©2017 Walter de Gruyter GmbH, Berlin/Boston

This article is distributed under the terms of the Creative Commons Attribution Non-Commercial License, which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.
