Introduction

The 2010 Common Core State Standards for student learning emphasize the ability of students as early as fourth grade to construct essays in which they interpret and evaluate a text, construct logical arguments based on substantive claims, and marshal appropriate evidence in support of these claims (Correnti et al. 2013). The Response to Text Assessment (RTA) was developed for research purposes to assess skill at generating analytical text-based writing, and to provide an outcome measure that is independent of a state’s accountability test. Specifically, the RTA, unlike available large-scale assessments, is designed to evaluate the integration of reading comprehension and writing skills (Correnti et al. 2013). Our research takes a first step towards developing an automatic essay assessment system for the RTA. Our goal is to develop a tool that can further large-scale research on the impact of instruction, interventions, and policies that influence the development of this writing skill.

Because scoring text-based writing assessments is typically labor intensive and requires extensive training and expertise on the part of raters to obtain reliable scores, automated essay scoring has been proposed as a fast, effective, and affordable solution to the problem of assessing student writing at scale. For example, a recent contrastive analysis of 9 state-of-the-art systems on 8 essay scoring prompts drawn from high-stakes assessments claimed that Automated Essay Scoring (AES) systems had as high a level of agreement with human graders as human graders had with each other (Shermis and Hamner 2012). However, critics of AES argue that AES scores typically under-represent the construct of writing (Condon 2013; Perelman 2013) and even ardent supporters of AES acknowledge its limitations (Shermis and Hamner 2012; Deane 2013).

First, many essay assessment systems rely on holistic rather than trait-based rubrics (Attali and Burstein 2006; Elliot 2003; Page 2003; Attali et al. 2013), and thus tend to focus on summative rather than formative assessment. While holistic methods are typically more efficient and provide more reliable scores, trait-based methods are better at providing diagnostic insight on student performance (Bacha 2001; Weigle 2002). Such insight is particularly useful for systems that not only score but also provide formative feedback. Even when systems do trait-based scoring, critics maintain that trait-based AES has focused on surface dimensions of writing such as grammar rather than more substantive dimensions (Attali and Powers 2008; Perelman 2012). Our system for automatically scoring the RTA is trait-based rather than holistic, scores two of the RTA’s substantive writing traits (namely, Evidence and Organization), and is motivated by formative rather than summative assessment.

Second, in terms of writing tasks, most systems (whether holistic or trait-based) focus on assessing writing in response to open-ended prompts (Attali and Burstein 2006; Crossley et al. 2013; Elliot 2003; Lee et al. 2008; Page 2003; Klebanov and Higgins 2012) rather than in response to text. Such systems usually use generic rubrics rather than task-specific ones. One advantage of task-dependent rubrics is the ability to provide feedback that is better aligned with the task. Existing systems also do not explicitly evaluate the quality of reasoning based on information from only the text, and instead evaluate dimensions such as structure, elaboration, and vocabulary sophistication (Shermis and Burstein 2003). Our system for automatically scoring the RTA focuses on assessing writing in response to text using task-dependent rubrics.

Third, in terms of scoring method, many AES systems do not consider construct validity (Condon 2013; Perelman 2012). Existing AES systems are limited in their evaluation of higher-order aspects of writing, such as the quality of content and its organization. For example, AES achieves high reliability in the evaluation of content and ideas mostly by using “bag-of-words” approaches that bear little relationship to the scoring rubric for the construct (Landauer et al. 1998; Attali and Burstein 2006; Attali 2011). In contrast, our model for automatically scoring the RTA is consistent with the rubric criteria and easily explainable. Others in the AES community similarly argue that automated scoring models should reflect important aspects of the construct being measured, following common practice in the measurement community. That is, dimensions of the construct should be well represented by the features used in the scoring model, and the features contained in the model should not be irrelevant to the rubric for the construct (Loukina et al. 2015). A model with construct validity has greater potential to generate useful formative feedback to students and teachers.

Finally, current AES systems typically score writing that is generated by upper middle-school, secondary, post-secondary students, or by adults for a high-stakes exam (Burstein et al. 1999; Deane et al. 2013; Klebanov and Higgins 2012). For example, the sample of essays in the contrastive analysis of Shermis and Hamner (2012) described above was from Grades 7, 8, and 10. Our work, in contrast, focuses on writing in Grades 5 through 8, which poses challenges for existing AES methods as RTA essays are typically shorter, contain more grammatical and spelling errors, and are less sophisticated in terms of use and organization of evidence. Our work thus tackles the challenge of using computational techniques on data that is particularly noisy given the stage of writing development of the students.

In the following sections, we first review previous research on this topic. Next, we describe the data, the rubric dimensions, and the prompt used in our study. Then, we explain the two models we designed to extract features for the Evidence and Organization dimensions of our rubric. Next, we discuss the experiments and results. Finally, we recap our conclusions and discuss future work. Our results show that, in general, our rubric-based task-dependent model performs as well as (if not better than) the rigorous baselines we used. Moreover, the combination of our new features with the baseline features often yields better results than either the proposed or baseline features in isolation. Both within-corpus and cross-corpus experiments yield similar conclusions, supporting the robustness of our approach. Finally, feature ablation studies suggest that feature utility is related to rubric coverage.

Related Work

Natural Language Processing techniques have been used to evaluate both the content and organization of writing. One approach to evaluating the content of student essays is to detect whether they are off-topic or on-topic (Louis and Higgins 2010; Higgins et al. 2006). Adherence to the prompt (Persing and Ng 2014) is another way to measure text topicality. Yet another approach to estimating the quality of content is to compare the essay to sets of training essays with different scores (Attali and Burstein 2006; Kakkonen et al. 2005; Xie et al. 2012). These prior studies differ from our response-to-text task in that they do not target source-based writing, in which the quality of content should be measured with regard to how the essays use the source material.

Source-based writing refers to types of writing that require students to generate responses that are based on and that reference one or multiple source text(s). Generally, responses are expected to demonstrate close reading and deep comprehension of texts through effective use of evidence from the source text(s). For example, having read a novel, students might be asked to analyze the main theme, providing evidence from the novel. Or, having read two articles representing opposing viewpoints on a topic, students might be asked to write an opinion or argumentative essay in which they use points from the text to support their claim or rebut the opposing perspective. Professional standards for literacy in K-12 education are increasingly emphasizing such source-based writing (e.g., NCTE/IRA, 2012; NGAC/CCSSO, 2011).Footnote 1

In contrast, quality of content is evaluated with regard to integrating information from the source materials in Kakkonen et al. (2005) and Lemaire and Dessus (2001). These studies also differ from our task. In our work, we care about the pieces of evidence that students provide from the source material, so the task goes beyond simply deciding whether the essay is semantically similar to the source material. To score essays based on the criteria in the rubric and to be able to provide feedback based on the detailed information in the essays, we need to localize pieces of evidence. With an ultimate goal of scoring essays, Klebanov et al. (2014) evaluated different content importance models that help predict which parts of the source material should be selected by the students. That study follows a similar direction to our preliminary study (Rahimi and Litman 2016) on automatically extracting important pieces of evidence from the source material.

Another related area of research is to first find argumentation components using argumentation mining techniques, and then use the results of argumentation mining for scoring the essays (Ong et al. 2014; Burstein et al. 2003a; Song et al. 2014; Persing and Ng 2015). Argumentation mining in the domain of essay evaluation is mostly applied to persuasive essay corpora written in response to a prompt (Stab and Gurevych 2014a, 2014b) rather than to source-based writing. Similarly, the definition of Evidence in our task is tied to the source material and differs from more general definitions of Evidence, Premise, etc. in persuasive essays. Another difference from prior work is that in our study the essays are written by young students, and we do not expect them to follow a sophisticated argumentation structure.

As a construct, ‘Organization’ has figured in systems for scoring student writing for decades. When organization is treated as a separate dimension, it is often measured by surface features such as effective sequencing, a strong inviting beginning, a strong satisfying conclusion, and smooth transitions.Footnote 2 Assessments aligned to the Common Core State Standards (CCSS), the academic standards adoptedFootnote 3 widely in 2011 that guide K-12 education, reflect a shift in thinking about the scoring of organization in writing to consider the coherence of ideas in the text.Footnote 4 The consideration of idea coherence as a critical aspect of the organization of writing is relatively new.

Notably, prior studies in natural language processing have examined the concept of discourse coherence, which is highly related to the coherence of topics in an essay, as a measure of the organization of analytic writing. For example, in Somasundaran et al. (2014) the coherence elements are adherence to the essay topic, elaboration, usage of varied vocabulary, and sound organization of thoughts and ideas. In Scott and McNamara (2011) the elements are effective lead, clear purpose, clear plan, topic sentences, paragraph transitions, organization, unity, perspective, conviction, grammar, syntax, and mechanics.

Many computational methods have been used to measure such elements of discourse coherence. Vector-based similarity methods measure lexical relatedness between text segments (Foltz et al. 1998) or between discourse segments (Higgins et al. 2004). Centering theory (Grosz et al. 1995) addresses local coherence (Miltsakaki and Kukich 2000). Entity-based essay representation along with type/token ratios for each syntactic role is another method to evaluate coherence (Burstein et al. 2010), shown in Burstein et al. (2013) to be a predictive model on a corpus of essays from grades 6-12. Lexical chaining addresses multiple aspects of coherence such as elaboration, usage of varied vocabulary, and sound organization of thoughts and ideas (Somasundaran et al. 2014). Discourse structure has been used to measure the organization of argumentative writing (Cohen 1987; Burstein et al. 1998, 2003b). All of these works rely on lexical information to measure coherence. In contrast, our proposed model uses more coarse-grained topic information. Based on the rubric, we are interested in localizing the pieces of evidence for different topics in essays and evaluating the transitions between these topics. For this purpose, we propose the concepts of the topic-grid and the topic-chain.

In previous studies, assessments of text coherence have been task-independent, meaning that these models are designed to evaluate the coherence of a response to any writing task. Task-independence is often the goal for automated scoring systems, but it is also important to measure the quality of students’ organization skills when they are responding to a task-dependent prompt. One advantage of task-dependent scores is the ability to provide feedback that is better aligned with the task. Our model for evaluating the Organization dimension is task-dependent: it is designed around the detailed criteria in the rubric and makes use of the source material by evaluating the transitions among important topics and pieces of evidence adopted from the source.

Our preliminary studies addressing the task-dependent automatic scoring of both the Evidence (Rahimi et al. 2014) and Organization (Rahimi et al. 2015) dimensions of the RTA were motivated by the differences from prior work discussed above. Our initial method for localizing and analyzing the quality of Evidence in source-based writing was presented in Rahimi et al. (2014) and evaluated on a corpus of essays from grades 5–6. Here we extend this earlier work by taking advantage of a second corpus of essays from grades 6–8 (obtained from a different school district) to conduct new types of evaluations, such as cross-validation within each corpus separately and combined, and cross-corpus training versus testing. We also address an unbalanced score distribution issue that occurs in both corpora using an oversampling method, and conduct new feature ablation studies. Our initial method for analyzing the organization of ideas and evidence in source-based writing was presented in Rahimi et al. (2015). Motivated by the desire to experiment on a larger corpus, in the current paper we conduct several new evaluations that combine our two available datasets from different grades and schools into a third, larger corpus.

Data

Our data consists of students’ writings from the RTA introduced in Correnti et al. (2013). Specifically, we have datasets from two different age groups (grades 5–6 and grades 6–8) which represent different levels of writing proficiency. The two datasets are also from two different school districts.

The administration of the RTA involved having the classroom teacher read aloud a text while students followed along with their own copy. The text is an article from Time for Kids about a United Nations effort (the Millennium Villages Project) to eradicate poverty in a rural village in Kenya. After a guided discussion of the article as part of the read-aloud, students wrote an essay in response to a prompt that requires them to make a claim and support it using details from the text. A small excerpt from the article, the prompt, and three student essays from grades 5–6 are shown in Table 1.

Table 1 A small excerpt from the Time for Kids article, the prompt, and sample low and high-scoring essays with supporting evidence in bold from grades 5–6

Our datasets (particularly responses by students in grades 5-6) have a number of properties that may increase the difficulty of the automatic essay assessment task. The essays in our datasets are shortFootnote 5 and have many spellingFootnote 6 and grammatical errors. Some statistics about the datasets are in Table 2. On average the essays in the 6–8 dataset are longer than essays in the 5–6 dataset. They have more unique words and longer sentences.

Table 2 The two datasets’ statistics

The student responses have been assessed on five dimensions (Analysis, Evidence, Organization, Style/Vocabulary, and MUGS (mechanics/usage/grammar/syntax)), each on a scale of 1-4 (Correnti et al. 2013). The Analysis dimension concerns addressing the prompt, understanding the text, and drawing insightful and clear conclusions. The Evidence dimension relates to demonstrating integral use of selected details from the text to support the claim. The Organization rubric concerns the clear structure of the essay and the logical flow of ideas. The Style rubric addresses the use of sophisticated language and vocabulary. Finally, MUGS concerns errors in mechanics, usage, grammar, and syntax. The standards stay fixed across grade levels (and thus across the datasets). Half of the assessments were scored by an expert. The rest were scored by undergraduate students trained to evaluate the essays based on the criteria. All raters were blind to the grades to which the essays belonged. The corpus from grades 5–6 consists of 1569 essays, with 602 of them double-scored for inter-rater reliability. The other corpus includes 809 essays, with almost all of them (802) double-scored (9 of these essays do not have a score for the Evidence dimension). Inter-rater agreement (Quadratic Weighted Kappa) on the double-scored portions of the grades 5-6 and 6-8 corpora is 0.67 and 0.73, respectively, for Evidence and 0.68 and 0.69 for Organization.

The correlations between the Organization and Evidence scores (for rater 1) for the 5-6 and 6-8 corpora are (Pearson = 0.55, Spearman = 0.54) and (Pearson = 0.50, Spearman = 0.48), respectively, with all p-values = 0.0001. It is possible for an essay to score well on one dimension but poorly on the other, although it is more common to have a good Organization score but a poor Evidence score than vice versa. As shown in Table 3, in the 5–6 dataset there are 48 essays with a poor Organization score but a good Evidence score and 65 essays vice versa (the upper right and middle left triangles, respectively). In the 6–8 dataset there are only 8 essays with a poor Organization score but a good Evidence score and 79 essays vice versa (the middle right and lower left triangles, respectively).

Table 3 The distribution of the Evidence and the Organization scores with respect to each other on the two datasets

In this paper we focus only on predicting the scores of the Evidence and Organization dimensions,Footnote 7 which are the two dimensions most related to argumentation. The distributions of the Evidence and Organization scores are shown in Table 4. Higher scores on the 6–8 corpus indicate that the essays in this dataset are of better quality in terms of Evidence and Organization than the student essays in the 5–6 dataset. The rubrics for the Evidence and Organization dimensions are shown in Tables 5 and 7, respectively.

Table 4 The distribution of the Evidence and the Organization scores on the two datasets
Table 5 Rubric for the Evidence dimension of RTA
Table 6 Feature vector representation of the high and low-scoring Evidence essays from Table 1
Table 7 Rubric for the Organization dimension of RTA

Modeling the Source Article

To build both the Evidence and Organization models, we use the information in the source “Time For Kids” text, for which an exhaustive list of topics, important topic words, and examples was provided manually by experts (see Tables 17, 18, and 19 in the Appendix). Similarly, in other studies on the evaluation of content (typically in short answer scoring), the identification of concepts and topics is often manual (Liu et al. 2014). First, experts provide a list of important words for each of the main topics in the article (Table 17). Second, the experts provide a comprehensive list of topics which includes every specific example from the text related to each topic (Table 19). Since the source text explicitly addresses the conditions in a Kenyan village before and after the United Nations intervention, and since the prompt leads students to discuss the contrasting conditions at these different time points, the topics provide evidence for the “before” and “after” states. That is, except for some topics which do not have a temporal aspect, for each major topic t the experts define two sub-topics t_before and t_after by listing specific examples related to each sub-topic. Finally, the experts remove the temporal aspect of topics from the comprehensive list of examples by merging the “after” states into a single “Topic7”, which is about the progress made in the village as it was originally represented in the article (Table 18). This is because in modeling the Evidence dimension, we do not care about the temporal aspect of the topics.

Modeling the Evidence Dimension

Introduction to Evidence Rubric

The Evidence rubric (see Table 5) takes into account four criteria related to the quality of text evidence provided in the response. First, we consider the number of pieces of evidence used. More evidence (i.e., more than three pieces) is scored higher. Second, we consider the relevance of the evidence to the central idea. Writing that includes cogent evidence is scored high, while writing that provides irrelevant details is scored low. The third criterion is the specificity of the evidence provided. Writing that features detailed, specific evidence is scored high, while responses that feature cursory, general references are scored low. Finally, the extent to which the evidence is elaborated upon is considered. Strong responses feature evidence that helps support and develop the main idea. Evidence is weak when it is just presented as a short phrase or listed in a sentence. The rubric also notes that when the response features a summary of the whole text or directly copies from the source text, it automatically receives a score of 1.

Features to Model the Rubric

As discussed above, one goal of our research in predicting scores is to design a small set of rubric-based, meaningful features that performs acceptably and also models what is actually important in the rubric. To this end, we designed several groups of features, each primarily addressing one criterion in the rubric. Below, we explain each group of features and its relation to the rubric. Each group of features is indicated with an abbreviation that relates it to the corresponding criterion in the rubric in Table 5.

(1) Number of Pieces of Evidence (NPE)

addresses the first row of the rubric, e.g., if there are fewer than 2 pieces of evidence, score the essay as 1. For calculating NPE, we use manually provided topics in Table 17. Any information in the essays that is related to these text-based topics will be considered as a piece of evidence. We use a simple window-based algorithm with fixed window-sizeFootnote 8 to calculate NPE. A window contains evidence related to a topic if there are at least two words from the list of words for that topic.Footnote 9 Each topic is only counted as a piece of evidence once to avoid redundancy.

(2) Concentration (CON)

If the essay consists of a brief, non-specific list of different pieces of evidence without any elaboration, it has a high concentration and should receive a score of 1 or 2. We define concentration as a binary feature that indicates whether the essay is highly concentrated. To calculate this feature, we count the number of sentences that contain at least one topic word. If there are fewer than three such sentences, the concentration is high, meaning that the topic words are spread over only a few sentences; elaborated evidence, in contrast, should be reflected in at least three sentences containing topic words.

(3) Specificity (SPC)

High-quality evidence includes specific examples from different parts of the text, or an explanation of why the evidence is important. We use the manually provided list of topics and examples in Table 18. For each of the examples we need to answer the question of whether the student talked about this specific example or not. The specificity feature is therefore a vector of integer values, where each value is the number of examples from the text mentioned in the essay for a single topic. We use the same window-based algorithm that we use for NPE to calculate each value of the vector.

(4) Word Count (WOC)

is used as a fallback feature because our features do not yet completely cover all rubric cells, and in prior work and in our own data, longer essays tend to receive higher scores.
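To make the feature computation concrete, the following minimal sketch (in Python) implements the window-based features described above. The TOPIC_WORDS and TOPIC_EXAMPLES dictionaries are illustrative placeholders rather than the actual expert-provided lists in Tables 17 and 18, and the window size and overlap thresholds shown here are assumptions for illustration only.

```python
import re

# Illustrative stand-ins for the expert-provided topic words (Table 17) and
# the merged topic/example list (Table 18); the real lists are larger.
TOPIC_WORDS = {
    "hospital": {"hospital", "medicine", "malaria", "sick", "treated"},
    "school":   {"school", "students", "supplies", "meals"},
    "farming":  {"crops", "fertilizer", "irrigation", "harvest"},
    "progress": {"poverty", "progress", "achievable", "lifetime"},
}
TOPIC_EXAMPLES = {
    "hospital": [{"afford", "medicine"}, {"free", "medicine", "diseases"}],
    "progress": [{"winning", "fight", "poverty", "achievable"}],
}

WINDOW_SIZE = 10  # assumed here; the value actually used is given in a footnote

def tokens(text):
    return re.findall(r"[a-z']+", text.lower())

def windows(words, size=WINDOW_SIZE):
    for i in range(max(1, len(words) - size + 1)):
        yield set(words[i:i + size])

def npe(essay):
    """Number of Pieces of Evidence: topics with >= 2 topic words in some window."""
    found = set()
    for window in windows(tokens(essay)):
        for topic, topic_words in TOPIC_WORDS.items():
            if len(window & topic_words) >= 2:
                found.add(topic)            # each topic is counted only once
    return len(found)

def con(essay):
    """Concentration: 1 if fewer than three sentences contain a topic word."""
    all_topic_words = set().union(*TOPIC_WORDS.values())
    sentences = re.split(r"[.!?]+", essay)
    covered = sum(1 for s in sentences if set(tokens(s)) & all_topic_words)
    return int(covered < 3)

def spc(essay):
    """Specificity: per topic, the number of listed examples mentioned in the essay."""
    essay_windows = list(windows(tokens(essay)))
    return {
        topic: sum(
            1 for example in examples
            if any(len(w & example) >= 2 for w in essay_windows)
        )
        for topic, examples in TOPIC_EXAMPLES.items()
    }

def woc(essay):
    """Word count, used as a fallback feature."""
    return len(tokens(essay))
```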

The values of the features for the example essays with Evidence scores of 4 and 1 in Table 1 are shown in Table 6. For the high-scoring essay, the value of the NPE feature is four because all four topics in Table 17 are mentioned in the essay. The essay is not concentrated (CON=0) because there are more than three sentences with topic words in the essay. The Specificity values for topics three and eight are 1 because of these two pieces of evidence, respectively: couldn’t afford medicine and winning the fight of poverty is achievable, which are bolded in Table 1.

For the low-scoring essay, the value of the NPE feature is one because only one topic from Table 17 is mentioned in the essay. The essay is concentrated (CON=1) because there are fewer than three sentences with topic words in the essay. The Specificity values for topics three and four are 1 because of adults or kids and kids are dieing every day, respectively. The value for topic eight is 1 because of winning the fight of poverty is achievable; all of these phrases are bolded in Table 1.

Based on the defined features, we imagine generating feedback that points students to alternative sources of evidence, that highlights the need to elaborate on the included evidence, or that suggests that students be more specific in their usage of evidence. For example, a student could be given feedback such as “You provided evidence about malaria as a condition of poverty that was improved, but there is other relevant evidence in the text that you also need to focus on, such as the lack of fertilizer for crops.” For teachers, we envision providing summary information such as students’ weakness in elaborating on the evidence they provided.

Modeling the Organization Dimension

Introduction to Organization Rubric

The Organization dimension rubric in Table 7 consists of four main criteria that relate to how, and to what extent, students present their ideas in an organized and logical way. The first criterion is Adherence to Main Idea. This concerns the extent to which the written response focuses clearly on a key idea. Weakly organized responses often stray from the intended main idea. The second criterion is Sense of Beginning-Middle-End. Here, the expectation is that strong writing has easily identifiable sections, often signalled by introductory and concluding paragraphs and sentences. Such elements are lacking in weak writing. Third, organization concerns the clarity with which ideas are presented. One idea should be addressed before another is brought up. Ideally, too, different ideas should be treated in different paragraphs. Weakly organized writing treats ideas in little or no discernible order. The fourth criterion concerns sentence-to-sentence flow. In strong writing, this flow is logical and seamless; in contrast, weak writing may sound rambling. Finally, the rubric makes note of a special rule: when the response consists mostly of a summary or word-for-word copying of the text, it automatically receives a score of 1, because the organization of the response is necessarily the organization of the original text and does not reflect the student’s own efforts at organization.

Topic-Grid and Topic-Chains

Lexical chains (Somasundaran et al. 2014) and entity grids (Burstein et al. 2010) have been used to measure lexical cohesion. In other words, these models measure the continuity of lexical meaning. Lexical chains are sequences of related words characterized by the relation between the words, as well as by their distance and density within a given span. Entity grids capture how the same word appears in a syntactic role (Subject, Object, Other) across adjacent sentences.

Intuitively, we hypothesize that these models will not perform as well on short, noisy, and low quality essays as on longer, better written essays. When the essays are short, noisy, and of low quality (i.e., limited writing proficiency), the syntactic information produced automatically by the parser may not be reliable. Moreover, even when there is elaboration on a single topic (continuation of meaning), there may not be repetition of identical or similar words. This is because words that relate to a given topic in the context of the article may not be deemed similar according to external similarity sources such as WordNet. Take, for example, the following two sentences:

“The hospitals were in bad situation. There was no electricity or water.”

In the entity grid model, there would be no transition between these two sentences because there are no identical words. The semantic similarity of the nouns “hospitals” and “water” is very low and there would not be any chain including a relation between the words “hospitals”, “water”, and “electricity”. But if we look at the source document and the topics within it, these two sentences are actually addressing a very specific sub-topic. Therefore, we think there should be a chain containing both of these words and a relation between them. Zhang et al. (2015) addresses a similar issue of capturing information from semantically related entities by leveraging world knowledge such as “Gates is the person who created Microsoft”.

More importantly, what we are really interested in evaluating in this study is the organization and cohesion of pieces of evidence, not lexical cohesion. Taken together, these reasons motivated us to design new topic-grid and topic-chain models (inspired by entity-grids and lexical chains), which are more closely related to our rubric and may be able to overcome the issues mentioned above.

A topic-grid is a grid that shows the presence or absence of each topic addressed in the source text (i.e., the article about poverty) in each text unit of a written response. The rows are analogous to the words in an entity-grid, except here they represent topics instead of individual words. The columns are text units. We consider the unit as a sentence or a sub-sentence (since long sentences can include more than one topic and we don’t want to lose the ordering and transition information from one topic to the next). We explain how we extract the units later in this section.

To build the grids, we use the information in the source text. That is, we use the manually extracted exhaustive list of topics in Table 19, which preserves the temporal aspect discussed in the article. Each text unit of the essay is then automatically labeled with topics using a simple window-based algorithm (with a fixed window size of 10), which relies on the presence and absence of topic words in a sliding window and chooses the topic most similar to the window. (Several equally similar topics might be chosen.) If there are fewer than two words in common with the most similar topic, the window is annotated with no topic. We did not use spelling correction to handle topic words with spelling errors, although this is part of our future plans.

Each column in the grid represents a text unit. A text unit is a whole sentence if the sentence has no disjoint windows annotated with different topics or with different examples from a topic. Otherwise, we break the sentence into multiple text units, each covering a different topic or example (the exact boundaries of the units are not important). Finally, if the labeling process annotates a single window with multiple topics, we add a column to the grid with multiple topics present in it.
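The sketch below illustrates the window-based labeling step under stated assumptions: the before/after topic word lists are illustrative placeholders (not the actual contents of Table 19), and the splitting of sentences into text units is only hinted at in a comment.

```python
import re

# Illustrative stand-ins for the temporally split topic word lists (Table 19).
TOPIC_WORDS = {
    "hospitals_before": {"hospital", "sick", "afford", "cost", "treated"},
    "hospitals_after":  {"free", "medicine", "common", "diseases"},
    "schools_before":   {"school", "fees", "supplies"},
    "schools_after":    {"school", "midday", "meals", "free"},
}

WINDOW_SIZE = 10  # fixed sliding-window size used for labeling

def tokens(text):
    return re.findall(r"[a-z']+", text.lower())

def label_windows(sentence):
    """Label each sliding window with the most similar topic(s), if any.

    A window is labeled with the topic(s) that share the most words with it,
    provided the overlap is at least two words; otherwise it receives no label.
    """
    words = tokens(sentence)
    labels = []
    for i in range(max(1, len(words) - WINDOW_SIZE + 1)):
        window = set(words[i:i + WINDOW_SIZE])
        overlap = {t: len(window & tw) for t, tw in TOPIC_WORDS.items()}
        best = max(overlap.values())
        # several equally similar topics might be chosen for the same window
        labels.append({t for t, o in overlap.items() if o == best and best >= 2})
    return labels

def sentence_topics(sentence):
    """Union of the topics found in a sentence's windows.

    If disjoint windows carry different topics (or different examples of a
    topic), the sentence would be split into multiple text units, each of
    which becomes one column of the topic-grid.
    """
    covered = set()
    for labels in label_windows(sentence):
        covered |= labels
    return covered

# e.g., sentence_topics("they now give free medicine to most common deseases")
# returns {"hospitals_after"} with the illustrative word lists above.
```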

See Table 8 for an example of a topic-grid for the essay with the Organization score of four in Table 1. Consider the following sentence from the essay:

“One example they sued show a great amount oF change when they stated at first most people thall were ill just stayed in the hospital Not even getting treated either because of the cost or the hospital didnt have it, but at the end it stated they now give free medicine to most common deseases”

Table 8 The topic-grid (on the left) and topic-chains (on the right) for the example essay with Organization score of 4 in Table 1

This sentence has two disjoint windows annotated with different topics, so we break it into two text units that cover the two different topics “Hospitals_before” and “Hospitals_after”. The first part of the sentence is a unit that covers “Hospitals_before” because of a window including “Not even getting treated”. The second text unit covers “Hospitals_after” because of a window including “free medicine to most common deseases”. The third column in the grid represents the second unit of this sentence, which is underlined. The “x” in the third column indicates the presence of the topic “Hospitals_after” mentioned above. Topics that are not mentioned in the essay are not included in the grid.

Then, chains are extracted from the grid. We have one chain for each topic t, including both t_before and t_after. Each node in a chain carries two pieces of information: the index of the text unit it appears in and whether it represents a before or after state. Because transitions between temporally oriented topics are the point of interest in designing topic-chains, we ignore topics that do not have a temporal aspect (a before or after state). Examples of topic-chains are presented in Table 8. Finally, we extract several features, explained below, from the grid and the chains to represent criteria from the rubric.
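The following sketch illustrates how chains might be extracted from an already-built grid; the grid is represented here simply as a list of per-unit topic-label sets, and its contents are illustrative.

```python
from collections import defaultdict

def extract_chains(grid):
    """One chain per temporally oriented topic: a list of (unit_index, phase) nodes.

    The grid is a list of columns (text units); each column is the set of
    topic labels present in that unit, assumed to end in "_before" or "_after".
    """
    chains = defaultdict(list)
    for unit_index, labels in enumerate(grid):
        for label in labels:
            topic, _, phase = label.rpartition("_")
            if phase in ("before", "after"):   # topics with no temporal aspect are ignored
                chains[topic].append((unit_index, phase))
    return dict(chains)

# Illustrative grid with four text units:
grid = [
    {"hospitals_before"},
    {"schools_before"},
    {"hospitals_after"},
    {"hospitals_after", "schools_after"},
]
chains = extract_chains(grid)
# chains["hospitals"] == [(0, "before"), (2, "after"), (3, "after")]
# A chain has "variety" if it contains both a before and an after node:
variety = {topic: len({phase for _, phase in nodes}) == 2
           for topic, nodes in chains.items()}
```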

Features to Model the Rubric

As indicated above, one goal of this research in predicting Organization scores is to design a small set of rubric-based features that performs acceptably and also models what is actually important in the rubric. To this end, we designed 5 groups of features, each addressing criteria in the rubric. Some of these features are not new and have been used before to evaluate the organization and coherence of the essay; however, the features based on the topic-grid and topic-chains (inspired by entity-grids and lexical chains) are new and designed for this study. The use of before and after information to extract features is based on the rubric and the nature of the prompt.Footnote 10 Below, we explain each of the features and its relation to the rubric. Each group of features is indicated with an abbreviation that relates it to the corresponding criteria in the rubric in Table 7.

(1) Surface (SUR)

captures the surface aspect of organization; it includes two features: number of paragraphs and average sentence length. Multiple paragraphs and medium-length sentences help readers follow the essays more easily.

(2) Discourse Structure (DIS)

investigates the discourse elements in the essays. We cannot expect the essays written by students in grades 5-8 to have all the discourse elements mentioned in Burstein et al. (2003a), as might be expected of more sophisticated writers. Indeed, most of the essays in our corpora are short and single-paragraph (the median number of paragraphs is one). In terms of structure, then, taking cues from the rubric, we are interested in the extent to which an essay has a clear beginning idea, a concluding sentence, and a well-developed middle. We define two binary features, beginning and ending. In the topic list, there is a general topic that represents general statements from the text and the prompt. If this topic is present at the beginning or at the end of the grid, the corresponding feature gets a value of 1. A third feature measures whether the beginning and the ending match. We measure the LSA similarity (Landauer et al. 1998) of 1 to 3 sentences from the beginning and the ending of the essay, with respect to the length of the essay; the number of sentences is chosen based on the average essay length. The LSA space is trained on the source document and the essays in the training corpus.

(3) Local Coherence and Paragraph Transitions (LCPT)

Local coherence addresses the rubric criterion related to logical sentence-to-sentence flow. It is measured by the average LSA (Foltz et al. 1998) similarity of adjacent sentences. Paragraph transitions capture the rubric criterion of discussing different topics in different paragraphs, measured by the average LSA similarity of all paragraphs (Foltz et al. 1998). For an essay in which each paragraph addresses a different topic, the LSA similarity of paragraphs should be lower than for an essay in which the same topic appears in different paragraphs. For one-paragraph essays, we divide the essay into three equal parts and calculate the similarity of these parts.

The average LSA similarity of text units (sentences or paragraphs) is calculated as follows. A semantic space is constructed based on the essays in the training set. The vector for each text unit is computed (as the weighted sum of its weighted terms) and then compared to the vector of the adjoining text unit using the cosine similarity measure. The average LSA similarity for a text is then the mean of these cosines over all pairs of adjoining text units.
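As an illustration, the sketch below computes this feature using scikit-learn’s TruncatedSVD over TF-IDF vectors as a stand-in for the LSA space; the actual LSA implementation and dimensionality used in our experiments may differ.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

def build_lsa(training_texts, dims=100):
    """Fit an LSA-like semantic space on the training essays (illustrative)."""
    vectorizer = TfidfVectorizer(stop_words="english")
    tfidf = vectorizer.fit_transform(training_texts)
    svd = TruncatedSVD(n_components=min(dims, tfidf.shape[1] - 1))
    svd.fit(tfidf)
    return vectorizer, svd

def average_adjacent_similarity(units, vectorizer, svd):
    """Average cosine similarity between each pair of adjoining text units."""
    if len(units) < 2:
        return 0.0
    vectors = svd.transform(vectorizer.transform(units))
    sims = [cosine_similarity(vectors[i:i + 1], vectors[i + 1:i + 2])[0, 0]
            for i in range(len(vectors) - 1)]
    return float(np.mean(sims))
```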

(4) Topic Development (TD)

Good essays should have a developed middle relevant to the assigned prompt. The following features are designed to capture how well-developed an essay is:

  • Topic-Density: Number of topics covered in the essay divided by the length of the essay. Higher Density means less development on each topic.

  • Before-only, After-only (i.e., before and after the UN-led intervention referenced in the source text): These are two binary features that indicate whether all the sentences in the essay are labeled only with “before” topics or only with “after” topics, respectively. A weak essay might, for example, discuss at length the condition of Kenya before the intervention (i.e., address several “before” topics) without referencing the result of the intervention (i.e., “after” topics).

  • Discourse markers: Four features that count the discourse markers from each of the four groups: contingency, expansion, comparison, and temporal, extracted by “AddDiscourse” connective tagger (Pitler and Nenkova 2009). Eight additional features represent count and percentage of discourse markers from each of the four groups that appear in sentences that are labeled with a topic.

  • Average chain size: Average number of nodes in chains. Longer chains indicate more development on each topic.

  • Number and percentage of chains with variety: A chain on a topic has variety if it discusses both aspects (‘before’ and ‘after’) of that topic.

(5) Topic Ordering and Patterns (TOP)

It is not just the number of topics and the amount of development on each topic that matter; more important is how students organize these topics in their essays. Logical and strategic organization of topics helps to strengthen arguments, while, as reflected in the rubric in Table 7, little or no order in the discussion of topics means poor organization. Here we present the features we designed to assess the quality of the essays in terms of the organization of topics.

  • Levenshtein edit-distance of the topic vector representations for “befores” and “afters”, normalized by the number of topics in the essay. If the essay has a good organization of topics, it should cover both the before and the after examples for each discussed topic, and it is also important that they come in a similar order. For example, suppose the following two vectors represent the order of topics in an essay: befores=[3,4,4,5], afters=[3,6,5]. First we compress the vectors by combining adjacent occurrences of the same topic; in this example topic number 4 is compressed, so the final vectors are befores=[3,4,5] and afters=[3,6,5]. The normalized Levenshtein distance between these two vectors is 1/4: the number of edits required to change one sequence into the other (here, one substitution), normalized by the number of distinct topics discussed in the essay (four). The greater the value, the worse the pattern of discussed topics (a code sketch following this list reproduces this example).

  • Max distance between chain’s nodes: Large distance can be a sign of repetition. The distance between two nodes is the number of text units between those nodes in the grid.

  • Number of chains starting and ending inside another chain: There should be fewer in well-organized essays.

  • Average chain length (Normalized): The length of the chain is the sum of the distances between each pair of adjacent nodes. The normalized feature is divided by the length of the essay.

  • Average chain density: Equal to average chain size divided by average chain length.

  • Topic transition probability: Transition probabilities are the proportions of topic transition types within a text. Transition types include {- -, -X, X-, XX}.
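The sketch below reproduces the normalized Levenshtein example from the first bullet above; the topic sequences are the ones used in that example.

```python
def compress(seq):
    """Collapse adjacent repetitions of the same topic."""
    out = []
    for topic in seq:
        if not out or out[-1] != topic:
            out.append(topic)
    return out

def levenshtein(a, b):
    """Classic edit distance between two topic sequences."""
    dist = [[i + j if i * j == 0 else 0 for j in range(len(b) + 1)]
            for i in range(len(a) + 1)]
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            dist[i][j] = min(dist[i - 1][j] + 1,          # deletion
                             dist[i][j - 1] + 1,          # insertion
                             dist[i - 1][j - 1] + (a[i - 1] != b[j - 1]))  # substitution
    return dist[len(a)][len(b)]

befores, afters = [3, 4, 4, 5], [3, 6, 5]
b, a = compress(befores), compress(afters)       # [3, 4, 5] and [3, 6, 5]
n_topics = len(set(befores) | set(afters))       # 4 distinct topics
feature = levenshtein(b, a) / n_topics           # 1 / 4 = 0.25
```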

The values of the features for the example essay with an Organization score of 4 in Table 1 are shown in Table 9.

Table 9 Feature vector representation of the high-scoring Organization essay from Table 1

Based on the defined features, we imagine generating feedback that helps students address criteria that received low scores. For example, if the value of the discourse structure feature beginning is false, the system could remind students to write a clear introductory sentence where they tell the reader whether or not they believe that the author provides a convincing argument that “winning the fight against poverty is achievable in our lifetime.”

Experiments and Results

Experimental Setup

We configure a series of experiments to test the validity of three hypotheses for the two dimensions. These hypotheses are designed to validate the usefulness of the model in terms of performance, generalizability of the model across different grades, and the utility of the rubrics for designing predictive features:


H1: the rubric-based models can match or even outperform competitive baselines,Footnote 11


H2: the rubric-based models generalize better across students from different gradesFootnote 12 (i.e., across our two datasets), and


H3: the more that cells in the rubric are covered by a feature group, the more predictive utility the feature group will have in isolation.

We use our two datasets in three different ways: 1) cross-validation on each dataset (to test H1 and H3), 2) combining the two datasets into one larger grades 5–8 dataset and performing cross-validation (also to test H1 and H3), and 3) training the model on one dataset and testing on the other (to test H2). The motivation for combining the two corpora into one bigger dataset is that, in a small pilot study (part of an additional experiment analyzing the impact of training sample size on the reliability of AES), we found that each doubling of the RTA training sample size increased Quadratic Weighted Kappa by 0.03.

For all experiments we use 10 runs of 10-fold cross-validation with Random Forest as the classifier (max-depth = 5). We also tried other classification and regression methods, such as Naive Bayes, logistic regression, and gradient boosting regression, and all the conclusions remained the same. Since our dataset is imbalanced, we use the SMOTE (Chawla et al. 2002) oversampling method, which creates synthetic minority-class examples by operating in feature space: the minority class is oversampled by taking each minority-class sample, randomly choosing among its k nearest neighbors, and introducing synthetic examples along the line segments joining the sample to these neighbors. We only oversample the training data, not the testing data.
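A minimal sketch of this setup, assuming the scikit-learn and imbalanced-learn libraries (the actual implementation may differ), is shown below; note that the synthetic oversampling is applied only within the training folds.

```python
import numpy as np
from imblearn.over_sampling import SMOTE
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold

def cross_validated_predictions(X, y, n_splits=10, seed=0):
    """X: numpy feature matrix, y: numpy array of scores (1-4)."""
    predictions = np.zeros_like(y)
    folds = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for train_idx, test_idx in folds.split(X, y):
        # Synthetic minority examples are generated from the training fold only.
        X_res, y_res = SMOTE(random_state=seed).fit_resample(X[train_idx], y[train_idx])
        model = RandomForestClassifier(max_depth=5, random_state=seed)
        model.fit(X_res, y_res)
        predictions[test_idx] = model.predict(X[test_idx])
    return predictions
```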

All performance measures are calculated by comparing the classifier results with the first human rater’s scores; we chose the first human rater because we do not have scores from the second rater for the entire dataset. We report performance as Quadratic Weighted Kappa, a standard evaluation measure for essay assessment systems, and use the corrected paired t-test (Bouckaert and Frank 2004) to measure the significance of any difference in performance.
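For reference, Quadratic Weighted Kappa can be computed, for example, with scikit-learn’s cohen_kappa_score using quadratic weights; the score vectors below are purely illustrative.

```python
from sklearn.metrics import cohen_kappa_score

y_rater1 = [1, 2, 3, 4, 2, 3, 1, 4]   # illustrative first-rater scores
y_pred   = [1, 2, 3, 3, 2, 4, 1, 4]   # illustrative system predictions

qwk = cohen_kappa_score(y_rater1, y_pred, weights="quadratic")
print(f"QWK = {qwk:.2f}")
```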

Baselines for Evidence

As a baseline we choose a unigram model. Unigrams are extracted and filtered down to the top 500 features by the chi-squared statistic, and a Random Forest model is then trained on the resulting feature set. We choose this baseline based on the results reported in Rahimi et al. (2014), which show that the unigram model is a well-performing baseline.
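A sketch of this baseline as a scikit-learn pipeline is shown below; this is an illustration of the described setup rather than our exact implementation, and it assumes the training data contains at least 500 distinct unigrams.

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.ensemble import RandomForestClassifier

unigram_baseline = Pipeline([
    ("unigrams", CountVectorizer(ngram_range=(1, 1), lowercase=True)),
    ("top500", SelectKBest(chi2, k=500)),        # keep the 500 best unigrams
    ("forest", RandomForestClassifier(max_depth=5, random_state=0)),
])

# essays_train / essays_test: lists of essay strings;
# scores_train: corresponding first-rater Evidence scores.
# unigram_baseline.fit(essays_train, scores_train)
# predicted = unigram_baseline.predict(essays_test)
```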

Baselines for Organization

We use two well-performing baselines from recent methods for evaluating the organization and coherence of essays. The first baseline (EntityGridTT) is based on the entity-grid coherence model introduced by Barzilay and Lapata (2005). This method has been used to measure the coherence of student essays (Burstein et al. 2010). It includes transition probabilities and type/token ratios for each syntactic role as features. We performed a set of experiments to find the best configuration,Footnote 13 which we then use in all experiments. It should be noted that this works to the advantage of the entity-grid baseline, since we do not tune parameters for the other models.

The second baseline (LEX1) is a set of features extracted from lexical chaining (Morris and Hirst 1991). We use the lexical chaining method of Galley and McKeown (2003) and extract the first set of features (LEX1) introduced in Somasundaran et al. (2014). We do not implement the second set because we do not have the annotation or the tagger needed to tag discourse cues.

Results and Discussion

Evidence

We first examine the hypothesis that our new features will outperform or at least perform as well as the baselines (H1). Our results support this hypothesis. Run 2 in Table 10 shows that the rubric-based model yields higher performance than the unigram baseline on all three datasets, although the difference is not significant on the 5–6 dataset. Comparing Run 4 with Runs 1 and 2 shows that adding unigrams to our rubric-based model does not improve our results, but adding the rubric-based features to the unigram model does improve performance. This shows that the rubric-based model contains information that is not captured by the unigram model and also captures much of what is already captured by the unigram baseline. The reason is that the NPE and Specificity features are designed to look for the existence and co-occurrence of the important unigrams. Runs 3 and 5 investigate the performance of our model without the fallback word count feature. The results show that our model still outperforms the unigram baseline, although not significantly, and that adding the rubric-based features (except word count) to the unigram baseline improves it significantly on two of the datasets.

Table 10 Cross-validated performance of our rubric-based Evidence model compared to the baseline on both datasets and a combination of the two datasets (5–8)

Looking at the features selected in our feature selection phase in Runs 4 and 5 shows that the rubric-based features (NPE, CON, and most of the Specificity features) are always among the top 500 selected features for all three datasets. Looking at the confusion matrix of the rubric-based model for grades 5–8, we notice that our model performs best on score 1 (F1 = 0.68) and worst on score 3 (F1 = 0.38); the F1 is 0.52 and 0.47 for scores 2 and 4, respectively. The F1 values are similar for all three datasets.

We configured another experiment to examine the generalizability of the models across different grades (H2). In this experiment, we used one dataset for model training and the other for testing. We divided the test data into 10 disjoint sets to be able to perform significance tests on the performance measure.Footnote 14 The results in Table 11 show that for both experiments, the rubric-based model performs significantly better than the baseline, which supports the findings from the cross-validation experiment and hypothesis H2. Comparing the Quadratic Weighted Kappa figures across columns, Models 1 and 3, which include the unigram features, perform better when the training set is bigger. The rubric-based model performs comparably even when we train on the smaller 6–8 dataset and test on the noisier 5–6 corpus. These results suggest that our features are more robust to both a lack of training data and training/test set differences.

Table 11 Performance of our rubric-based Evidence model compared to the baselines

Finally, our last hypothesis is that although each rubric-based feature group should capture useful information, the feature group designed to capture information about specific pieces of evidence, which covers more cells in the rubric, is the most important one. To test this hypothesis, we performed an experiment using each group of features in isolation. The results in Table 12 show that Specificity is the most predictive feature group in isolation. Specificity alone also almost matches or outperforms the unigram baseline, and approaches the performance of the full rubric-based model. The NPE rubric-based feature is also consistently more predictive than either the CON feature or the word count feature, which is not based on the rubric at all.

Table 12 Cross-validated performance evaluation of Evidence feature groups in isolation on the two datasets and their combination

To further investigate the effect of the word count feature, we perform an experiment in which we compare the performance of the rubric-based model with the same model after removing word count, predicting scores for different data subsets defined by Evidence scores: essays rated 1 or 2; essays rated 1, 2, or 3; essays rated 3 or 4; and the full dataset. The results are in Table 13. As can be seen, including word count only significantly improves performance for the [1,2,3] subset and the full [1,2,3,4] dataset, meaning it is useful for discriminating essays with scores of 3 and 4 from essays with scores of 1 and 2. Recall that our rubric-based features most sparsely cover the rows in the score 4 column of the Evidence rubric, where we would thus expect word count to play a fallback role.

Table 13 Cross-validated performance evaluation of the word count feature

In sum, our rubric-based model is advantageous compared to the unigram baseline: it yields higher performance than the unigram baseline on all three datasets (although not significantly higher on the 5–6 dataset); it contains information that is not captured by the unigram model; and its features are more robust to both a lack of training data and training/test set differences. As we hypothesized, the Specificity feature group, which is designed to capture information about specific pieces of evidence and covers more cells in the rubric, is the most important one. Word count is useful for discriminating essays with scores of 3 and 4 from essays with scores of 1 and 2 (recall that we most sparsely cover the rows in the score 4 column of the Evidence rubric, where we would thus expect word count to play a fallback role).

Organization

We first examine the hypothesis that the new features perform comparably or even better than the baselines (H1). The results on the corpus of grades 5–6 (see Table 14) show that the new features (Model 4) yield significantly higher performance than either baseline (Models 1 and 2) or the combination of the baselines (Model 3). The results of Models 5, 6, and 7 show that our new features capture information that is not in the baseline models (since each of these three models is significantly better than models 1, 2, and 3 respectively), but that the baseline features provide no value when added to the rubric-based features (since none of these three models is better than model 4). The best result in all experiments is bolded.

Table 14 Cross-validated performance of our rubric-based Organization model compared to the baselines on both datasets and their combination

We repeated the experiments on the corpus of grades 6-8. The results in Table 14 show that there is no significant difference between the rubric-based model and the baselines, except that in general, models that include lexical chaining features perform better than those with entity-grid features. Although not significant, the best result comes from adding the rubric-based features to the baseline features (Model 7).

The experiments on the combination of the two datasets show that our rubric-based model yields significantly higher performance than either baseline (Models 1 and 2) and is comparable to the combination of the baselines (Model 3). The results of Models 5, 6, and 7 show that our new features capture information that is not in the baseline models since each of these three models is significantly better than models 1, 2, and 3 respectively. The final conclusion is that the first hypothesis that the new features perform comparably or even better than the baselines is supported by the results.

Comparing the columns for Models 1, 2, 3, and 4 shows that the baseline models perform better on the 6–8 dataset, which has higher-quality essays than the 5–6 corpus, even though the 6–8 dataset is smaller. In contrast, the rubric-based model performs the same on both the 6–8 and the noisier 5–6 datasets, and its performance increases when we combine the two corpora.

Looking at the confusion matrices for all three datasets, the performance of our rubric-based model is worst on score 3. On the 5–8 dataset the F1 is 0.497, 0.504, 0.369, and 0.487 for scores 1 to 4, respectively. The F1 values are similar for all three datasets.

We configured another experiment to examine the generalizability of the models across different grades (H2). In this experiment, we used one dataset for model training and the other for testing, and divided the test data into 10 disjoint sets to be able to perform significance tests on the performance measure. The results in Table 15 show that for both experiments, the rubric-based model performs at least as well as the baselines. When training on grades 6-8 and testing the model on the shorter and noisier 5-6 set, the rubric-based model performs significantly better than the baselines. When testing on the 6-8 corpus, the rubric-based model performs better than the baselines (although not always significantly), and adding it to the baselines (Model 5) improves them significantly. Comparing the columns, all models perform better when the training set is bigger.

Table 15 Performance of our rubric-based Organization model compared to the baselines

As for our last hypothesis, we investigate the effect of the rubric-based feature groups in isolation. To do so, we repeated the cross-validated experiments using each group of features in isolation. The results in Table 16 show that Topic Development and Topic Ordering are the most predictive sets of features. This result supports the hypothesis, since these two feature groups cover more cells in the rubric. While the topic-based features may not be better than the baselines, they can be improved; one potential improvement is to enhance the alignment of sentences with their corresponding topics (since we currently use a very simple alignment model). Moreover, we believe that the topic ordering features are more substantive and potentially provide more useful information for students and teachers in downstream applications such as feedback and analytics.

Table 16 Cross-validated performance of Organization feature groups in isolation. The numbers in brackets show the size of the dataset in use

In sum, our rubric-based model has several advantages over the baseline models. First, it yields performance that is either significantly higher than or comparable to the baselines. Second, our new features capture information that is not in the baseline models. Third, the rubric-based model performs the same on both the 6–8 and the noisier 5–6 datasets, whereas the baseline models perform better on the higher-quality 6–8 essays than on the 5–6 corpus. Finally, the rubric-based model is tied to the rubric of the construct. Moreover, in cross-dataset experiments, the rubric-based model performs at least as well as the baselines, although all models perform better when the training set is bigger. Topic Development and Topic Ordering, which cover more cells in the rubric, are the most predictive sets of features.

Conclusion and Future Work

In this study, we attempt to measure two targeted constructs within analytic text-based writing: 1) students’ effective use of evidence and 2) their organization of ideas and evidence in support of their claim. We present results for predicting the scores of the Evidence and Organization dimensions of a response-to-text assessment in a way that aligns with the scoring rubric. We used two datasets of essays written by students in grades 5–6 and 6–8. We designed a set of features aligned with the rubric that we believe are meaningful and easy to interpret given the writing task. Our experimental results show that our task-dependent model (consistent with the rubric) performs as well as, if not better than, the baselines. We also show the potential generalizability of the rubric-based model by performing cross-corpus experiments. Finally, we show that the more a designed feature group covers criteria in the rubric, the more predictive utility the feature group generally has. In sum, our results provide support for all three of the hypotheses motivating our experiments.

There are several ways to improve our work. First, we plan to use a more sophisticated method to annotate text units; currently we use a simple window-based algorithm that looks for word overlaps, and in the future we will incorporate methods such as information retrieval, sentence similarity, and text similarity approaches based on word-embedding representations. Second, we are working towards replacing the manually extracted topics and examples with automatically extracted ones, as our current approach requires these to be manually defined by experts (although this task only needs to be done once for each new text and prompt). We proposed (Rahimi and Litman 2016) to use a data-driven model enabled by LDA topic modeling to automatically extract the topical components (i.e., topic words and significant N-grams (N=1) as examples for each topic) needed for our scoring approach, and our preliminary results are promising. Third, we will design and validate a system for providing automated formative feedback to students on their responses. We will investigate the extent to which our automated essay scoring system serves this purpose; specifically, we will build on our research to study the influence of the formative feedback generated by the AES system on the quality of students’ writing and teachers’ instruction. Fourth, we need to develop additional features to fully operationalize both the Evidence and Organization rubrics. Fifth, we have a new dataset from a second prompt which we will use to further test the generalizability of our model; we hypothesize that the same approach will work on the data for the new prompt. Finally, we need to tune the parameters that were chosen intuitively or set to default values.