Introduction

The 2010 Common Core State Standards for student learning emphasize the ability of students as early as fourth grade to construct essays in which they interpret and evaluate a text, construct logical arguments based on substantive claims, and marshal appropriate evidence in support of these claims (Correnti et al. 2013). The Response to Text Assessment (RTA) was developed for research purposes to assess skill at generating analytical text-based writing, and to provide an outcome measure that is independent of a state’s accountability test. Specifically, the RTA, unlike available large-scale assessments, is designed to evaluate the integration of reading comprehension and writing skills (Correnti et al. 2013). Our research takes a first step towards developing an automatic essay assessment system for the RTA. Our goal is to develop a tool that can further large-scale research on the impact of instruction, interventions, and policies that influence the development of this writing skill.

Because scoring text-based writing assessments is typically labor intensive and requires extensive training and expertise on the part of raters to obtain reliable scores, automated essay scoring has been proposed as a fast, effective, and affordable solution to the problem of assessing student writing at scale. For example, a recent contrastive analysis of 9 state-of-the-art systems on 8 essay scoring prompts drawn from high-stakes assessments claimed that Automated Essay Scoring (AES) systems had as high a level of agreement with human graders as human graders had with each other (Shermis and Hamner 2012). However, critics of AES argue that AES scores typically under-represent the construct of writing (Condon 2013; Perelman 2013) and even ardent supporters of AES acknowledge its limitations (Shermis and Hamner 2012; Deane 2013).

First, many essay assessment systems rely on holistic rather than trait-based rubrics (Attali and Burstein 2006; Elliot 2003; Page 2003; Attali et al. 2013), and thus tend to focus on summative rather than formative assessment. While holistic methods are typically more efficient and provide more reliable scores, trait-based methods are better at providing diagnostic insight on student performance (Bacha 2001; Weigle 2002). Such insight is particularly useful for systems that not only score but also provide formative feedback. Even when systems do trait-based scoring, critics maintain that trait-based AES has focused on surface dimensions of writing such as grammar rather than more substantive dimensions (Attali and Powers 2008; Perelman 2012). Our system for automatically scoring the RTA is trait-based rather than holistic, scores two of the RTA’s substantive writing traits (namely, Evidence and Organization), and is motivated by formative rather than summative assessment.

Second, in terms of writing tasks, most systems (whether holistic or trait-based) focus on assessing writing in response to open-ended prompts (Attali and Burstein 2006; Crossley et al. 2013; Elliot 2003; Lee et al. 2008; Page 2003; Klebanov and Higgins 2012) rather than in response to text. Such systems usually use generic rubrics rather than task-specific ones. One advantage of task-dependent rubrics is the ability to provide feedback that is better aligned with the task. Existing systems also do not explicitly evaluate the quality of reasoning based on information from only the text, and instead evaluate dimensions such as structure, elaboration, and vocabulary sophistication (Shermis and Burstein 2003). Our system for automatically scoring the RTA focuses on assessing writing in response to text using task-dependent rubrics.

Third, in terms of scoring method, many AES systems do not consider construct validity (Condon 2013; Perelman 2012). Existing AES systems are limited in their evaluation of higher-order aspects of writing, such as the quality of content and its organization. For example, AES achieves high reliability in the evaluation of content and ideas mostly by using “bag-of-words” approaches that bear little relationship to the scoring rubric for the construct (Landauer et al. 1998; Attali and Burstein 2006; Attali 2011). In contrast, our model for automatically scoring the RTA is consistent with the rubric criteria and easily explainable. Others in the AES community similarly argue that automated scoring models should reflect important aspects of the construct being measured, following common practice in the measurement community. That is, dimensions of the construct should be well represented by the features used in the scoring model, and the features contained in the model should not be irrelevant to the rubric for the construct (Loukina et al. 2015). A model with construct validity has greater potential to generate useful formative feedback to students and teachers.

Finally, current AES systems typically score writing that is generated by upper middle-school, secondary, post-secondary students, or by adults for a high-stakes exam (Burstein et al. 1999; Deane et al. 2013; Klebanov and Higgins 2012). For example, the sample of essays in the contrastive analysis of Shermis and Hamner (2012) described above was from Grades 7, 8, and 10. Our work, in contrast, focuses on writing in Grades 5 through 8, which poses challenges for existing AES methods as RTA essays are typically shorter, contain more grammatical and spelling errors, and are less sophisticated in terms of use and organization of evidence. Our work thus tackles the challenge of using computational techniques on data that is particularly noisy given the stage of writing development of the students.

In the following sections, we first review previous research on this topic. Next, we describe the data, the rubric dimensions, and the prompt used in our study. Then, we explain the two models we designed to extract features for the Evidence and Organization dimensions of our rubric. Next, we discuss the experiments and results. Finally, we recap our conclusions and discuss future work. Our results show that, in general, our rubric-based task-dependent model performs as well as (if not better than) the rigorous baselines we used. Moreover, the combination of our new features with the baseline features often yields better results than either the proposed or baseline features in isolation. Both within-corpus and cross-corpus experiments yield similar conclusions, supporting the robustness of our approach. Finally, feature ablation studies suggest that feature utility is related to rubric coverage.

Related Work

Natural Language Processing techniques have been used to evaluate both the content and organization of writing. One approach to evaluating the content of student essays is to detect whether they are off-topic or on-topic (Louis and Higgins 2010; Higgins et al. 2006). Adherence to the prompt (Persing and Ng 2014) is another way to measure text topicality. Yet another approach to estimating the quality of content is to compare the essay to sets of training essays with different scores (Attali and Burstein 2006; Kakkonen et al. 2005; Xie et al. 2012). These prior studies differ from our response-to-text task in that they do not target source-based writing, in which the quality of content should be measured with regard to how the essays use the source material.

Source-based writing refers to types of writing that require students to generate responses that are based on and that reference one or multiple source text(s). Generally, responses are expected to demonstrate close reading and deep comprehension of texts through effective use of evidence from the source text(s). For example, having read a novel, students might be asked to analyze the main theme, providing evidence from the novel. Or, having read two articles representing opposing viewpoints on a topic, students might be asked to write an opinion or argumentative essay in which they use points from the text to support their claim or rebut the opposing perspective. Professional standards for literacy in K-12 education are increasingly emphasizing such source-based writing (e.g., NCTE/IRA, 2012; NGAC/CCSSO, 2011).Footnote 1

In contrast, quality of content is evaluated with regard to integrating information from the source materials in Kakkonen et al. (2005) and Lemaire and Dessus (2001). These studies also differ from our task. In our work, we care about the pieces of evidence that students provide from the source material, so the task goes beyond simply deciding whether the essay is semantically similar to the source material. To score essays based on the criteria in the rubric and to be able to provide feedback based on the detailed information in the essays, we need to localize pieces of evidence. With an ultimate goal of scoring essays, Klebanov et al. (2014) evaluated different content importance models that help predict which parts of the source material should be selected by the students. That study follows a similar direction to our preliminary study (Rahimi and Litman 2016) on automatically extracting important pieces of evidence from the source material.

Another related area of research is to first find argumentation components using argumentation mining techniques, and then use the results of argumentation mining for scoring the essays (Ong et al. 2014; Burstein et al. 2003a; Song et al. 2014; Persing and Ng 2015). Argumentation mining in the domain of essay evaluation is mostly applied to persuasive essay corpora written in response to a prompt (Stab and Gurevych 2014a, 2014b) rather than to source-based writing. Similarly, the definition of Evidence in our task is tied to the source material and differs from more general definitions of Evidence, Premise, etc. in persuasive essays. Another difference from prior work is that in our study the essays are written by young students, and we do not expect them to follow a sophisticated argumentation structure.

As a construct, ‘Organization’ has figured in systems for scoring student writing for decades. When organization is treated as a separate dimension, it is often measured by surface features such as effective sequencing, a strong inviting beginning, a strong satisfying conclusion, and smooth transitions.Footnote 2 Assessments aligned to the Common Core State Standards (CCSS), the academic standards adoptedFootnote 3 widely in 2011 that guide K-12 education, reflect a shift in thinking about the scoring of organization in writing to consider the coherence of ideas in the text.Footnote 4 The consideration of idea coherence as a critical aspect of the organization of writing is relatively new.

Notably, prior studies in natural language processing have examined the concept of discourse coherence, which is highly related to the coherence of topics in an essay, as a measure of the organization of analytic writing. For example, in Somasundaran et al. (2014) the coherence elements are adherence to the essay topic, elaboration, usage of varied vocabulary, and sound organization of thoughts and ideas. In Scott and McNamara (2011) the elements are effective lead, clear purpose, clear plan, topic sentences, paragraph transitions, organization, unity, perspective, conviction, grammar, syntax, and mechanics.

Many computational methods have been used to measure such elements of discourse coherence. Vector-based similarity methods measure lexical relatedness between text segments (Foltz et al. 1998) or between discourse segments (Higgins et al. 2004). Centering theory (Grosz et al. 1995) addresses local coherence (Miltsakaki and Kukich 2000). Entity-based essay representation along with type/token ratios for each syntactic role is another method to evaluate coherence (Burstein et al. 2010), shown in Burstein et al. (2013) to be a predictive model on a corpus of essays from grades 6-12. Lexical chaining addresses multiple aspects of coherence such as elaboration, usage of varied vocabulary, and sound organization of thoughts and ideas (Somasundaran et al. 2014). Discourse structure has been used to measure the organization of argumentative writing (Cohen 1987; Burstein et al. 1998, 2003b). All of these works rely on lexical information to measure coherence. In contrast, our proposed model uses more coarse-grained topic information. Based on the rubric, we are interested in localizing the pieces of evidence for different topics in essays and evaluating the transitions between these topics. For this purpose, we propose the concepts of the topic-grid and the topic-chain.

In previous studies, assessments of text coherence have been task-independent, meaning that these models are designed to evaluate the coherence of a response to any writing task. Task-independence is often the goal for automated scoring systems, but it is also important to measure the quality of students’ organization skills when they are responding to a task-dependent prompt. One advantage of task-dependent scores is the ability to provide feedback that is better aligned with the task. Our model for evaluating the Organization dimension is task-dependent: it is designed around the detailed criteria in the rubric and makes use of the source material by evaluating the transitions among important topics and pieces of evidence adopted from the source.

Our preliminary studies addressing the task-dependent automatic scoring of both the Evidence (Rahimi et al. 2014) and Organization (Rahimi et al. 2015) dimensions of the RTA were motivated by the differences from prior work discussed above. Our initial method for localizing and analyzing the quality of Evidence in source-based writing was presented in Rahimi et al. (2014) and evaluated on a corpus of essays from grades 5–6. Here we extend this earlier work by taking advantage of a second corpus of essays from grades 6–8 (obtained from a different school district) to conduct new types of evaluations, such as cross-validation within each corpus separately and combined, and cross-corpus training versus testing. We also address an unbalanced score distribution issue that occurs in both corpora using an oversampling method, and conduct new feature ablation studies. Our initial method for analyzing the organization of ideas and evidence in source-based writing was presented in Rahimi et al. (2015). Motivated by the desire to experiment on a larger corpus, in the current paper we conduct several new evaluations that combine our two available datasets from different grades and schools into a third, larger corpus.

Data

Our data consists of students’ writings from the RTA introduced in Correnti et al. (2013). Specifically, we have datasets from two different age groups (grades 5–6 and grades 6–8) which represent different levels of writing proficiency. The two datasets are also from two different school districts.

The administration of the RTA involved having the classroom teacher read aloud a text while students followed along with their own copy. The text is an article from Time for Kids about a United Nations effort (the Millennium Villages Project) to eradicate poverty in a rural village in Kenya. After a guided discussion of the article as part of the read-aloud, students wrote an essay in response to a prompt that requires them to make a claim and support it using details from the text. A small excerpt from the article, the prompt, and three student essays from grades 5–6 are shown in Table 1.

Table 1 A small excerpt from the Time for Kids article, the prompt, and sample low and high-scoring essays with supporting evidence in bold from grades 5–6

Our datasets (particularly responses by students in grades 5-6) have a number of properties that may increase the difficulty of the automatic essay assessment task. The essays in our datasets are shortFootnote 5 and have many spellingFootnote 6 and grammatical errors. Some statistics about the datasets are in Table 2. On average the essays in the 6–8 dataset are longer than essays in the 5–6 dataset. They have more unique words and longer sentences.

Table 2 The two datasets’ statistics

The student responses have been assessed on five dimensions (Analysis, Evidence, Organization, Style/Vocabulary, and MUGS (mechanics/usage/grammar/syntax)), each on a scale of 1-4 (Correnti et al. 2013). The Analysis dimension concerns addressing the prompt, understanding the text, and drawing insightful and clear conclusions. The Evidence dimension relates to demonstrating integral use of selected details from the text to support the claim. The Organization rubric concerns the clear structure of the essay and the logical flow of ideas. The Style rubric addresses the use of sophisticated language and vocabulary. Finally, MUGS concerns errors in mechanics, usage, grammar, and syntax. The standards stay fixed across grade levels (and thus across the datasets). Half of the assessments were scored by an expert. The rest were scored by undergraduate students trained to evaluate the essays based on the criteria. All raters were blind to the grades to which the essays belonged. The corpus from grades 5–6 consists of 1569 essays, with 602 of them double-scored for inter-rater reliability. The other corpus includes 809 essays, with almost all of them (802) double-scored (9 of these essays do not have a score for the Evidence dimension). Inter-rater agreement (Quadratic Weighted Kappa) on the double-scored portions of the grades 5-6 and 6-8 corpora is 0.67 and 0.73, respectively, for Evidence and 0.68 and 0.69 for Organization.

The correlations between the Organization and Evidence scores (for rater 1) for the 5-6 and 6-8 corpora are (Pearson = 0.55, Spearman = 0.54) and (Pearson = 0.50, Spearman = 0.48), respectively, with all p-values = 0.0001. It is possible for an essay to score well on one dimension but poorly on the other, although it is more common to have a good Organization score but a poor Evidence score than vice versa. As shown in Table 3, in the 5–6 dataset there are 48 essays with a poor Organization score but a good Evidence score and 65 essays vice versa (the upper right and middle left triangles, respectively). In the 6–8 dataset there are only 8 essays with a poor Organization score but a good Evidence score and 79 essays vice versa (the middle right and lower left triangles, respectively).

Table 3 The distribution of the Evidence and the Organization scores with respect to each other on the two datasets

In this paper we focus only on predicting the scores of the Evidence and Organization dimensions,Footnote 7 which are the two dimensions most related to argumentation. The distributions of the Evidence and Organization scores are shown in Table 4. Higher scores on the 6–8 corpus indicate that the essays in this dataset are of better quality in terms of Evidence and Organization than the student essays in the 5–6 dataset. The rubrics for the Evidence and Organization dimensions are shown in Tables 5 and 7, respectively.

Table 4 The distribution of the Evidence and the Organization scores on the two datasets
Table 5 Rubric for the Evidence dimension of RTA
Table 6 Feature vector representation of the high and low-scoring Evidence essays from Table 1
Table 7 Rubric for the Organization dimension of RTA

Modeling the Source Article

To build both the Evidence and Organization models, we use the information in the source “Time For Kids” text, for which an exhaustive list of topics, important topic words, and examples was provided manually by experts (see Tables 17, 18, and 19 in the Appendix). Similarly, in other studies on the evaluation of content (typically in short answer scoring), the identification of concepts and topics is often manual (Liu et al. 2014). First, experts provide a list of important words for each of the main topics in the article (Table 17). Second, the experts provide a comprehensive list of topics which includes every specific example from the text related to each topic (Table 19). Since the source text explicitly addresses the conditions in a Kenyan village before and after the United Nations intervention, and since the prompt leads students to discuss the contrasting conditions at these different time points, the topics provide evidence for the “before” and “after” states. That is, except for some topics which do not have a temporal aspect, for each major topic t the experts define two sub-topics t_before and t_after by listing specific examples related to each sub-topic. Finally, the experts remove the temporal aspect of topics from the comprehensive list of examples by merging the “after” states into a single “Topic7”, which is about the progress made in the village as it was originally represented in the article (Table 18). This is because in modeling the Evidence dimension, we do not care about the temporal aspect of the topics.

Modeling the Evidence Dimension

Introduction to Evidence Rubric

The Evidence rubric (see Table 5) takes into account four criteria related to the quality of text evidence provided in the response. First, we consider the number of pieces of evidence used. More evidence (i.e., more than three pieces) is scored higher. Second, we consider the relevance of the evidence to the central idea. Writing that includes cogent evidence is scored high, while writing that provides irrelevant details is scored low. The third criterion is the specificity of the evidence provided. Writing that features detailed, specific evidence is scored high, while responses that feature cursory, general references are scored low. Finally, the extent to which the evidence is elaborated upon is considered. Strong responses feature evidence that helps support and develop the main idea. Evidence is weak when it is just presented as a short phrase or listed in a sentence. The rubric also notes that when the response features a summary of the whole text or directly copies from the source text, it automatically receives a score of 1.

Features to Model the Rubric

As discussed above, one goal of our research in predicting scores is to design a small set of rubric-based, meaningful features that performs acceptably and also models what is actually important in the rubric. To this end, we designed several groups of features, each primarily addressing one criterion in the rubric. Below, we explain each group of features and its relation to the rubric. Each group of features is indicated with an abbreviation that relates it to the corresponding criterion in the rubric in Table 5.

(1) Number of Pieces of Evidence (NPE)

addresses the first row of the rubric, e.g., if there are fewer than 2 pieces of evidence, score the essay as 1. For calculating NPE, we use manually provided topics in Table 17. Any information in the essays that is related to these text-based topics will be considered as a piece of evidence. We use a simple window-based algorithm with fixed window-sizeFootnote 8 to calculate NPE. A window contains evidence related to a topic if there are at least two words from the list of words for that topic.Footnote 9 Each topic is only counted as a piece of evidence once to avoid redundancy.

(2) Concentration (CON)

If the essay consists of a brief, non-specific list of different pieces of evidence without any elaboration, it has a high concentration and should receive a score of 1 or 2. We define concentration as a binary feature that indicates whether the essay is highly concentrated. To calculate this feature, we count the number of sentences that contain at least one topic word. If there are fewer than three such sentences, the concentration is high, meaning that the topic words are spread over only a few sentences; elaborated evidence, in contrast, should be reflected in at least three sentences containing topic words.

(3) Specificity (SPC)

High-quality evidence includes specific examples from different parts of the text, or an explanation of why the evidence is important. We use the manually provided list of topics and examples in Table 18. For each of the examples we need to answer the question of whether the student talked about this specific example or not. The specificity feature is therefore a vector of integer values, where each value is the number of examples from the text mentioned in the essay for a single topic. We use the same window-based algorithm that we use for NPE to calculate each value of the vector.

(4) Word Count (WOC)

is used as a fallback feature because our features do not yet completely cover all rubric cells, and in prior work and in our own data, longer essays tend to receive higher scores.
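To make the feature computation concrete, the following minimal sketch (in Python) implements the window-based features described above. The TOPIC_WORDS and TOPIC_EXAMPLES dictionaries are illustrative placeholders rather than the actual expert-provided lists in Tables 17 and 18, and the window size and overlap thresholds shown here are assumptions for illustration only.

```python
import re

# Illustrative stand-ins for the expert-provided topic words (Table 17) and
# the merged topic/example list (Table 18); the real lists are larger.
TOPIC_WORDS = {
    "hospital": {"hospital", "medicine", "malaria", "sick", "treated"},
    "school":   {"school", "students", "supplies", "meals"},
    "farming":  {"crops", "fertilizer", "irrigation", "harvest"},
    "progress": {"poverty", "progress", "achievable", "lifetime"},
}
TOPIC_EXAMPLES = {
    "hospital": [{"afford", "medicine"}, {"free", "medicine", "diseases"}],
    "progress": [{"winning", "fight", "poverty", "achievable"}],
}

WINDOW_SIZE = 10  # assumed here; the value actually used is given in a footnote

def tokens(text):
    return re.findall(r"[a-z']+", text.lower())

def windows(words, size=WINDOW_SIZE):
    for i in range(max(1, len(words) - size + 1)):
        yield set(words[i:i + size])

def npe(essay):
    """Number of Pieces of Evidence: topics with >= 2 topic words in some window."""
    found = set()
    for window in windows(tokens(essay)):
        for topic, topic_words in TOPIC_WORDS.items():
            if len(window & topic_words) >= 2:
                found.add(topic)            # each topic is counted only once
    return len(found)

def con(essay):
    """Concentration: 1 if fewer than three sentences contain a topic word."""
    all_topic_words = set().union(*TOPIC_WORDS.values())
    sentences = re.split(r"[.!?]+", essay)
    covered = sum(1 for s in sentences if set(tokens(s)) & all_topic_words)
    return int(covered < 3)

def spc(essay):
    """Specificity: per topic, the number of listed examples mentioned in the essay."""
    essay_windows = list(windows(tokens(essay)))
    return {
        topic: sum(
            1 for example in examples
            if any(len(w & example) >= 2 for w in essay_windows)
        )
        for topic, examples in TOPIC_EXAMPLES.items()
    }

def woc(essay):
    """Word count, used as a fallback feature."""
    return len(tokens(essay))
```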

The values of the features for the example essays with Evidence scores of 4 and 1 in Table 1 are shown in Table 6. For the high-scoring essay, the value of the NPE feature is four because all four topics in Table 17 are mentioned in the essay. The essay is not concentrated (CON=0) because there are more than three sentences with topic words in the essay. The Specificity values for topics three and eight are 1 because of these two pieces of evidence, respectively: couldn’t afford medicine and winning the fight of poverty is achievable, which are bolded in Table 1.

For the low-scoring essay, the value of the NPE feature is one because only one topic from Table 17 is mentioned in the essay. The essay is concentrated (CON=1) because there are fewer than three sentences with topic words in the essay. The Specificity values for topics three and four are 1 because of adults or kids and kids are dieing every day, respectively. The value for topic eight is 1 because of winning the fight of poverty is achievable; all of these phrases are bolded in Table 1.

Based on the defined features, we imagine generating feedback that points students to alternative sources of evidence, that highlights the need to elaborate on the included evidence, or that suggests that students be more specific in their usage of evidence. For example, a student could be given feedback such as “You provided evidence about malaria as a condition of poverty that was improved, but there is other relevant evidence in the text that you also need to focus on, such as the lack of fertilizer for crops.” For teachers, we envision providing summary information such as students’ weakness in elaborating on the evidence they provided.

Modeling the Organization Dimension

Introduction to Organization Rubric

The Organization dimension rubric in Table 7 consists of four main criteria that relate to how, and to what extent, students present their ideas in an organized and logical way. The first criterion is Adherence to Main Idea. This concerns the extent to which the written response focuses clearly on a key idea. Weakly organized responses often stray from the intended main idea. The second criterion is Sense of Beginning-Middle-End. Here, the expectation is that strong writing has easily identifiable sections, often signalled by introductory and concluding paragraphs and sentences. Such elements are lacking in weak writing. Third, organization concerns the clarity with which ideas are presented. One idea should be addressed before another is brought up. Ideally, too, different ideas should be treated in different paragraphs. Weakly organized writing treats ideas in little or no discernible order. The fourth criterion concerns sentence-to-sentence flow. In strong writing, this flow is logical and seamless; in contrast, weak writing may sound rambling. Finally, the rubric makes note of a special rule: when the response consists mostly of a summary or word-for-word copying of the text, it automatically receives a score of 1, because the organization of the response is necessarily the organization of the original text and does not reflect the student’s own efforts at organization.

Topic-Grid and Topic-Chains

Lexical chains (Somasundaran et al. 2014) and entity grids (Burstein et al. 2010) have been used to measure lexical cohesion. In other words, these models measure the continuity of lexical meaning. Lexical chains are sequences of related words characterized by the relation between the words, as well as by their distance and density within a given span. Entity grids capture how the same word appears in a syntactic role (Subject, Object, Other) across adjacent sentences.

Intuitively, we hypothesize that these models will not perform as well on short, noisy, and low quality essays as on longer, better written essays. When the essays are short, noisy, and of low quality (i.e., limited writing proficiency), the syntactic information produced automatically by the parser may not be reliable. Moreover, even when there is elaboration on a single topic (continuation of meaning), there may not be repetition of identical or similar words. This is because words that relate to a given topic in the context of the article may not be deemed similar according to external similarity sources such as WordNet. Take, for example, the following two sentences:

“The hospitals were in bad situation. There was no electricity or water.”

In the entity grid model, there would be no transition between these two sentences because there are no identical words. The semantic similarity of the nouns “hospitals” and “water” is very low and there would not be any chain including a relation between the words “hospitals”, “water”, and “electricity”. But if we look at the source document and the topics within it, these two sentences are actually addressing a very specific sub-topic. Therefore, we think there should be a chain containing both of these words and a relation between them. Zhang et al. (2015) addresses a similar issue of capturing information from semantically related entities by leveraging world knowledge such as “Gates is the person who created Microsoft”.

More importantly, what we are really interested in evaluating in this study is the organization and cohesion of pieces of evidence, not lexical cohesion. Taken together, these reasons motivated us to design new topic-grid and topic-chain models (inspired by entity-grids and lexical chains), which are more closely related to our rubric and may be able to overcome the issues mentioned above.

A topic-grid is a grid that shows the presence or absence of each topic addressed in the source text (i.e., the article about poverty) in each text unit of a written response. The rows are analogous to the words in an entity-grid, except here they represent topics instead of individual words. The columns are text units. We consider the unit as a sentence or a sub-sentence (since long sentences can include more than one topic and we don’t want to lose the ordering and transition information from one topic to the next). We explain how we extract the units later in this section.

To build the grids, we use the information in the source text. That is, we use the manually extracted exhaustive list of topics in Table 19, which preserves the temporal aspect discussed in the article. Each text unit of the essay is then automatically labeled with topics using a simple window-based algorithm (with a fixed window size of 10), which relies on the presence and absence of topic words in a sliding window and chooses the topic most similar to the window. (Several equally similar topics might be chosen.) If there are fewer than two words in common with the most similar topic, the window is annotated with no topic. We did not use spelling correction to handle topic words with spelling errors, although this is part of our future plans.

Each column in the grid represents a text unit. A text unit is a whole sentence if the sentence has no disjoint windows annotated with different topics or with different examples from a topic. Otherwise, we break the sentence into multiple text units, each covering a different topic or example (the exact boundaries of the units are not important). Finally, if the labeling process annotates a single window with multiple topics, we add a column to the grid with multiple topics present in it.
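The sketch below illustrates the window-based labeling step under stated assumptions: the before/after topic word lists are illustrative placeholders (not the actual contents of Table 19), and the splitting of sentences into text units is only hinted at in a comment.

```python
import re

# Illustrative stand-ins for the temporally split topic word lists (Table 19).
TOPIC_WORDS = {
    "hospitals_before": {"hospital", "sick", "afford", "cost", "treated"},
    "hospitals_after":  {"free", "medicine", "common", "diseases"},
    "schools_before":   {"school", "fees", "supplies"},
    "schools_after":    {"school", "midday", "meals", "free"},
}

WINDOW_SIZE = 10  # fixed sliding-window size used for labeling

def tokens(text):
    return re.findall(r"[a-z']+", text.lower())

def label_windows(sentence):
    """Label each sliding window with the most similar topic(s), if any.

    A window is labeled with the topic(s) that share the most words with it,
    provided the overlap is at least two words; otherwise it receives no label.
    """
    words = tokens(sentence)
    labels = []
    for i in range(max(1, len(words) - WINDOW_SIZE + 1)):
        window = set(words[i:i + WINDOW_SIZE])
        overlap = {t: len(window & tw) for t, tw in TOPIC_WORDS.items()}
        best = max(overlap.values())
        # several equally similar topics might be chosen for the same window
        labels.append({t for t, o in overlap.items() if o == best and best >= 2})
    return labels

def sentence_topics(sentence):
    """Union of the topics found in a sentence's windows.

    If disjoint windows carry different topics (or different examples of a
    topic), the sentence would be split into multiple text units, each of
    which becomes one column of the topic-grid.
    """
    covered = set()
    for labels in label_windows(sentence):
        covered |= labels
    return covered

# e.g., sentence_topics("they now give free medicine to most common deseases")
# returns {"hospitals_after"} with the illustrative word lists above.
```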

See Table 8 for an example of a topic-grid for the essay with the Organization score of four in Table 1. Consider the following sentence from the essay:

“One example they sued show a great amount oF change when they stated at first most people thall were ill just stayed in the hospital Not even getting treated either because of the cost or the hospital didnt have it, but at the end it stated they now give free medicine to most common deseases”

Table 8 The topic-grid (on the left) and topic-chains (on the right) for the example essay with Organization score of 4 in Table 1

This sentence has two disjoint windows annotated with different topics, so we break it into two text units that cover the two different topics “Hospitals_before” and “Hospitals_after”. The first part of the sentence is a unit that covers “Hospitals_before” because of a window including “Not even getting treated”. The second text unit covers “Hospitals_after” because of a window including “free medicine to most common deseases”. The third column in the grid represents the second unit of this sentence, which is underlined. The “x” in the third column indicates the presence of the topic “Hospitals_after” mentioned above. Topics that are not mentioned in the essay are not included in the grid.

Then, chains are extracted from the grid. We have one chain for each topic t, including both t_before and t_after. Each node in a chain carries two pieces of information: the index of the text unit it appears in and whether it represents a before or after state. Because transitions between temporally oriented topics are the point of interest in designing topic-chains, we ignore topics that do not have a temporal aspect (a before or after state). Examples of topic-chains are presented in Table 8. Finally, we extract several features, explained below, from the grid and the chains to represent criteria from the rubric.
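The following sketch illustrates how chains might be extracted from an already-built grid; the grid is represented here simply as a list of per-unit topic-label sets, and its contents are illustrative.

```python
from collections import defaultdict

def extract_chains(grid):
    """One chain per temporally oriented topic: a list of (unit_index, phase) nodes.

    The grid is a list of columns (text units); each column is the set of
    topic labels present in that unit, assumed to end in "_before" or "_after".
    """
    chains = defaultdict(list)
    for unit_index, labels in enumerate(grid):
        for label in labels:
            topic, _, phase = label.rpartition("_")
            if phase in ("before", "after"):   # topics with no temporal aspect are ignored
                chains[topic].append((unit_index, phase))
    return dict(chains)

# Illustrative grid with four text units:
grid = [
    {"hospitals_before"},
    {"schools_before"},
    {"hospitals_after"},
    {"hospitals_after", "schools_after"},
]
chains = extract_chains(grid)
# chains["hospitals"] == [(0, "before"), (2, "after"), (3, "after")]
# A chain has "variety" if it contains both a before and an after node:
variety = {topic: len({phase for _, phase in nodes}) == 2
           for topic, nodes in chains.items()}
```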

Features to Model the Rubric

As indicated above, one goal of this research in predicting Organization scores is to design a small set of rubric-based features that performs acceptably and also models what is actually important in the rubric. To this end, we designed 5 groups of features, each addressing criteria in the rubric. Some of these features are not new and have been used before to evaluate the organization and coherence of the essay; however, the features based on the topic-grid and topic-chains (inspired by entity-grids and lexical chains) are new and designed for this study. The use of before and after information to extract features is based on the rubric and the nature of the prompt.Footnote 10 Below, we explain each of the features and its relation to the rubric. Each group of features is indicated with an abbreviation that relates it to the corresponding criteria in the rubric in Table 7.

(1) Surface (SUR)

captures the surface aspect of organization; it includes two features: number of paragraphs and average sentence length. Multiple paragraphs and medium-length sentences help readers follow the essays more easily.

(2) Discourse Structure (DIS)

investigates the discourse elements in the essays. We cannot expect the essays written by students in grades 5-8 to have all the discourse elements mentioned in Burstein et al. (2003a), as might be expected of more sophisticated writers. Indeed, most of the essays in our corpora are short and single-paragraph (the median number of paragraphs is one). In terms of structure, then, taking cues from the rubric, we are interested in the extent to which an essay has a clear beginning idea, a concluding sentence, and a well-developed middle. We define two binary features, beginning and ending. In the topic list, there is a general topic that represents general statements from the text and the prompt. If this topic is present at the beginning or at the end of the grid, the corresponding feature gets a value of 1. A third feature measures whether the beginning and the ending match. We measure the LSA similarity (Landauer et al. 1998) of 1 to 3 sentences from the beginning and the ending of the essay, with respect to the length of the essay; the number of sentences is chosen based on the average essay length. The LSA space is trained on the source document and the essays in the training corpus.

(3) Local Coherence and Paragraph Transitions (LCPT)

Local coherence addresses the rubric criterion related to logical sentence-to-sentence flow. It is measured by the average LSA (Foltz et al. 1998) similarity of adjacent sentences. Paragraph transitions capture the rubric criterion of discussing different topics in different paragraphs, measured by the average LSA similarity of all paragraphs (Foltz et al. 1998). For an essay in which each paragraph addresses a different topic, the LSA similarity of paragraphs should be lower than for an essay in which the same topic appears in different paragraphs. For one-paragraph essays, we divide the essay into three equal parts and calculate the similarity of these parts.

The average LSA similarity of text units (sentences or paragraphs) is calculated as follows. A semantic space is constructed based on the essays in the training set. The vector for each text unit is computed (as the weighted sum of its weighted terms) and then compared to the vector of the adjoining text unit using the cosine similarity measure. The average LSA similarity for a text is then the mean of these cosines over all pairs of adjoining text units.
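As an illustration, the sketch below computes this feature using scikit-learn’s TruncatedSVD over TF-IDF vectors as a stand-in for the LSA space; the actual LSA implementation and dimensionality used in our experiments may differ.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

def build_lsa(training_texts, dims=100):
    """Fit an LSA-like semantic space on the training essays (illustrative)."""
    vectorizer = TfidfVectorizer(stop_words="english")
    tfidf = vectorizer.fit_transform(training_texts)
    svd = TruncatedSVD(n_components=min(dims, tfidf.shape[1] - 1))
    svd.fit(tfidf)
    return vectorizer, svd

def average_adjacent_similarity(units, vectorizer, svd):
    """Average cosine similarity between each pair of adjoining text units."""
    if len(units) < 2:
        return 0.0
    vectors = svd.transform(vectorizer.transform(units))
    sims = [cosine_similarity(vectors[i:i + 1], vectors[i + 1:i + 2])[0, 0]
            for i in range(len(vectors) - 1)]
    return float(np.mean(sims))
```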

(4) Topic Development (TD)

Good essays should have a developed middle relevant to the assigned prompt. The following features are designed to capture how well-developed an essay is:

  • Topic-Density: Number of topics covered in the essay divided by the length of the essay. Higher Density means less development on each topic.

  • Before-only, After-only (i.e., before and after the UN-led intervention referenced in the source text): These are two binary features that indicate whether all the sentences in the essay are labeled only with “before” topics or only with “after” topics, respectively. A weak essay might, for example, discuss at length the condition of Kenya before the intervention (i.e., address several “before” topics) without referencing the result of the intervention (i.e., “after” topics).

  • Discourse markers: Four features that count the discourse markers from each of the four groups: contingency, expansion, comparison, and temporal, extracted by “AddDiscourse” connective tagger (Pitler and Nenkova 2009). Eight additional features represent count and percentage of discourse markers from each of the four groups that appear in sentences that are labeled with a topic.

  • Average chain size: Average number of nodes in chains. Longer chains indicate more development on each topic.

  • Number and percentage of chains with variety: A chain on a topic has variety if it discusses both aspects (‘before’ and ‘after’) of that topic.

(5) Topic Ordering and Patterns (TOP)

It is not just the number of topics and the amount of development on each topic that matter; more important is how students organize these topics in their essays. Logical and strategic organization of topics helps to strengthen arguments, while, as reflected in the rubric in Table 7, little or no order in the discussion of topics means poor organization. Here we present the features we designed to assess the quality of the essays in terms of the organization of topics.

  • Levenshtein edit-distance of the topic vector representations for “befores” and “afters”, normalized by the number of topics in the essay. If the essay has a good organization of topics, it should cover both the before and the after examples for each discussed topic, and it is also important that they come in a similar order. For example, suppose the following two vectors represent the order of topics in an essay: befores=[3,4,4,5], afters=[3,6,5]. First we compress the vectors by combining adjacent occurrences of the same topic; in this example topic number 4 is compressed, so the final vectors are befores=[3,4,5] and afters=[3,6,5]. The normalized Levenshtein distance between these two vectors is 1/4: the number of edits required to change one sequence into the other (here, one substitution), normalized by the number of distinct topics discussed in the essay (four). The greater the value, the worse the pattern of discussed topics (a code sketch following this list reproduces this example).

  • Max distance between chain’s nodes: Large distance can be a sign of repetition. The distance between two nodes is the number of text units between those nodes in the grid.

  • Number of chains starting and ending inside another chain: There should be fewer in well-organized essays.

  • Average chain length (Normalized): The length of the chain is the sum of the distances between each pair of adjacent nodes. The normalized feature is divided by the length of the essay.

  • Average chain density: Equal to average chain size divided by average chain length.

  • Topic transition probability: Transition probabilities are the proportions of topic transition types within a text. Transition types include {- -, -X, X-, XX}.
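The sketch below reproduces the normalized Levenshtein example from the first bullet above; the topic sequences are the ones used in that example.

```python
def compress(seq):
    """Collapse adjacent repetitions of the same topic."""
    out = []
    for topic in seq:
        if not out or out[-1] != topic:
            out.append(topic)
    return out

def levenshtein(a, b):
    """Classic edit distance between two topic sequences."""
    dist = [[i + j if i * j == 0 else 0 for j in range(len(b) + 1)]
            for i in range(len(a) + 1)]
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            dist[i][j] = min(dist[i - 1][j] + 1,          # deletion
                             dist[i][j - 1] + 1,          # insertion
                             dist[i - 1][j - 1] + (a[i - 1] != b[j - 1]))  # substitution
    return dist[len(a)][len(b)]

befores, afters = [3, 4, 4, 5], [3, 6, 5]
b, a = compress(befores), compress(afters)       # [3, 4, 5] and [3, 6, 5]
n_topics = len(set(befores) | set(afters))       # 4 distinct topics
feature = levenshtein(b, a) / n_topics           # 1 / 4 = 0.25
```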

The values of the features for the example essay with an Organization score of 4 in Table 1 are shown in Table 9.

Table 9 Feature vector representation of the high-scoring Organization essay from Table 1

Based on the defined features, we imagine generating feedback that helps students address criteria that received low scores. For example, if the value of the discourse structure feature beginning is false, the system could remind students to write a clear introductory sentence where they tell the reader whether or not they believe that the author provides a convincing argument that “winning the fight against poverty is achievable in our lifetime.”

Experiments and Results

Experimental Setup

We configure a series of experiments to test the validity of three hypotheses for the two dimensions. These hypotheses are designed to validate the usefulness of the model in terms of performance, generalizability of the model across different grades, and the utility of the rubrics for designing predictive features:


H1: the rubric-based models can match or even outperform competitive baselines,Footnote 11


H2: the rubric-based models generalize better across students from different gradesFootnote 12 (i.e., across our two datasets), and


H3: the more that cells in the rubric are covered by a feature group, the more predictive utility the feature group will have in isolation.

We use our two datasets in three different ways: 1) cross-validation on each dataset (to test H1 and H3), 2) combining the two datasets into one larger grades 5–8 dataset and performing cross-validation (also to test H1 and H3), and 3) training the model on one dataset and testing on the other (to test H2). The motivation for combining the two corpora into one bigger dataset is that, in a small pilot study (part of an additional experiment analyzing the impact of training sample size on the reliability of AES), we found that each doubling of the RTA training sample size increased Quadratic Weighted Kappa by 0.03.

For all experiments we use 10 runs of 10-fold cross-validation with Random Forest as the classifier (max-depth = 5). We also tried other classification and regression methods, such as Naive Bayes, logistic regression, and gradient boosting regression, and all the conclusions remained the same. Since our dataset is imbalanced, we use the SMOTE (Chawla et al. 2002) oversampling method, which creates synthetic minority-class examples by operating in feature space: the minority class is oversampled by taking each minority-class sample, randomly choosing among its k nearest neighbors, and introducing synthetic examples along the line segments joining the sample to these neighbors. We only oversample the training data, not the testing data.
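A minimal sketch of this setup, assuming the scikit-learn and imbalanced-learn libraries (the actual implementation may differ), is shown below; note that the synthetic oversampling is applied only within the training folds.

```python
import numpy as np
from imblearn.over_sampling import SMOTE
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold

def cross_validated_predictions(X, y, n_splits=10, seed=0):
    """X: numpy feature matrix, y: numpy array of scores (1-4)."""
    predictions = np.zeros_like(y)
    folds = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for train_idx, test_idx in folds.split(X, y):
        # Synthetic minority examples are generated from the training fold only.
        X_res, y_res = SMOTE(random_state=seed).fit_resample(X[train_idx], y[train_idx])
        model = RandomForestClassifier(max_depth=5, random_state=seed)
        model.fit(X_res, y_res)
        predictions[test_idx] = model.predict(X[test_idx])
    return predictions
```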

All performance measures are calculated by comparing the classifier results with the first human rater’s scores; we chose the first human rater because we do not have scores from the second rater for the entire dataset. We report performance as Quadratic Weighted Kappa, a standard evaluation measure for essay assessment systems, and use the corrected paired t-test (Bouckaert and Frank 2004) to measure the significance of any difference in performance.
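For reference, Quadratic Weighted Kappa can be computed, for example, with scikit-learn’s cohen_kappa_score using quadratic weights; the score vectors below are purely illustrative.

```python
from sklearn.metrics import cohen_kappa_score

y_rater1 = [1, 2, 3, 4, 2, 3, 1, 4]   # illustrative first-rater scores
y_pred   = [1, 2, 3, 3, 2, 4, 1, 4]   # illustrative system predictions

qwk = cohen_kappa_score(y_rater1, y_pred, weights="quadratic")
print(f"QWK = {qwk:.2f}")
```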

Baselines for Evidence

As a baseline we choose a unigram model. Unigrams are extracted and filtered down to the top 500 features by the chi-squared statistic, and a Random Forest model is then trained on the resulting feature set. We choose this baseline based on the results reported in Rahimi et al. (2014), which show that the unigram model is a well-performing baseline.
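A sketch of this baseline as a scikit-learn pipeline is shown below; this is an illustration of the described setup rather than our exact implementation, and it assumes the training data contains at least 500 distinct unigrams.

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.ensemble import RandomForestClassifier

unigram_baseline = Pipeline([
    ("unigrams", CountVectorizer(ngram_range=(1, 1), lowercase=True)),
    ("top500", SelectKBest(chi2, k=500)),        # keep the 500 best unigrams
    ("forest", RandomForestClassifier(max_depth=5, random_state=0)),
])

# essays_train / essays_test: lists of essay strings;
# scores_train: corresponding first-rater Evidence scores.
# unigram_baseline.fit(essays_train, scores_train)
# predicted = unigram_baseline.predict(essays_test)
```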

Baselines for Organization

We use two well-performing baselines from recent methods for evaluating the organization and coherence of essays. The first baseline (EntityGridTT) is based on the entity-grid coherence model introduced by Barzilay and Lapata (2005). This method has been used to measure the coherence of student essays (Burstein et al. 2010). It includes transition probabilities and type/token ratios for each syntactic role as features. We performed a set of experiments to find the best configuration,Footnote 13 which we then use in all experiments. It should be noted that this works to the advantage of the entity-grid baseline, since we do not tune parameters for the other models.

The second baseline (LEX1) is a set of features extracted from lexical chaining (Morris and Hirst 1991). We use the lexical chaining method of Galley and McKeown (2003) and extract the first set of features (LEX1) introduced in Somasundaran et al. (2014). We do not implement the second set because we do not have the annotation or the tagger needed to tag discourse cues.

Results and Discussion

Evidence

We first examine the hypothesis that our new features will outperform or at least perform as well as the baselines (H1). Our results support this hypothesis. Run 2 in Table 10 shows that the rubric-based model yields higher performance than the unigram baseline on all three datasets, although the difference is not significant on the 5–6 dataset. Comparing Run 4 with Runs 1 and 2 shows that adding unigrams to our rubric-based model does not improve our results, but adding the rubric-based features to the unigram model does improve performance. This shows that the rubric-based model contains information that is not captured by the unigram model and also captures much of what is already captured by the unigram baseline. The reason is that the NPE and Specificity features are designed to look for the existence and co-occurrence of the important unigrams. Runs 3 and 5 investigate the performance of our model without the fallback word count feature. The results show that our model still outperforms the unigram baseline, although not significantly, and that adding the rubric-based features (except word count) to the unigram baseline improves it significantly on two of the datasets.

Table 10 Cross-validated performance of our rubric-based Evidence model compared to the baseline on both datasets and a combination of the two datasets (5–8)

Looking at the features selected in our feature selection phase in Runs 4 and 5 shows that the rubric-based features (NPE, CON, and most of the Specificity features) are always among the top 500 selected features for all three datasets. Looking at the confusion matrix of the rubric-based model for grades 5–8, we notice that our model performs best on score 1 (F1 = 0.68) and worst on score 3 (F1 = 0.38); the F1 is 0.52 and 0.47 for scores 2 and 4, respectively. The F1 values are similar for all three datasets.

We configured another experiment to examine the generalizability of the models across different grades (H2). In this experiment, we used one dataset for model training and the other for testing. We divided the test data into 10 disjoint sets to be able to perform significance tests on the performance measure.Footnote 14 The results in Table 11 show that for both experiments, the rubric-based model performs significantly better than the baseline, which supports the findings from the cross-validation experiment and hypothesis H2. Comparing the Quadratic Weighted Kappa figures across columns, Models 1 and 3, which include the unigram features, perform better when the training set is bigger. The rubric-based model performs comparably even when we train on the smaller 6–8 dataset and test on the noisier 5–6 corpus. These results suggest that our features are more robust to both a lack of training data and training/test set differences.

Table 11 Performance of our rubric-based Evidence model compared to the baselines

Finally, our last hypothesis is that although each rubric-based feature group should capture useful information, the feature group designed to capture information about specific pieces of evidence, which covers more cells in the rubric, is the most important one. To test this hypothesis, we performed an experiment using each group of features in isolation. The results in Table 12 show that Specificity is the most predictive feature group in isolation. Specificity alone also almost matches or outperforms the unigram baseline, and approaches the performance of the full rubric-based model. The NPE rubric-based feature is also consistently more predictive than either the CON feature or the word count feature, which is not based on the rubric at all.

Table 12 Cross-validated performance evaluation of Evidence feature groups in isolation on the two datasets and their combination

To further investigate the effect of the word count feature, we perform an experiment in which we compare the performance of the rubric-based model with the same model after removing word count, predicting scores for different data subsets defined by Evidence scores: essays rated 1 or 2; essays rated 1, 2, or 3; essays rated 3 or 4; and the full dataset. The results are in Table 13. As can be seen, including word count only significantly improves performance for the [1,2,3] subset and the full [1,2,3,4] dataset, meaning it is useful for discriminating essays with scores of 3 and 4 from essays with scores of 1 and 2. Recall that our rubric-based features most sparsely cover the rows in the score 4 column of the Evidence rubric, where we would thus expect word count to play a fallback role.

Table 13 Cross-validated performance evaluation of the word count feature

In sum, our rubric-based model is advantageous compared to the unigram baseline: it yields higher performance than the unigram baseline on all three datasets (although not significantly higher on the 5–6 dataset); it contains information that is not captured by the unigram model; and its features are more robust to both a lack of training data and training/test set differences. As we hypothesized, the Specificity feature group, which is designed to capture information about specific pieces of evidence and covers more cells in the rubric, is the most important one. Word count is useful for discriminating essays with scores of 3 and 4 from essays with scores of 1 and 2 (recall that we most sparsely cover the rows in the score 4 column of the Evidence rubric, where we would thus expect word count to play a fallback role).

Organization

We first examine the hypothesis that the new features perform comparably or even better than the baselines (H1). The results on the corpus of grades 5–6 (see Table 14) show that the new features (Model 4) yield significantly higher performance than either baseline (Models 1 and 2) or the combination of the baselines (Model 3). The results of Models 5, 6, and 7 show that our new features capture information that is not in the baseline models (since each of these three models is significantly better than models 1, 2, and 3 respectively), but that the baseline features provide no value when added to the rubric-based features (since none of these three models is better than model 4). The best result in all experiments is bolded.

Table 14 Cross-validated performance of our rubric-based Organization model compared to the baselines on both datasets and their combination

We repeated the experiments on the corpus of grades 6-8. The results in Table 14 show that there is no significant difference between the rubric-based model and the baselines, except that in general, models that include lexical chaining features perform better than those with entity-grid features. Although not significant, the best result comes from adding the rubric-based features to the baseline features (Model 7).

The experiments on the combination of the two datasets show that our rubric-based model yields significantly higher performance than either baseline (Models 1 and 2) and is comparable to the combination of the baselines (Model 3). The results of Models 5, 6, and 7 show that our new features capture information that is not in the baseline models since each of these three models is significantly better than models 1, 2, and 3 respectively. The final conclusion is that the first hypothesis that the new features perform comparably or even better than the baselines is supported by the results.

Comparing the columns for Models 1, 2, 3, and 4 shows that the baseline models perform better on the 6–8 dataset, which has higher-quality essays than the 5–6 corpus, even though the 6–8 dataset is smaller. In contrast, the rubric-based model performs the same on both the 6–8 and the noisier 5–6 datasets, and its performance increases when we combine the two corpora.

Looking at the confusion matrices for all three datasets, the performance of our rubric-based model is worst on score 3. On the 5–8 dataset the F1 is 0.497, 0.504, 0.369, and 0.487 for scores 1 to 4, respectively. The F1 values are similar for all three datasets.

We configured another experiment to examine the generalizability of the models across different grades (H2). In this experiment, we used one dataset for model training and the other for testing, and divided the test data into 10 disjoint sets to be able to perform significance tests on the performance measure. The results in Table 15 show that for both experiments, the rubric-based model performs at least as well as the baselines. When training on grades 6-8 and testing the model on the shorter and noisier 5-6 set, the rubric-based model performs significantly better than the baselines. When testing on the 6-8 corpus, the rubric-based model performs better than the baselines (although not always significantly), and adding it to the baselines (Model 5) improves them significantly. Comparing the columns, all models perform better when the training set is bigger.

Table 15 Performance of our rubric-based Organization model compared to the baselines

As for our last hypothesis, we investigate the effect of the rubric-based feature groups in isolation. To do so, we repeated the cross-validated experiments using each group of features in isolation. The results in Table 16 show that Topic Development and Topic Ordering are the most predictive sets of features. This result supports the hypothesis, since these two feature groups cover more cells in the rubric. While the topic-based features may not be better than the baselines, they can be improved; one potential improvement is to enhance the alignment of sentences with their corresponding topics (since we currently use a very simple alignment model). Moreover, we believe that the topic ordering features are more substantive and potentially provide more useful information for students and teachers in downstream applications such as feedback and analytics.

Table 16 Cross-validated performance of Organization feature groups in isolation. The numbers in brackets show the size of the dataset in use

In sum, our rubric-based model has several advantages over the baseline models. First, it yields performance that is either significantly higher than or comparable to the baselines. Second, our new features capture information that is not in the baseline models. Third, the rubric-based model performs the same on both the 6–8 and the noisier 5–6 datasets, whereas the baseline models perform better on the higher-quality 6–8 essays than on the 5–6 corpus. Finally, the rubric-based model is tied to the rubric of the construct. Moreover, in cross-dataset experiments, the rubric-based model performs at least as well as the baselines, although all models perform better when the training set is bigger. Topic Development and Topic Ordering, which cover more cells in the rubric, are the most predictive sets of features.

Conclusion and Future Work

In this study, we attempt to measure two targeted constructs within analytic text-based writing: 1) students’ effective use of evidence and 2) their organization of ideas and evidence in support of their claim. We present results for predicting the scores of the Evidence and Organization dimensions of a response-to-text assessment in a way that aligns with the scoring rubric. We used two datasets of essays written by students in grades 5–6 and 6–8. We designed a set of features aligned with the rubric that we believe are meaningful and easy to interpret given the writing task. Our experimental results show that our task-dependent model (consistent with the rubric) performs as well as, if not better than, the baselines. We also show the potential generalizability of the rubric-based model by performing cross-corpus experiments. Finally, we show that the more a designed feature group covers criteria in the rubric, the more predictive utility the feature group generally has. In sum, our results provide support for all three of the hypotheses motivating our experiments.

There are several ways to improve our work. First, we plan to use a more sophisticated method to annotate text units; currently we use a simple window-based algorithm that looks for word overlaps, and in the future we will incorporate methods such as information retrieval, sentence similarity, and text similarity approaches based on word-embedding representations. Second, we are working towards replacing the manually extracted topics and examples with automatically extracted ones, as our current approach requires these to be manually defined by experts (although this task only needs to be done once for each new text and prompt). We proposed (Rahimi and Litman 2016) to use a data-driven model enabled by LDA topic modeling to automatically extract the topical components (i.e., topic words and significant N-grams (N=1) as examples for each topic) needed for our scoring approach, and our preliminary results are promising. Third, we will design and validate a system for providing automated formative feedback to students on their responses. We will investigate the extent to which our automated essay scoring system serves this purpose; specifically, we will build on our research to study the influence of the formative feedback generated by the AES system on the quality of students’ writing and teachers’ instruction. Fourth, we need to develop additional features to fully operationalize both the Evidence and Organization rubrics. Fifth, we have a new dataset from a second prompt which we will use to further test the generalizability of our model; we hypothesize that the same approach will work on the data for the new prompt. Finally, we need to tune the parameters that were chosen intuitively or set to default values.