Introduction

A typical peer assessment practice when learning to write is asking students to reciprocally review other students’ work and generate peer feedback. Peer feedback is an important alternative to teacher feedback and is used frequently because it enhances students’ learning by giving them learning opportunities in their roles as both author and reviewer while not increasing teacher workload (Kern et al. 2003; Cho et al. 2007; Lundstrom and Baker 2009; Cho and MacArthur 2011; Nicol et al. 2014). In the domain of writing, peer feedback is usually referred to as “peer review” or “peer assistance when writing” (Gielen et al. 2010). Peer feedback can take many forms: it may be face-to-face or written feedback, and feedback may involve numeric ratings, free-text comments or both. Our current research focuses on peer feedback with free-text comments.

While peer feedback with free-text comments is a promising approach for helping students improve their writing, feedback comments from peer reviewers can be of mixed quality (Nilson 2003; Nelson and Schunn 2009; Cho and MacArthur 2010; Gielen et al. 2010). Prior work has operationalized feedback quality from different perspectives: (1) accuracy and consistency across reviewers and/or with the teacher’s feedback (Steendam et al. 2010), and (2) content and/or style characteristics (see the work by Gielen et al. (2010) for a review of prior research from this perspective). The advantage of the second approach is that the proposed feedback characteristics are task-independent, so the acquired peer assessment skills are transferable to other settings. While feedback characteristics were usually derived from expert assessment reports and grounded in learning theories (Gielen et al. 2010), follow-up research has empirically investigated their contribution to writing performance. For example, Nelson and Schunn (2009) and Lippman et al. (2012) showed that feedback comments are more likely to be implemented in a paper revision when the comments are localized, that is, when they pinpoint the location of the problem mentioned in the feedback. Gielen et al. (2010) found that feedback comments with justification, that is, comments that include an explanation of the judgment, significantly improve writing performance.

As first steps towards helping student reviewers improve the quality of their peer feedback, Natural Language Processing and Machine Learning have been used to build models for automated peer feedback assessment (Cho et al. 2008; Xiong and Litman 2010; Ramachandran and Gehringer 2011; Nguyen and Litman 2013). For example, Ramachandran and Gehringer (2013) developed a model that automatically provides formative assessment of peer feedback on metrics such as review content type and review tone type. In a similar vein, research has been conducted to automatically detect whether feedback comments lack localization or other desirable properties (Xiong et al. 2010; Nguyen and Litman 2013; 2014). To date, however, while models for assessing such properties of peer feedback have been evaluated intrinsically (i.e., with respect to predicting gold standard manual annotations), extrinsic evaluation of their application in real-world tasks (e.g., being incorporated into a peer-review system to improve peer feedback quality) has received little study. To the best of our knowledge, the work by Ramachandran and Gehringer (2013) is the only prior research evaluating an automated peer feedback assessment system. However, their system only provided feedback to student reviewers at the end of the peer feedback process, and no analysis of the impact of the system’s evaluative feedback on peer feedback quality was conducted. Currently no research has investigated peer-review systems that provide interactive feedback to help reviewers improve their comments within the peer feedback process. Moreover, while intrinsic evaluations have shown that models for automatically evaluating peer feedback quality can yield high accuracy when trained and tested on peer comments from the same writing assignment, the performance of such models on more challenging and realistic types of test data (e.g., from different writing assignments, academic disciplines, or student grade levels) has typically not been examined.

To address these issues, we enhanced a peer-review system by developing an automated formative feedback strategy that uses Natural Language Processing to automatically evaluate the quality of peer feedback with respect to localization, then uses these evaluations to trigger dynamically-generated formative feedback designed to improve peer feedback localization. Our peer feedback localization prediction model processes all peer feedback comments as they are submitted to the peer-review system, and provides real-time feedback to peer reviewers indicating whether their comments were identified as localized or not. Student reviewers can choose to revise their comments regarding feedback localization and resubmit their comments, or ignore the system’s feedback and submit the original peer comments. We intrinsically and extrinsically evaluated the automated formative feedback strategy on four writing assignments, using peer comment data from system deployments in one high-school Math and two college Psychology classes. The goal of our comprehensive evaluation of a peer-review system is to find answers to the following three research questions:

  1. How precisely does our automated formative feedback strategy evaluate peer feedback localization? Insight into prediction accuracy will help us further improve system performance.

  2. Are student reviewers likely to agree or disagree with the system’s formative localization feedback? This question helps us understand reviewer behaviors toward the system feedback, and how such responses relate to system performance.

  3. How does the system’s feedback impact peer feedback revision? In other words, we want to know whether the designed formative feedback helps student reviewers improve their feedback comments with respect to localization. Answering this question will give us insight into student learning and its relation to interface design, so that we can further optimize the system.

Feedback Localization

In the domain of writing, feedback localization is defined as pinpointing, within the paper being reviewed, the source or location of the issue discussed in the feedback comment (Nelson and Schunn 2009). For an operational definition of feedback localization that supports our prediction model, we consider three types of localization. For each type, example comments are provided in italic text, with the localization text in boldface.

  1. Explicit localization by position: specifying the position in the reviewed paper, using absolute page, paragraph, and/or sentence numbers, a relative positional expression, or a section heading:

    • Your entire abstract should take up page 2, therefore, the Intro should start on page 3 and should not say introduction, rather should say your title (APA)

    • In the fifth paragraph on the first page, you say “holds much importance”. This just sounds awkward and should be revised.

  2. Implicit localization by content/topic: referring to the content/topic of the reviewed aspect of the paper:

    • I would check the sentence where you begin to talk about perceiving and discriminating things based on features a look.

    • In the third paragraph you give the reader an idea of what featural processing and configural processing is, it would be helpful to do this earlier in the paper.

  3. Quoted text: quoting excerpts from the reviewed paper:

    • You had a few instances of awkard wording throughout your paper. For example, in the third paragraph, you say, “The mothers’ ages were around 37.”

    • You have thrown me off a bit when you said “The rats will stay alive the whole first year they are there”. Why and where are they? This is a detail to include when nothing or nobody will go missing.

It is possible for a reviewer to refer to the text of the entire document, e.g., “The biggest problem was grammar and punctuation”, but we find that this type of document-level location information has very limited benefit to student authors. Our student authors, especially high-school or lower-grade students, are novice writers with very little writing experience. Peer feedback that mentions a general issue without any particular example leaves them confused about both understanding the issue and revising their writing. In fact, our annotated data shows that comments with only such high-level localization received a very low implementation rate. Therefore, we do not consider a reference to the entire text of the document as localization in this study.

Paper Structure

The rest of this paper is organized as follows. The next section discusses work that motivated our current research, ranging from educational to technical aspects. We then describe the SWoRD peer-review system as the software platform for developing and evaluating our automated formative feedback strategy for improving peer feedback localization. Next, we split our research into three studies and describe them in the three following sections. Our first study (Study 1 section) focuses on the implementation of the automated formative feedback strategy and addresses our first research question. We describe the training data and Machine Learning algorithm used to build the peer feedback localization prediction model, and then present prediction performance at both the peer feedback comment and peer feedback submission levels for different classes. Our second study (Study 2 section) analyzes data collected from different deployments of the automated formative feedback strategy to evaluate student reviewer responses to the system’s feedback (i.e., agree vs. disagree), addressing our second research question. Our third study (Study 3 section) evaluates the impact of the system’s feedback on feedback revision by student reviewers, addressing the third research question. The last two studies thus investigate reviewers’ reactions to the system’s formative feedback in terms of whether they agreed or disagreed with it, and how they revised their comments when they agreed, respectively. The final section presents a concluding discussion of the contributions, limitations and implications of our research.

Related Work

Improving Peer Feedback Quality

In educational contexts, peer assessment has been shown to help learners improve self-evaluation skills and better understand concepts being studied. As a result, the practice is being used with increasing frequency across disciplines, especially in content-area courses (Kern et al. 2003; Topping 2009; Lundstrom and Baker 2009). Different aspects of peer assessment such as learners’ perceptions (Mulder et al. 2014), impact on revision (Kaufman and Schunn 2011), design principles and effective uses (Berg et al. 2006; Landry et al. 2014), and cognitive processes of reviewing activities (Nicol et al. 2014) have been studied to best promote student learning through the practice.

Regarding peer feedback in the form of free-text comments, studies have empirically demonstrated important characteristics of written peer feedback that relate to writing performance and feedback implementation. Gielen et al. (2010) found that the presence of justification (i.e., explanation of judgment) significantly improved writing performance. Similarly, a study by Strijbos et al. (2010) revealed that elaborated specific feedback, e.g., feedback that addresses knowledge about concepts and mistakes, leads to improved performance and outcomes. In a different study, Nelson and Schunn (2009) argued that feedback features (e.g., summarization, specificity, explanation, scope, and effective language) may not directly affect feedback implementation but instead impact implementation through internal mediators (i.e., the feedback receiver’s understanding and agreement) because of the complex nature of writing performance. Their study found that two components of the specificity feature, which were offering a solution and localization, significantly correlated with an understanding of the problem. Understanding, in turn, was found to have a significant positive relationship with feedback implementation.

To further promote the desired quality characteristics of peer feedback, research has considered using instructions and question prompts to elicit high-quality feedback (Gan and Hattie 2014; Gielen and De Wever 2015). In prior work, Nilson (2003) proposed a set of feedback prompts that do not ask for judgment or opinion, which may evoke emotion, but instead require student reviewers to attend to the details of the peer’s work, which encourages feedback localization. Given both the theoretical and empirical recognition that localization is an important characteristic of written feedback, we are motivated to develop an automated formative feedback strategy aimed at increasing the amount of localization in the written comments that students produce during peer feedback.

Besides peer feedback in the form of end-comments as discussed so far, graphical feedback interfaces that allow reviewers to directly annotate papers have also been used (see the NB project - nb.mit.edu - for such a system). While such interfaces inherently bind comments to problem sources in the papers, research has shown that on-paper annotations primarily encourage feedback on low-level text issues (e.g., grammar, punctuation, spelling) and lead to simple erasures rather than substantive revisions (McCarthey et al. 2013; 2014). Similar findings were reported by Ellis (2011), who conducted two parallel peer feedback classes (i.e., same teacher and identical feedback instructions): one had students write feedback on printed papers, while the other used an online blog system for giving feedback on papers. The on-paper class had a much higher incidence of surface-revision comments, while the blog-based class yielded more macro structure-revision comments, e.g., comments addressing the main meaning or an overall summary of the entire text. In addition, end-comment feedback has been found to be more useful when it refers to multiple locations (Ferris et al. 2013). Although the primary goal of our current research is peer feedback localization, we also care about the overall helpfulness of peer feedback, balancing surface-level and meaning-level comments. Therefore, we currently focus on encouraging localization in end comments to keep the system simple yet robust. In the future, a peer-review system that offers a combined editing mode is worth exploring.

Automated Peer Feedback Assessment

Based on findings such as those discussed above, research in Computer Science has used Natural Language Processing and supervised Machine Learning to automatically evaluate whether a free-text feedback comment exhibits a desirable quality, with downstream goals such as automatically prompting student reviewers to improve their feedback quality. Xiong and Litman (2010) developed models for predicting localization in peer feedback comments on student papers, using features derived from regular expressions and dependency parse trees. Nguyen and Litman (2013) developed a feedback localization prediction model tailored to feedback on diagrams rather than papers, by considering common words between feedback comments and the target diagram. Similar methods have been used to identify feedback helpfulness labels (e.g., helpful versus not helpful) (Cho 2008), numeric helpfulness ratings (Xiong and Litman 2011), and other measures of feedback quality such as the presence of solutions to problems (Xiong et al. 2012; Nguyen and Litman 2014). Feedback quality has also been determined by content type (e.g., the presence of problem identification and solution suggestion), relevance, coverage, tone (e.g., positive, negative or neutral), and plagiarism (Ramachandran and Gehringer 2011; Ramachandran et al. 2016). Often these measures are not independent; e.g., we found in our prior work that the percentage of localized comments contributed to improving performance when predicting numeric helpfulness ratings (Xiong and Litman 2011). With similar motivation but from a different perspective, Ramachandran and Gehringer (2015) created a model to identify the content type of peer feedback, i.e., summative, advisory, or problem identification; their study has a potential application in incorporating peer feedback, based on helpfulness and content types, into automated essay assessment. In this paper, instead of focusing on developing new methods for automatically predicting the presence of peer feedback features (namely localization), we focus on integrating automated prediction research into a working peer-review system with the goal of improving peer feedback localization. Unlike prior work, we conduct not only traditional prediction performance evaluations with respect to predicting localization, but also evaluations focused on triggering system feedback and quantifying students’ responses. To do so, we intrinsically (Study 1) and extrinsically (Studies 2 and 3) evaluate the peer-review system using peer feedback test data from multiple classroom deployments.

Use of Automated Peer Feedback Assessment

To the best of our knowledge, only a few studies analyze the helpfulness of automated peer feedback assessment in working peer-review systems. Ramachandran and Gehringer (2013) conducted a small user study (24 participants) of a peer-review system that incorporated an automated peer feedback assessment feature. Their results showed that student reviewers found system feedback regarding peer feedback’s content type and plagiarism to be informative. However, their system provided feedback to reviewers only at the end of the reviewing practice. We believe interactive feedback gives student reviewers different learning opportunities. Therefore, our strategy for using automated peer feedback assessment includes analyzing every single peer feedback comment, highlighting comments that need revision, and allowing reviewers to revise and resubmit their feedback. We perform novel evaluations examining how student reviewers respond to the system’s formative feedback interactively, with respect to both accepting the system’s feedback (Study 2) and increasing localization in revised peer feedback (Study 3).

Formative Feedback Design

To support the main goal of formative feedback, which is to improve learning, sets of guidelines on feedback characteristics, feedback timing, and feedback delivery have been proposed (Hattie and Timperley 2007; Shute 2008; Narciss 2013). Narciss (2013) further emphasized the tutoring function of formative feedback, which tutors “students to detect errors, overcome obstacles and apply more efficient strategies for solving learning tasks.” However, designing formative feedback strategies for digital learning environments is challenging, and many digital educational systems do not provide tutorial feedback strategies but only simple feedback strategies offering knowledge of results and/or correct responses (Narciss 2013). In the context of our formative feedback, providing system feedback on the correctness of a learner response, i.e., the presence of localization in a peer feedback comment, is challenging because the system can only rely on the bounded performance of feedback localization prediction. Despite such performance limitations, our design of formative feedback used our predictions in conjunction with guidelines regarding feedback display, e.g., flagging of not-localized comments, highlighting of localized comments and of the localization text within those comments, and examples of pre-selected localized comments.

Feedback highlights and error flags have been shown to be effective in automatically tutoring students. Heift (2004) found that students are more likely to revise their mistakes when given meta-linguistic feedback highlighting the mistake than when given feedback with no highlighting but with explanation. In another study that examined feedback highlighting but in a different context, Kumar (2010) showed that when error-flagging was provided during tests on introductory programming concepts, student scores improved. To implement error-flagging, correct student answers were displayed in green and incorrect answers were displayed in red; in addition, no reasons why the answers were incorrect were provided.

With respect to scheduling system feedback, because our students are not trained on feedback localization we do not expect them to know when they need a hint, and thus we choose to trigger system feedback proactively whenever a student’s feedback lacks sufficient localization. In a prior study, Razzaq and Heffernan (2010) compared two approaches for giving hints during tutoring: proactively when students make errors, versus on-demand when students ask for a hint. They found no difference in learning gains for students who did not ask for many hints. Moreover, students have been shown to pay more attention to immediate than to delayed feedback (Van der Kleij et al. 2012).

Automated Formative Feedback Strategy in SWoRD

SWoRD Peer-review System

As a software platform for developing and evaluating our strategy of automated formative feedback for improving peer feedback localization, we use SWoRD (Scaffolded Writing and Rewriting in the Disciplines), a web-based reciprocal peer-review system (Cho and Schunn 2007). A typical peer-review writing exercise using SWoRD involves four main phases: (1) students (as authors) submit first draft papers, (2) students (as reviewers) submit feedback in the form of end-comment reviews and numeric ratings on peers’ papers, (3) students (as authors) receive peer feedback on their papers, and (4) students (as authors) submit a paper revision that addresses feedback from their peers. See Appendix B for a more detailed description with examples of SWoRD’s work-flow.

The original version of SWoRD only facilitated the document management and paper-reviewer assignment aspects of the peer feedback process as described above. To further enhance the utility of SWoRD, we have implemented automated prediction of and formative feedback on peer feedback localization in phase 2 of SWoRD (i.e., peer feedback submission). The goal of our work is to improve the localization of written feedback comments on a paper before they are shown to the author, by automatically providing formative feedback to student reviewers whenever localization is poor at the time of feedback submission.

In particular, we first integrated an improved version of the feedback localization prediction model developed by Xiong and Litman (2010) into SWoRD. Next, we iteratively designed and implemented automated formative feedback to improve students’ use of localization in peer feedback. The work-flow of the automated formative feedback strategy in phase 2 of SWoRD is as follows. (2.1) Whenever peer feedback is submitted, the feedback localization prediction model is first used to predict whether each feedback comment is localized or not. (2.2) If the submitted feedback is predicted to have a ratio of localized comments below a threshold of 0.5 (justified in detail in Study 1), formative feedback is triggered automatically. In the formative feedback, e.g., Fig. 1, the system displays an on-screen message which suggests comment revision and provides advice for doing so (part A of the figure). Comments are displayed in the text boxes (parts B, C) for the reviewer to edit. (2.3) Finally, the reviewer can choose to revise his/her feedback comments and resubmit (click the left button in part A) or to submit the peer feedback without revision (click the right button), which implies agreement or disagreement with the system’s feedback, respectively.
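To make the trigger rule in step 2.2 concrete, the following sketch shows one way the check could be implemented. It is a minimal illustration, not SWoRD’s actual code; the function names and the `predict_localized` callable are assumptions.

```python
# Minimal sketch of the step-2.2 trigger rule; names are illustrative, not SWoRD's API.
LOCALIZATION_THRESHOLD = 0.5  # submissions below this localized-comment ratio get formative feedback

def should_trigger_formative_feedback(comments, predict_localized):
    """comments: list of free-text comments in one peer feedback submission.
    predict_localized: callable mapping a comment to True (Localized) or False."""
    if not comments:
        return False
    predictions = [predict_localized(c) for c in comments]
    localized_ratio = sum(predictions) / len(predictions)
    return localized_ratio < LOCALIZATION_THRESHOLD
```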

Fig. 1

Formative feedback interface added to SWoRD during the 2013 deployment of the automated formative feedback strategy. Feedback comments are from a Research Methods class in Psychology. Part A: the system’s feedback message, and three possible student reactions which are revise and resubmit – Revise button (left), view examples of localized comments – Example button (middle), and submit revision – Disagree button (right). Part B: comments predicted as localized are highlighted in green. Part C: comments predicted as not-localized are displayed without highlighting

Iterative Design of Automated Formative Feedback

We have iteratively developed the automated formative feedback strategy across two different system deployments in actual classrooms. Given the lessons learned from our first deployment in 2013 (Fig. 1), we modified the formative feedback strategy before our second deployment in 2014 (Fig. 2). While the main work-flow was the same across deployments, we changed how the system processed peer feedback submissions versus resubmissions, and also enhanced the formative feedback interface.

Fig. 2

Formative feedback interface added to SWoRD during the second deployment of the automated formative feedback strategy. Feedback comments are from a Cognitive Psychology class. Differences from the first deployment are as follows. Part A: the system’s feedback message was changed to reflect the changes in Parts B and C below. Part B: Localized comments are also displayed with a thumbs-up icon, and the detected location signals are also highlighted in bold. Part C: for comments predicted as not-localized, the comment box turns red and buttons requiring reviewer feedback are now displayed

For both of our formative feedback strategy deployments, we define a reviewing session as a work-session that starts when the interface (see Fig. 5 in Appendix B) is opened for a reviewer to enter his/her comments, and terminates when the feedback is submitted successfully to SWoRD or the reviewer closes the interface. In the 2013 deployment, the formative feedback strategy evaluated peer feedback comments for localization both during the first feedback submission and during any resubmissions in a reviewing session. In that deployment, feedback comments were submitted successfully only when the (re)submission passed the peer feedback localization check or when the reviewer clicked the Disagree button in the formative feedback interface. That means a student reviewer could receive system feedback more than once in a session if he/she kept failing to provide location information in comments. However, an analysis of the 2013 data showed that reviewers were much more reluctant to edit their comments when the system’s feedback occurred on later resubmissions. In particular, among all edits made in revisions due to system feedback, on average only 12 % came from resubmissions while 88 % were made to first submissions. Also, only 17 % of the comments in resubmissions that received system feedback were edited, compared to 31 % for first submissions.

Therefore, in our 2014 deployment the automated formative feedback strategy only checked the first feedback submission in each reviewing session for peer feedback localization. This means that when a reviewer clicked the Revise button, the resubmitted comments no longer went through the feedback localization prediction procedure but were instead submitted right away. Thus, while the former deployment allowed many resubmissions during a reviewing session, the latter deployment allowed at most one resubmission per session.

Along with the above difference in processing resubmissions, the 2014 deployment made changes to the formative feedback interface to better call student reviewers’ attention to peer feedback localization. Unlike the original formative feedback interface (Fig. 1), the improved interface used boldface to highlight examples of localization text within the predicted localized comments (Fig. 2). This change was designed to provide student reviewers with real examples of feedback localization from their own comments. Because the system now highlights localization text, we simplified the formative feedback message (“Make sure that...” in part A) so that it only shows a localization template (randomly chosen from three pre-defined templates - the first lines of the 3 examples in Fig. 9). In addition, the formative feedback interface includes buttons asking reviewers to provide feedback on the system’s feedback localization prediction (part C). This change was made to encourage students’ reflection on localization. We hypothesized that asking students to reason about feedback localization in their own comments would also promote peer feedback revision. The reviewer responses to these buttons also provide new annotated examples for supervised Machine Learning that may potentially improve the performance of our prediction model for future deployments.

Study 1: Evaluating Peer Feedback Localization Prediction

Introduction

Our first study involves two intrinsic evaluations: of the precision of predicting non-localization at the comment level, and of the precision of using these predictions to trigger system feedback on peer feedback localization. For peer feedback localization prediction, we adapt the model built in our initial work (Xiong and Litman 2010). Our initial set of predictive features was developed using Natural Language Processing and evaluated in a laboratory setting using cross-validation on peer feedback comments collected in a college History class. To improve model robustness for the current study, where the model was tested on data from courses for which we had no training data, we developed additional features, annotated more diverse training data, and retrained the prediction model. In this section we describe how we extract features from peer feedback comments and associated papers. We also discuss how we bias the Machine Learning process to yield a model with high precision for triggering system feedback.

Method

Peer Feedback Data for Evaluation

In the present study, we evaluate our pre-trained feedback localization prediction models using data from two system deployments with four peer feedback assignments spanning three different classes. The first version of our feedback localization model was deployed in a Research Methods Lab at the University of Pittsburgh during Fall 2013. The second version was deployed in a Cognitive Psychology class at the University of Pittsburgh during Spring 2014, as well as in an Interactive Mathematics class at a Pittsburgh charter high school during Fall 2014. For both the Research Methods and Cognitive Psychology classes, the collected data consists of peer feedback comments on the first drafts of one writing assignment per class. For the Interactive Mathematics class, our data includes feedback comments on the first drafts of two different writing assignments. A general description of the four peer feedback datasets is given in Table 1. Appendix A contains a detailed description of the four writing assignments from which we collected the feedback comments.

Table 1 General description of peer feedback datasets

To support the evaluation of our feedback localization models and prepare data for our second and third studies, we collected all first peer feedback submissions which triggered the system’s feedback, as well as their immediately subsequent resubmissions (if any). By pairing each comment with its revision, we aim to evaluate how the system feedback impacted student reviewers’ revision of their peer feedback. In addition, since reviewers can edit and resubmit their previously submitted feedback, we observed a number of edited comments that did not immediately follow system feedback. Thus, we also collected original and revised comments where the revision occurred without immediate system feedback, but where the reviewer had previously received system feedback. This data enables us to explore whether there were retention effects of the system feedback for improving feedback localization.

Because peer feedback comments were collected from the system’s log data, each comment was associated with a predicted localization label. Thus, by comparing human-annotated labels with predicted labels, we could measure the prediction performance of our deployed models. Following the localization annotation scheme used by Lippman et al. (2012), an annotator who had an inter-rater Kappa of 0.8 when coding prior peer feedback data was chosen to code the collected peer feedback comments. A comment is coded as Localized if it contains at least one text span indicating where in the target paper the comment applies, and as NOT-Localized otherwise.

Descriptive statistics of the peer feedback data for our current study are given in Table 2. Because not all students completed every assignment and we only analyze complete reviews that were properly submitted, there are differences in the number of student authors and reviewers across assignments. As shown in the %SystemFeedback row of the table, the college classes have smaller ratios of feedback submissions that triggered the system’s feedback than the high-school class. In particular, both college classes have very low system feedback ratios: 7 % for Research Methods and 3 % for Cognitive Psychology. One reason could be that the peer feedback prompts (see Appendix A) in the two college classes were more specific, with some targeting low-level writing issues, e.g., spelling/grammar, and others targeting particular sections in the writing, e.g., abstract, introduction. As a consequence, location information might have been more commonly added to the comments to properly address the prompts.

Table 2 Peer feedback data

Prediction Features for Feedback Localization

We use 11 features to learn our feedback localization prediction model, covering the different types of feedback localization studied in this paper (explicit localization by position, implicit localization by content/topic, and quoted text from the reviewed paper; see the definitions in the Feedback Localization section). Of the 11 features, 6 came from our laboratory study of feedback localization prediction (Xiong and Litman 2010). In particular, the regular expression feature was designed to model explicit localization by position, while the domain word and overlapping window features were designed to model the content/topic and quoted text types of localization. We, however, eliminated the fourth group of syntactic features proposed in the prior work, as performing syntactic parsing led to unacceptable computation time in our deployment.

  • Word count: number of words in the comment.

  • Quoted word counts: number of words in quoted text in the comment. Quotations in comments are recognized by the occurrences of double-quote symbol (e.g., you say, “The mothers’ ages were around 37.”).

  • Comment order: order of the comment in the feedback.

  • Regular expression tag: a Boolean feature that indicates whether any of a predefined set of regular expressions (e.g., on page 5, the section about) are matched in a given comment. We developed two sets of regular expressions. The element set consists of expressions to extract location information regarding section, paragraph, and sentence. The construction set includes patterns that express the introduction, thesis, and conclusion of the paper.

  • Domain word count: intuitively, localized comments tend to use words from a paper’s topic domain. Examples of domain words extracted from History writings in our data are: ‘rights’, ‘states’, ‘political’, ‘democracy’, ‘government’, ‘constitution’. To build a domain vocabulary, the training data was pre-processed to extract bigrams with term frequency–inverse document frequency (TF-IDF) scores above average. The unigrams constituting these bigrams were then considered to be the domain vocabulary. The feature itself counts the number of domain unigrams in the comment (a sketch of this and several other features follows this list).

  • Overlapping window size: an overlapping-window algorithm (Ernst-Gerlach and Crane 2008) was used to search for common text spans between a feedback comment and the paper being reviewed. The algorithm iteratively searches through the paper for windows that match the most likely text span in the comment, and merges any two windows that are found to overlap. A larger merged window suggests more overlapping textual content, so we consider the length of the maximal window as one of our localization features. While the quoted word count feature is expected to recognize type-3 localization (quoted excerpts, see our definition of localization types), the overlapping window size feature targets type-2 localization (reference to content/topic), which involves the paper’s original terms but may use them in different sentence structures.
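As a concrete illustration of some of the features above (word count, quoted word count, comment order, regular expression tag, and domain word count), here is a minimal Python sketch. The regular expressions, the domain vocabulary, and the quotation handling are simplified placeholders rather than the patterns used in the deployed system.

```python
import re

# Toy patterns and vocabulary for illustration only; the deployed sets differ.
ELEMENT_PATTERNS = [r'\bpage\s+\d+\b', r'\bparagraph\b', r'\bsentence\b', r'\bsection\b']
CONSTRUCTION_PATTERNS = [r'\bintroduction\b', r'\bthesis\b', r'\bconclusion\b']
DOMAIN_VOCABULARY = {'rights', 'states', 'political', 'democracy', 'government', 'constitution'}

def extract_basic_features(comment, comment_order):
    tokens = comment.lower().split()
    quoted_spans = re.findall(r'"([^"]*)"', comment)   # text between straight double quotes
    regex_tag = any(re.search(p, comment.lower())
                    for p in ELEMENT_PATTERNS + CONSTRUCTION_PATTERNS)
    return {
        'word_count': len(tokens),
        'quoted_word_count': sum(len(span.split()) for span in quoted_spans),
        'comment_order': comment_order,
        'regex_tag': int(regex_tag),
        'domain_word_count': sum(1 for t in tokens if t.strip('.,!?') in DOMAIN_VOCABULARY),
    }
```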

To further model the linguistic signals of localization in feedback comments (to both increase quantitative model predictive performance, and to better identify the localization text spans within comments that will be highlighted in the system’s feedback), we introduce 5 additional features that can be characterized into two types:

  • Location phrase features: 4 features fall under this category. The basic idea is to mine words and phrases that are good signals of positional localization in a semi-supervised manner. While this feature set has a similar function as the regular expression feature, these features are based on a data-driven approach to increase coverage compared to the pre-defined list of regular expressions.

  • Similarity score sum feature: this feature supplements the domain word and window size features described above (which are based on exact lexical matching) by incorporating ideas from the Computational Linguistics literature on detecting paraphrases. While the idea of using lexical similarity to reason about semantic similarity at higher levels, e.g., sentence and paragraph, is very popular (Corley and Mihalcea 2005; Li et al. 2006; Islam and Inkpen 2008), we instead form different abstractions of the original sentences and apply different distance metrics to measure the similarity of every pair of abstracted strings (Malakasiotis 2009), in order to keep our deployment simple.

Our set of location phrase features is created as follows. Using a development set of peer feedback comments from a Cognitive Psychology class in 2007, we first collected a list of 14 location seeds which appeared to be good lexical localization signals and occurred at least 50 times:

[figure a: the list of the 14 location seeds]

For each seed, we then found all words in the development data that occurred in the same context as the seed. Two words have the same context if they have the same preceding and following tokens in the corpus, e.g., first paragraph of and first part of. Next, for each word in the set of location seeds and their same-context words, we considered a bigram of the word and its preceding word to be a location bigram if the bigram indicates localization while the preceding word alone is a general term that can be used in not-localized comment as well. The following location bigrams are used in our model:

[figure b: the list of location bigrams used in our model]

Finally, given the above, the location phrase feature set for each comment includes the following 4 features: the number of element patterns, the number of construction patterns, the number of location seeds, and the number of location bigrams. Element and construction patterns are based on the pre-defined regular expressions of the regular expression tag feature; we separate them here to better model different subtypes of positional localization. Feedback comments and signal words are stemmed before being counted.
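A rough sketch of the mining procedure just described is shown below, assuming the development corpus is available as tokenized comments. The seed handling is simplified and the helper names are our own; the actual pipeline also stems tokens before counting.

```python
from collections import defaultdict

def same_context_words(corpus_tokens, seeds):
    """Expand the seed set with words that share a (preceding, following) context.

    corpus_tokens: list of token lists, one per development-set comment.
    """
    contexts = defaultdict(set)                 # (prev, next) -> words seen in that slot
    for tokens in corpus_tokens:
        for i in range(1, len(tokens) - 1):
            contexts[(tokens[i - 1], tokens[i + 1])].add(tokens[i])
    expanded = set(seeds)
    for slot_words in contexts.values():
        if slot_words & set(seeds):             # a seed occurred in this context
            expanded |= slot_words              # keep every word sharing that context
    return expanded

def count_location_bigrams(comment_tokens, location_bigrams):
    """Count occurrences of (preceding word, location word) bigrams in a comment."""
    return sum(1 for bigram in zip(comment_tokens, comment_tokens[1:])
               if bigram in location_bigrams)
```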

With respect to the similarity feature, given the maximal window in the paper under review returned by the overlapping-window algorithm, we also collect the two preceding and two following sentences of the overlapping window, for a maximum total of 5 sentences. We generate all possible pairs consisting of a comment sentence and one of the 5 paper sentences found. For each pair \(\mathsf{P}=\langle S_{c}, S_{w}\rangle\), in which \(S_{c}\) and \(S_{w}\) are the comment sentence and a paper sentence found by the maximal window algorithm, respectively, we extract from the longer sentence all subsequences seq of consecutive tokens that have the same number of tokens as the shorter sentence (the pivot sentence p). We define different similarity functions that apply to pairs of strings, and given a similarity function f, the similarity score of \(\mathsf{P}=\langle S_{c}, S_{w}\rangle\) is aggregated as follows:

$$Sim^{f}(\mathsf{P}) = \max_{seq}\left( f(seq, p)\right)$$

Then for each comment C, we calculate its maximum similarity score:

$$Sim^{f}(\mathsf{C}) = \max_{\mathsf{P}}\left(Sim^{f}(\mathsf{P})\right)$$

We use six similarity metrics:

  1. Levenshtein similarity: the inverse of the Levenshtein distance, i.e., the minimum number of edits (insertions, deletions, substitutions) needed to transform one sequence into the other, normalized by sequence length.

  2. Hamming similarity: the number of positions at which corresponding elements are the same, normalized by sequence length.

  3. Variance similarity: the total number of occurrences of elements common to the two sequences, normalized by the total length of the two sequences.

  4. Trigram similarity: the variance similarity of the two corresponding 3-gram sequences.

  5. Binary similarity: the number of common elements normalized by sequence length; unlike variance similarity, element repetition is not counted.

  6. Cosine similarity: each input sequence is transformed into a frequency vector whose dimensions correspond to the elements of the two sequences, with the value at each dimension being the total number of occurrences of the corresponding element; cosine similarity is the cosine of the angle between the two vectors, computed using the Euclidean dot product formula.

We then apply them to four different abstractions of the original sentences:

  1. Sequence of original tokens. For example, {The, introduction, does, not, have, a, clear, thesis, statement, .}

  2. Sequence of part-of-speech tags, e.g., {DT, NN, VBZ, RB, VB, DT, JJ, NN, NN, .}

  3. Sequence of tokens recognized as nouns, i.e., with a part-of-speech tag in {NN, NNS, NNP, NNPS}

  4. Sequence of tokens recognized as verbs, i.e., with a part-of-speech tag in {VB, VBD, VBG, VBN, VBP, VBZ}

Overall we thus have 24 similarity functions f. Given a feedback comment C, its similarity score sum feature is the sum of its 24 similarity scores, \({\sum }_{f}Sim^{f}(\mathsf {C})\). While the pivot sentence p and the token sequence seq have the same number of tokens, their sequences of nouns (or verbs) may have different lengths. In this case, we take the shorter as the pivot sequence and extract all subsequences of the longer in the same way as for \(\mathsf{P}=\langle S_{c}, S_{w}\rangle\).
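The sketch below illustrates how the similarity score sum could be computed, using only two of the six metrics (Hamming and binary) and a single abstraction (original tokens); the deployed feature sums over all 24 metric/abstraction combinations. It assumes non-empty, tokenized sentences, where the paper sentences are the maximal-window sentence plus its two preceding and two following sentences.

```python
from itertools import product

def hamming_sim(seq, pivot):
    # positions with matching elements, normalized by (equal) sequence length
    return sum(a == b for a, b in zip(seq, pivot)) / len(pivot)

def binary_sim(seq, pivot):
    # common elements without counting repetition, normalized by pivot length
    return len(set(seq) & set(pivot)) / len(pivot)

METRICS = [hamming_sim, binary_sim]   # the full feature uses 6 metrics x 4 abstractions

def windows(longer, size):
    """All consecutive-token subsequences of `longer` containing `size` tokens."""
    return [longer[i:i + size] for i in range(len(longer) - size + 1)]

def pair_score(comment_sent, paper_sent, metric):
    pivot, longer = sorted([comment_sent, paper_sent], key=len)
    return max(metric(seq, pivot) for seq in windows(longer, len(pivot)))

def similarity_score_sum(comment_sents, paper_sents):
    """Sum over metrics of the best pair score between comment and paper sentences."""
    return sum(max(pair_score(c, w, metric) for c, w in product(comment_sents, paper_sents))
               for metric in METRICS)
```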

Pre-trained Models and Cost-Sensitive Machine Learning

The models for predicting feedback localization were pre-trained using the above features with the Logistic Regression algorithm implemented in Weka (Hall et al. 2009). We experimented with different learning algorithms, e.g., Decision Tree, Support Vector Machine, and observed that they yielded lower 10-fold cross validation performance than Logistic Regression. Since the effectiveness of supervised Machine Learning depends not only on the features available for prediction, but also on both the amount of training data and the similarity of training and testing data, we supplemented the training data used in our initial laboratory study with additional annotated data.

Because the 2013 deployment was going to be evaluated in a college class, we added an annotated corpus of peer feedback comments from a college Computer Science course and used these in addition to the original college History data (Nelson and Schunn 2009; Xiong and Litman 2010). In the 2014 deployment, the peer-review system was going to be evaluated in both college and high-school classes, so we added an annotated corpus of peer feedback comments collected from a high-school Literature course to the training corpus from the first deployment and retrained the model. Since we did not have access to new data from the actual academic disciplines that would be the testbeds for our deployments, we could only increase similarity in terms of whether the added training data came from college or high school classrooms. Label distributions of training data for the two deployments are shown in Table 3.

Table 3 Training data for feedback localization prediction

In addition, we biased the Machine Learning process to yield localization predictions that would trigger system feedback with high precision (rather than optimizing for feedback localization prediction performance). By system feedback precision, we mean that feedback submissions which triggered system feedback are those actually in need of localization revision, i.e., that have at least one feedback comment that is not localized. We decided that system feedback precision was more important than recall, because we thought it would be better to miss some feedback opportunities than to provide incorrect feedback (e.g., by telling a student to revise his/her feedback when all of the comments were already localized).

In designing the system feedback strategy, we first chose the localization threshold for triggering system feedback: a peer feedback submission would trigger system feedback if the ratio of its localized comments to its total number of comments was less than the threshold value. While we aimed to detect peer feedback submissions with at least one not-localized comment, our feedback localization model was not perfect, so setting the threshold to 1 triggered almost all feedback submissions, which was not desirable. Instead, we conducted a pilot study in which, for each feedback comment, we collected the label predicted by the feedback localization prediction model and the true label annotated by experts. By varying the threshold value, we observed that our feedback localization model agreed best with the human experts with respect to triggering system feedback when the threshold was set to 0.5. Therefore, we fixed a threshold of 0.5 for all later deployments (i.e., testing on entirely new data separate from the pilot study data).
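A minimal sketch of such a threshold search is shown below, assuming the pilot data is available as per-submission lists of predicted and human labels. The agreement statistic (Cohen's kappa) and the candidate threshold grid are our assumptions for illustration; the study does not specify them.

```python
from sklearn.metrics import cohen_kappa_score

def triggered(predicted_labels, threshold):
    """Trigger rule: ratio of predicted-Localized comments (1s) is below the threshold."""
    return sum(predicted_labels) / len(predicted_labels) < threshold

def best_threshold(submissions, candidates=(0.25, 0.5, 0.75, 1.0)):
    """submissions: list of (predicted_labels, true_labels) pairs, 1 = Localized, 0 = Not-localized."""
    # Human-side decision: feedback is warranted when at least one comment is truly not localized.
    needs_feedback = [any(label == 0 for label in true) for _, true in submissions]
    agreement = {
        t: cohen_kappa_score([triggered(pred, t) for pred, _ in submissions], needs_feedback)
        for t in candidates
    }
    return max(agreement, key=agreement.get), agreement
```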

Next, we implemented cost sensitive learning for our feedback localization prediction model, i.e., weighing certain types of prediction errors more heavily than others during model training using a cost-matrix. In particular, to optimize precision when triggering system feedback, we penalized predicting Localized comments as Not-localized during model training. Because our feedback localization models were trained to predict a feedback comment as Localized or Not-localized, a cost-sensitive learning paradigm required two misclassification costs as additional parameters:

  • False-Negative (FN) cost: when a Localized comment was predicted as Not-localized

  • False-Positive (FP) cost: when a Not-localized comment was predicted as Localized

In each deployment, the best values of the misclassification costs were selected through cross-validation on the training data. In the 2013 deployment, we obtained FN-cost = 5 and FP-cost = 1; in the 2014 deployment, FN-cost = 3 and FP-cost = 1. Because we did not optimize for prediction performance at the comment level, cost-sensitive learning yielded lower training performance.
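The deployed models were trained with Weka's cost-sensitive machinery; as a rough, non-equivalent analogue, the sketch below uses class weights in scikit-learn's logistic regression to make errors on Localized comments (the FN errors defined above) more expensive. The label encoding and hyperparameters are assumptions.

```python
from sklearn.linear_model import LogisticRegression

FN_COST, FP_COST = 5, 1   # 2013 deployment values; the 2014 deployment used 3 and 1

# Weighting the Localized class (encoded here as 1) by the FN cost makes the learner
# pay more for predicting a truly Localized comment as Not-localized (encoded as 0).
# This approximates, but is not a drop-in replacement for, Weka's cost-matrix training.
cost_sensitive_model = LogisticRegression(
    max_iter=1000,
    class_weight={1: FN_COST, 0: FP_COST},
)
# cost_sensitive_model.fit(X_train, y_train)  # X_train: the 11 features described above
```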

Results and Discussion

Feedback Localization Prediction Performance

At the comment level, we evaluate how well the feedback localization prediction models in the two deployments predicted the presence or absence of localization compared to the human annotations. We also compare the models’ performance to two baselines: Majority-class and Bag-of-words. The Majority-class model assigns all comments the label of the majority class, which is NOT-Localized for all comment sets, as shown in Table 2. This is expected, as most of the comments in the corpus subset used for our current annotation came from feedback submissions that triggered system feedback, which should have low localization ratios. It does not imply that not-localized comments are the majority class of our comment data overall. In fact, in our earlier work, which focused on training prediction models (as opposed to the current study, which focuses on the system feedback strategy), all of the comments in each training corpus were annotated. In those corpora we found that either localized or not-localized could be the majority class, with majority ratios varying from as low as 52 % to as high as 72 %.

Bag-of-words (BoW) is a sparse vector model of token occurrence counts in comments. We include unigrams, bigrams and trigrams as tokens because localization patterns can occur as a single word (e.g., abstract) or a multi-word expression (e.g., first two sentences). For each deployment, we trained the BoW model with the Logistic Regression algorithm on the same training data used for our deployed model. Because the cost-matrix was optimized for obtaining high system feedback precision with our proposed models, directly applying cost-sensitive learning to BoW models greatly degraded feedback localization prediction; thus, we did not train BoW models with a cost-matrix. We did, however, further improve BoW performance by adding ridge regularization to address the sparsity of the n-gram features. We did not stem words or remove stop words, as these pre-processing steps reduced training performance; in fact, stemming and stop-word removal also decreased the overall test performance of BoW on the deployment data.
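For concreteness, a scikit-learn sketch of such a baseline appears below: unigram-to-trigram counts, no stemming or stop-word removal, and L2 (ridge) regularized logistic regression. The specific hyperparameter values are illustrative, not the ones used in the study.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Bag-of-words baseline sketch: raw 1- to 3-gram counts, ridge-regularized logistic regression.
bow_baseline = make_pipeline(
    CountVectorizer(ngram_range=(1, 3), lowercase=True),   # no stemming, no stop-word removal
    LogisticRegression(penalty='l2', C=1.0, max_iter=1000),
)
# bow_baseline.fit(train_comments, train_labels)
# predicted_labels = bow_baseline.predict(test_comments)
```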

Table 4 reports the prediction performance of our pre-trained feedback localization models on peer feedback data in our two deployments. Since two of our datasets have skewed class distribution (a large majority of NOT-Localized comments), we report macro-average F1-score. The results show that our feedback localization prediction models consistently outperformed the baseline models in all four feedback comment sets with respect to all performance metrics. The F1 of NOT-Localized is higher than that of Localized, which reflects the cost-matrix used when training the models (i.e., the matrix was designed to yield high precision of system feedback) as described above.

Table 4 Localization prediction performance at the comment level of our deployed models compared to two baseline models: Majority-class (Majority) and Bag-of-words (BoW)

In addition, when comparing these results with our first reported results for localization prediction (a Kappa of 0.55 for the feedback localization model in Xiong and Litman (2010), where the model was cross-validated using a single dataset of peer reviews from a college History class), we see that even the worst performance on our two college Psychology classes is only slightly lower, despite the cross-course evaluation setting in which training and test data come from different courses/classes. That is, the model trained on data from college History and Computer Science classes yielded a Kappa of 0.46 when deployed in the Research Methods lab. This evaluation setting is more difficult than cross-validation because the feedback localization model was trained using peer feedback comments on student papers with different writing topics, and from different academic disciplines, than the test corpus of interest in this study. The best test performance was achieved during the second deployment for the Cognitive Psychology class, with a cross-course Kappa of 0.69, which is even higher than the Kappa reported in the original publication (Xiong and Litman 2010). These results show that our feedback localization models obtained high prediction performance at the comment level when deployed in college classes.

We see lower prediction performance when the model that was deployed in Cognitive Psychology was later tested on the two comment sets from the Interactive Mathematics class. However, the fact that our model yielded higher performance than the Bag-of-words baseline shows the advantages of our proposed features. The features were designed to capture linguistic styles (e.g., location phrases) and to abstract over content mentioned in the paper (e.g., domain word count, maximal window size, similarity score sum); thus, they are expected to be more domain-adaptable than generic n-grams. In fact, of the 11 features we used, the 4 location phrase features, domain word count, and regular expression tag have the largest weights returned by the Logistic Regression algorithm, which indicates that they are the most effective features.

Although both high-school and college data were used to train the peer feedback localization prediction model for the 2014 deployment, the results revealed difficulties that our model faced when tested on high-school data. An analysis of the differences in style and quality of textual comments between college and high-school students, which might cause the performance disparity among datasets, is beyond the scope of our current study. Nonetheless, the lower performance with high-school data points to the need to re-train the model with larger and better-matched data in future work, and/or to develop new features better tailored to peer feedback comments in high-school classes.

System Feedback Precision

At the feedback-submission level, we evaluate how often system feedback was triggered precisely, i.e., whether the submission that triggered system feedback was not already fully localized. Recall that in both deployments, student reviewers were encouraged to add location information to their comments (see the formative feedback messages in Figs. 1 and 2), but they did not know the localization ratio that activated the system feedback. Thus, we consider a trigger of system feedback to be precise (with respect to students’ opportunity to revise a non-localized comment) when at least one of the comments in the peer feedback submission is human-coded as NOT-Localized. We use human-annotated localization labels to compute the true localization ratio of every peer feedback submission that triggered system feedback, and label a trigger of system feedback as Incorrect if the corresponding peer feedback submission had a true localization ratio of 1, and Correct otherwise. As shown in Table 5, the peer feedback localization prediction models produced no incorrect triggers of system feedback in the Research Methods Lab and Cognitive Psychology classes. However, up to 23 % of the triggers of system feedback were incorrect in the Interactive Mathematics class, which likely reflects the lower prediction performance at the comment level.
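The submission-level metric just described reduces to a simple computation, sketched below under the label encoding used earlier (1 = Localized, 0 = Not-localized); the function name is ours.

```python
def system_feedback_precision(triggered_submissions):
    """triggered_submissions: one list of human-coded labels per submission that triggered feedback.

    A trigger counts as Correct when the true localization ratio is below 1,
    i.e., at least one comment in the submission is coded Not-localized.
    """
    correct = sum(1 for labels in triggered_submissions if any(l == 0 for l in labels))
    return correct / len(triggered_submissions)
```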

Table 5 System feedback precision

In sum, our results show that in real college classroom settings, our models predicted localization at the comment level accurately enough to, in turn, trigger system feedback with high precision. While prediction performance on high-school peer feedback data was limited at the comment level, system feedback precision was nonetheless promising, with 77 % to 83 % of system feedback triggered correctly.

As we described, the annotated data for this study were comments of peer feedback submissions that either triggered system feedback or were revised (without immediately prior system feedback). We did not code any submissions that passed the localization check (i.e., did not trigger the system feedback) and had no revision. Therefore, we obviously missed the False-Negative instances where the peer-review system accepted submissions that should have triggered system feedback. Due to time and resource limitations for the annotation work, and our focus on system feedback precision over recall, our current study evaluates the peer-review system on how well the system performed when it provided feedback on peer feedback submissions, which we thought more urgent than the dual evaluation on how accurately the system passed peer feedback submissions. A comprehensive analysis that covers all peer feedback submissions is left for future work.

Study 2: Reviewer Response to System Feedback

Introduction

In this study, we investigate whether student reviewers actually revised their comments in response to the system’s feedback, to answer Research Question 2. For this evaluation, we consider only first peer feedback submissions that triggered system feedback (see Table 2 for the number of such submissions). For each such peer feedback submission, we identify which button the reviewer clicked and whether he/she edited any comments. In addition, as the comments of those submissions were manually annotated for the presence of localization, we count the number of comments annotated as Localized. The true localization ratio of each peer feedback submission is then defined as the ratio of comments annotated as Localized to the total number of comments.

Method

A student reviewer can respond to the system’s feedback by choosing one of three buttons as shown in Figs. 1 and 2: Revise, Example and Disagree. Our classification of reviewer responses, however, does not map directly onto the buttons clicked. First, clicking the Example button shows the reviewer examples of localized comments, but does not submit the feedback. Instead, the reviewer has to go back to the system feedback interface and choose one of the other two buttons to submit the feedback. Thus we do not consider viewing examples to be a reviewer response in this study. In fact, our log data revealed a low number of Example clicks: of the 190 peer feedback submissions that triggered system feedback (see Table 2), reviewers clicked the Example button only 3 times.

Second, our formative feedback strategy did not enforce any constraint between the clicked button and whether any peer feedback comments were changed. That is, a student reviewer could change one or more comments and then click the Disagree button, telling the system that the “original comments” do not have any localization issues and so do not need to be fixed. Conversely, a reviewer could click the Revise button, telling the system that a “revision” was made, without actually editing any of his/her comments. Examining the peer feedback data, we find no comment that was edited before the reviewer clicked Disagree. However, we observe a large number of unmodified comments associated with Revise clicks. Therefore, we classify reviewer responses into three types, as follows (a schematic sketch of this classification follows the list):

  • Revise: the reviewer resubmits feedback by clicking the Revise button after revising it. This response is associated with some actual change(s) made to the feedback comments.

  • 0-Revise: the reviewer clicks the Revise button but does not change any feedback comment. This response type is effectively a disagreement, but we treat it as a separate type in this study because its prevalence may offer lessons about interface design and student behavior.

  • Disagree: the reviewer disagrees with the system feedback by clicking the Disagree button to submit his/her feedback without revision.
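
This classification can be summarized in the following minimal sketch (hypothetical data structures, not the system’s own code); a Revise click counts as an actual revision only if at least one comment text changed:

    def classify_response(button, original_comments, resubmitted_comments):
        # button: "Revise" or "Disagree"; Example clicks do not submit feedback and are ignored here
        edited = any(before != after
                     for before, after in zip(original_comments, resubmitted_comments))
        if button == "Disagree":
            return "Disagree"
        return "Revise" if edited else "0-Revise"  # clicked Revise with / without an actual change

    print(classify_response("Revise",
                            ["A lot of grammar problems"],
                            ["A lot of grammar problems in the problem statement"]))  # Revise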

Because a reviewer may open a submitted feedback and edit its comments, he/she may receive system feedback multiple times in a sequence of peer feedback submissions for the same paper. An analysis shows that Cognitive Psychology had only one system feedback triggered by a resubmission. In Research Methods, 84 % of system feedback was triggered by first peer feedback submissions. In Interactive Mathematics, system feedback on first peer feedback submissions accounted for 67 % and 86 % of all system feedback in the first and second assignments, respectively. Because the majority of system feedback occurred at the first peer feedback submission, for the present study we do not consider reviewer responses to system feedback on resubmissions of peer feedback.

Results and Discussion

Table 6 shows the counts and percentages of the different types of student reviewer response to system feedback. Regardless of whether the feedback localization prediction performance was high (in the Research Methods Lab and Cognitive Psychology classes) or lower (in the Interactive Mathematics class; recall Tables 4 and 5), student reviewers revised their comments (#Revise) less often than they disagreed with the system (#Disagree). Furthermore, in the high-school class, the number of times student reviewers clicked Revise but did not change any comments (#0-Revise) is even larger than the number of actual revisions. Despite the improvements we had made to the formative feedback interface in the 2014 deployment, we do not see the expected consistent increase in #Revise or decrease in #Disagree across deployments.

Table 6 Counts (cnt.) and percentages (pct.) of different types of reviewer responses to system feedback on first peer feedback submissions

To investigate whether student reviewers disagreed with the system feedback for good reasons (e.g., while not perfect, their feedback submissions were already highly localized), Fig. 3 reports the percentage of responses of each type with respect to different bins of true localization ratio. While a Pearson correlation test may be possible for small continuous samples given prior knowledge of the data distribution (Bland and Altman 2009), we make no assumption about the distribution of reviewer response percentage by localization ratio. Therefore, we conduct Spearman’s rank correlation tests between the percentage of responses (scaled to [0,1]) and the true localization ratio bins (10 samples), combining all peer feedback data. The tests show no significant rank correlation with true localization ratio for any type of response. Because of the small number of reviewer responses and the limited number of possible values of true localization ratio, we do not use more than 10 bins, and thus the correlation tests may be noisy. Nevertheless, as shown in the figure, student reviewers’ disagreements and unchanged revisions do not appear to relate to how well the original reviews were localized. Reviewer response percentages for the individual data sets are shown in Appendix C.
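
A minimal sketch of this test, using illustrative percentages rather than our actual data and assuming the SciPy implementation of Spearman’s rank correlation, might look as follows:

    from scipy.stats import spearmanr

    # 10 bins of true localization ratio and the (illustrative) percentage of one
    # response type falling into each bin, scaled to [0, 1]
    bins = list(range(10))
    revise_pct = [0.2, 0.0, 0.1, 0.3, 0.0, 0.1, 0.2, 0.0, 0.1, 0.0]

    rho, p_value = spearmanr(bins, revise_pct)
    print(f"Spearman rho = {rho:.2f}, p = {p_value:.3f}")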

Fig. 3 Histogram of reviewer response percentage by true localization ratio. True localization ratios are placed in 10 bins. Reviewer responses of each type are summed over the four peer feedback data sets

While we do not have a direct evaluation of the formative feedback interface improvements (with respect to the numbers of Revise and Disagree responses) from the 2013 to the 2014 deployment, the analyses of reviewer responses suggest that calling attention to peer feedback localization on the system feedback interface does not, by itself, have much impact on student reviewers. A user study will be necessary to better understand the reasons for the large #0-Revise and #Disagree counts. Moreover, our peer-review system has recently been extended with a function that checks whether a reviewer has changed any feedback comments. Based on this check, the reviewer is guided to click the appropriate button: Revise if some comment(s) were edited, or Disagree otherwise. We therefore expect that the recorded responses will more accurately represent reviewers’ intent.
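
Conceptually, this check could be sketched as follows (a hypothetical illustration only; the actual implementation in our peer-review system may differ):

    def suggest_button(original_comments, current_comments):
        # Suggest the button that matches what the reviewer actually did.
        changed = any(before != after
                      for before, after in zip(original_comments, current_comments))
        return "Revise" if changed else "Disagree"

    print(suggest_button(["Good job"], ["Good job"]))  # Disagree (no comment was changed)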

Study 3: Impact of System Feedback on Peer Feedback Revision

Introduction

This study focuses on the student reviewers who did revise their peer feedback, to shed light on Research Question 3. We investigate whether the number of localized comments in fact increases after peer feedback revision, and whether revision behavior varies depending on the presence of system feedback. As introduced in Study 1, the data for this study consist of comment pairs, each containing an original comment and its revision.

Method

We evaluate the effectiveness of the system feedback by examining the human-coded localization labels of edited comments under different edit patterns. As defined before, a reviewing session starts when a student reviewer opens the reviewing interface and ends when he/she submits the feedback successfully or closes the interface. In our 2013 deployment, a student reviewer could only submit his/her feedback successfully by either passing the feedback localization check or disagreeing with the system (by clicking the Disagree button in Fig. 1). Thus, a reviewing session might contain more than one resubmission. In the 2014 deployment, if the first peer feedback submission triggered system feedback, the resubmission was not checked again for feedback localization. Thus, a reviewer needed at most two attempts to submit his/her peer feedback and end the session. By design, in both deployments student reviewers could reopen their previously submitted feedback at any later time, which would create new reviewing sessions. We allowed this because student reviewers may get new feedback ideas after reviewing other papers, so the system gives them opportunities to revise their current peer feedback with ideas learned from their other peer feedback.
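
As a rough illustration (a hypothetical helper, not the system’s code), the difference between the two deployments in when the localization check is applied within a session could be expressed as:

    def needs_localization_check(deployment_year, attempt_number):
        # 2013: every (re)submission was checked until the reviewer passed or clicked Disagree.
        # 2014: only the first submission in a session was checked.
        if deployment_year == 2013:
            return True
        return attempt_number == 1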

System Feedback Scopes

With an emphasis on the impact of system feedback, we consider two scopes of system feedback with respect to comment edits during a reviewing session:

  • Scope=In: a reviewer receives system feedback in a reviewing session, and we consider the first peer feedback resubmission that followed the system feedback. With Scope=In we aim to study comment edits that occur immediately after system feedback, and thus provide evidence of a direct impact of the system feedback. We relax the resubmission condition so that we consider not only the immediately following resubmission in the same reviewing session, but also the resubmission in the first following reviewing session. This relaxation covers cases in which reviewers first disagree with the system feedback but then reopen their submitted feedback for revision right after.

  • Scope=Out: a reviewer has never received system feedback when submitting peer feedback on the current paper, but encountered system feedback during prior peer feedback on a different paper. While Scope=In requires that the peer feedback revision be on the same paper as the first peer feedback submission that triggered system feedback, Scope=Out requires that the revision be on a paper whose prior peer feedback submissions never triggered system feedback. By enforcing this condition, we expect to obtain evidence of the indirect impact and retention of the system feedback (from prior peer feedback on a different paper).

To illustrate the idea of system feedback scope, we use a feedback timeline to show peer feedback submissions in time order, as in Fig. 4. We set T4, when the student was reviewing paper C, as the time of interest. At prior times, the student reviewed papers A, B and C, and no other paper was reviewed by the student. Thus paper C had been reviewed twice and we have a peer feedback revision (at T4). We now consider different cases of system feedback occurrence and explain how we assign the system feedback scope for the revised peer feedback at T4.

Fig. 4 Feedback timeline of a student reviewer found in our peer feedback data. Consider time T4 at which the student was reviewing paper C, and assume that before T4 the student reviewed papers A, B and C at times T1, T2 and T3, respectively. We have a feedback revision because paper C was reviewed twice

Case 1: system feedback occurred at time T3, at which the student made the first submission of her feedback on paper C. In this case, the scope of the system feedback at T4 is In because T4 is the first peer feedback resubmission following the system feedback. The resubmission at T4 can belong to the same reviewing session as the first submission at T3 or to a separate one, depending on whether the reviewer clicked Revise to submit the revision or Disagree to confirm the first submission. If the reviewer clicked Disagree, then T4 belongs to a new reviewing session.

Case 2: system feedback occurred at time T1 or T2, and the peer feedback submission at T3 for paper C passed the feedback localization check. In this case the system feedback scope for the resubmission at T4 is Out, even if T4 triggered another system feedback. That later trigger, however, would define the scope for subsequent resubmissions of peer feedback on paper C, if any.

Case 3: none of the submissions at T1, T2 or T3 triggered system feedback. This case is not considered in our study because we hypothesize that the peer feedback revision at T4 may be due to some reason other than peer feedback localization.
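
The three cases can be summarized in the following simplified sketch (hypothetical event layout; it omits the finer session-level conditions described above, such as the resubmission relaxation for Scope=In):

    # Each event is (time, paper, triggered_system_feedback), in time order.
    def feedback_scope(events, revision_index):
        paper = events[revision_index][1]
        earlier = events[:revision_index]
        if any(e[1] == paper and e[2] for e in earlier):   # Case 1: direct impact
            return "In"
        if any(e[1] != paper and e[2] for e in earlier):   # Case 2: indirect impact (retention)
            return "Out"
        return None                                        # Case 3: excluded from the analysis

    # Timeline of Fig. 4: papers A, B, C reviewed at T1-T3, paper C revised at T4.
    events = [("T1", "A", False), ("T2", "B", True), ("T3", "C", False), ("T4", "C", False)]
    print(feedback_scope(events, 3))  # "Out"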

Comment Edit Patterns

For each value of system feedback scope, we collect all peer feedback comments that were edited in peer feedback revisions and compare each comment’s true localization label to the true label of its previous version. The edit pattern of most interest is NOT-Localized → Localized, as this is the type of successful edit that the formative feedback strategy was designed to promote. At the other extreme, the least desirable pattern is Localized → NOT-Localized, as this type of comment edit decreases feedback quality with respect to localization. In different contexts, the patterns Localized → Localized and NOT-Localized → NOT-Localized may have different interpretations, which we discuss in detail below.
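
Given pairs of human-coded labels for each edited comment and its previous version, tallying the edit patterns amounts to a simple count, as in the following sketch with illustrative label pairs:

    from collections import Counter

    # (previous version, revised version) localization labels for edited comments;
    # the example pairs are illustrative only.
    edit_pairs = [("NOT-Localized", "Localized"),
                  ("NOT-Localized", "NOT-Localized"),
                  ("Localized", "Localized")]

    pattern_counts = Counter(f"{before} -> {after}" for before, after in edit_pairs)
    for pattern, count in pattern_counts.most_common():
        print(pattern, count)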

Results and Discussion

Table 7 reports the numbers of comment pairs in the four peer feedback data sets according to the four comment edit patterns with respect to localization label. First, we observe that the total number of comment edits in the Scope=In condition (86) is much larger than in the Scope=Out condition (15) across the four peer feedback data sets. This suggests that direct system feedback had a stronger impact than indirect feedback.

Table 7 Comment edit patterns by system feedback scopes

Next we compare different edit patterns in Table 7 across peer feedback data sets. We count the total number of each edit pattern:

  • Localized → Localized: 29
  • Localized → NOT-Localized: 0
  • NOT-Localized → Localized: 35
  • NOT-Localized → NOT-Localized: 37

We observe that in a slight majority of opportunities, student reviewers failed to add location information to non-localized comments (35 NOT-Localized → Localized versus 37 NOT-Localized → NOT-Localized). Thus, we do not have evidence that student reviewers were more likely to newly localize comments given system feedback. However, the results still show promising findings. First, the least desirable pattern, Localized → NOT-Localized, did not occur in any class in either deployment. The absence of this pattern suggests that our highlighting of localized comments in green (2013 deployment) and further highlighting of the localization text within these comments in bold (2014 deployment) may have helped student reviewers avoid removing localization from their localized comments.

Second, the pattern NOT-Localized → Localized, which corresponds to the most desirable edit, contributes the second largest portion of edits over all conditions (35 %). In particular, in the Scope=Out condition reviewers made more of these successful edits than edits of any other pattern. Thus, the number of localized comments consistently increased after system feedback across the deployment data. This evidence indirectly suggests that the system’s feedback indeed gave student reviewers an opportunity to localize their previously unlocalized comments, and that the impact remained in later reviewing sessions after the system feedback was removed.

Manually examining the peer feedback comments with the edit pattern NOT-Localized → Localized, we observed that in all cases the student reviewers added section headings or quotes to localize their comments.

  • You should include your 16 trials. → You should include your 16 trials in your process section.

  • A lot of grammar problems → A lot of grammar problems in the problem statement , process and solution

  • There are some punctuation error and grammatical errors here and there. Make sure to proofread. → There are some punctuation error and grammatical errors here and there. Make sure to proofread. eg “Thesuggest that diffusion of responsibility...” This toward the end of the discussion section.

We also observed three interesting cases in which reviewers changed their comments from praise to criticism when adding localization. Because the examples in the system message encouraged localization to be associated with describing problems and proposing solutions in the reviewed papers, this kind of edit reflects an additional impact of the system feedback. In the examples below, some student reviewers used the abbreviation “pow” for the phrase “problem of the week”.

  • I think this essay is great so I’m not sure. → I think this essay is great so I’m not sure. I think you should add more to your Extension because it is a little hard to understand, but other than that, its fine.

  • You have everything that is needed in this p.o.w → You need a little bit more information on how you got your answer in your solution

  • Very precise and replicable. Every aspect of the study is addressed → Very precise and replicable [...] Some details, like “manila folder” and “five-foot radius” may be too intricate and unnecessary, nonetheless they do not delineate from the vital information.

We manually examined all 29 comment edits of the Localized → Localized pattern and observed that 13 of these edits (45 %) introduced additional localization in the revised comments. All of the new location information added in the revisions is in the form of paper section expressions (e.g., your problem statement, first paragraph) or quotes (e.g., “Population Density and its Effect on Willingness to Help”. Above or below or instead of that heading). The other 16 edits (55 %) added general judgments or suggestions (e.g., Finish your table and this would be perfect, you need to have step by step on how you get the answer).

Of the 37 edits of type NOT-Localized → NOT-Localized, 15 (41 %) showed evidence of attempts to localize comments that failed because the added information was too general or vague to be annotated as localized (e.g., add more in the statement and process in the middle, any sections, throughout your whole POW). These findings for the Localized → Localized and NOT-Localized → NOT-Localized patterns suggest that student reviewers did intend to revise their comments with respect to localization given the system’s feedback (both direct and indirect), but often did not localize the comments effectively, adding only coarse-grained localization or overly vague information.

General Discussion

Automated Formative Feedback Strategy for Increasing Peer Feedback Localization

We have presented our research on developing and evaluating a web-based peer-review system that provides interactive and formative feedback regarding peer feedback localization immediately to peer reviewers (i.e., at the time of peer feedback submission) whenever their written peer feedback comments are evaluated to be of low quality with respect to localization. We also presented the results of an extrinsic evaluation that examined both the immediate and retention effects of the system’s feedback on reviewers’ revision behavior, using data from all classroom deployments of our system.

Our comment-level evaluation results showed that our peer feedback localization prediction models outperformed Majority-class and Bag-of-words baselines, with absolute performance levels approaching prior laboratory results (Xiong and Litman 2010) on the peer feedback data from the college classes. However, prediction performance was weaker for the high-school class. Nonetheless, our feedback submission-level results demonstrated that, with an optimization favoring precision over recall in triggering system feedback, the peer feedback localization model for the high-school data obtained high system feedback precision, ranging from 77 % to 83 %, despite the limited prediction performance at the comment level.

Analyzing reviewer responses to the system’s feedback, we observed a large number of student disagreements with the system’s suggestion to increase localization, as well as a large number of self-reported peer feedback revisions that actually contained no edits. Despite the changes we had made to the formative feedback interface to better call reviewers’ attention to peer feedback localization, the numbers of disagreements and unchanged revisions remained substantial in both deployments. Moreover, these two types of responses appeared unrelated to the actual (gold-standard) percentage of localization in the peer feedback.

In addition, we found that for reviewers who revised their peer feedback after the system’s feedback, the number of comments with localization increased after editing. Moreover, the system feedback appeared to improve localization even in later reviewing sessions that did not trigger system feedback. This gives us hope that interactive and formative system feedback such as that proposed here may facilitate robust learning and improve students’ feedback localization skills even after the system feedback is removed. However, the results also showed that our current approach could be further improved, as there were many unsuccessful attempts to localize comments even when edits were made after the system feedback.

Limitations

As discussed in Study 1, a major limitation of our present research is that we annotated only feedback comments from peer feedback that triggered the system’s feedback. Thus, we do not have evidence regarding how often our formative feedback strategy accepted peer feedback that should instead have triggered system feedback. If a future study annotates comments from all peer feedback, we can optimize a peer feedback localization prediction model for the F1 score of system feedback rather than its precision, and explore the educational impact of doing so. The formative feedback strategy could then target peer reviewers who need different types of help with feedback localization.

Furthermore, our formative feedback strategy was not deployed at a large scale, which in turn did not give us enough data to test the significance of our findings. First, because of the small number of student reviewer responses to the system’s feedback, we could not conduct reliable correlation tests for a relation between peer feedback localization ratio and the likelihood of each type of student response (Study 2). Second, while student reviewers were more likely to add location information to their non-localized comments when the system feedback scope was Out, the number of comment edits was too small to conclude that this finding is significant and consistent across classes. We hope that additional deployment data will confirm the lack of a relation between student reviewer disagreement with system feedback and the peer feedback localization ratio, and will demonstrate that our formative feedback strategy has a significant long-term impact on increasing peer feedback localization even when system feedback does not later occur.

Implications for Practice and Future Work

We believe that the results regarding the role of interactive and formative system feedback for peer feedback localization could encourage research on improving other measures of peer feedback quality, such as feedback solution and justification, especially given recent work on prediction models for a variety of peer feedback characteristics, e.g., Xiong et al. (2012), Nguyen and Litman (2014), Ramachandran and Gehringer (2015), and Ramachandran et al. (2016). Moreover, the experimental evidence of the indirect impact of system feedback makes it worthwhile to compare and combine interactive feedback with summative feedback.

A notable finding from our research is that student reviewers seemed to disagree with system feedback no matter how well their peer feedback comments were localized. This suggests that student responses to system feedback may be driven not by the localization ratio of their peer feedback but by other factors not accounted for in our present research. To better help student reviewers improve their reviews, educators and researchers need to improve both feedback characteristic prediction performance and the formative feedback methods that direct student reviewers to revise their comments.

One of the next steps for our research is to conduct a larger deployment of the automated formative feedback strategy for peer feedback localization, and to annotate comments from peer feedback that both does and does not trigger system feedback. The main goal is to address the limitations of our current studies and to look for stronger evidence of the impact of our formative system feedback on improving peer feedback localization. There are also several other future directions worth investigating.

We plan to improve both the feedback methods used in the system feedback interface and the feedback localization prediction model, to provide more accurate predictions and to better help students localize their feedback comments when they do receive feedback from the system. For example, depending on the feedback prompt, not all feedback comments can or should be localized, so we would like to incorporate feedback prompt analysis into feedback localization prediction. When comments have associated numeric ratings, this information might also be incorporated into localization prediction. Our current analysis assumes that the training instances are conditionally independent given the features; removing this assumption is another area for future investigation (Goldin 2012; Piech et al. 2013; Waters et al. 2015). We also plan further annotation to examine not only whether, but how strongly, a comment is localized. This will provide a more nuanced way to quantify student learning with respect to writing localized comments.

In addition, we plan to interview student reviewers about why they disagreed with the system feedback, as our initial analyses did not show any relationship with true localization. Our analysis of system feedback scope also needs to be followed up with a controlled study, to determine how much of the unprompted localization editing was due to retention of system feedback.

Finally, we are currently extending our approach to provide formative feedback to reviewers on measures of peer feedback quality beyond localization. In particular, we have developed a strategy for providing formative feedback regarding the presence of solutions in peer feedback comments that critique a paper. That formative feedback strategy had a pilot deployment in a new classroom with high-school students (Nguyen et al. 2016). Incorporating localization and solution feedback into a single system will also allow us to investigate whether providing feedback on peer feedback localization has indirect benefits, such as leading a reviewer to also improve his/her peer feedback with respect to solutions, and vice versa.