
1 Introduction

Learning rubrics are scoring guides constructed of descriptors (or evaluative criteria) that establish the specifications to be assessed. These criteria should align with the formative objectives [1]. Rubrics are sometimes called evaluation tables, as they are typically arranged in a table format.

One of the main objectives of a rubric is to standardize and accelerate the evaluation process by highlighting the most relevant aspects of the subject matter. However, many authors claim that rubrics should go beyond assessment, as their continuous development has made them useful for informing and motivating the assessed subjects.

Rubrics can be classified according to different criteria [2]. Holistic rubrics offer a global view of the evaluation, whereas analytic rubrics provide a detailed view of different items. Similarly, general rubrics can be used for an entire course, while task-specific rubrics may focus on one particular assignment or project. Summative rubrics produce a final global score, separating subjects who pass from those who fail the evaluation. Formative rubrics provide performance feedback by conveying information about the strengths and weaknesses of the subjects [3]. The research reported in this paper focuses on analytic task-specific formative rubrics.

Formative rubrics help assessed subjects determine their own progress throughout the training period. Since different formative levels have different needs, these rubrics must be designed, formatted, and applied based on their specific purpose and context of use. In particular, we are interested in self-evolving rubrics that can adapt automatically to the learning pace of each student.

This paper examines the design and use of formative rubrics in higher education, where rubrics for specialized content are common [4]. The use of computer technologies and formats is discussed as a means to maximize the benefits of adaptable and adaptive rubrics. To this end, certain characteristics such as criteria dichotomization, weighted evaluation criteria to provide various levels of importance, and go/no-go criteria have been linked to new strategies that adjust to the learning pace of each student. More specifically, we describe the use of the go/no-go criteria in combination with dichotomization and weighted criteria as a strategy to better guide the learner.

The paper is structured as follows: in the next section, a brief review of the state of the art in rubrics is provided as a set of commonly accepted lessons learned. Next, we examine the concepts of dichotomization and levels of importance in evaluation criteria and discuss strategies on how they can be used to control the student’s assimilation pace. Two types of go/no-go criteria are studied and validated in a computer modeling context. Finally, general recommendations are discussed based on the results of two experimental studies.

2 State of the Art

The state of the art on rubrics development can be summarized as a set of commonly accepted lessons learned. Rubrics can manage complex evaluation scenarios by defining subsets of homogeneous criteria called dimensions. In this regard, clustering techniques can be used to group descriptors into relatively homogeneous natural groupings. It has been stated that “dimensions are useful to work with hierarchical rubrics, which organize criteria in different levels” [5]. Dimensions are also useful to work with complex assessments, which evaluate heterogeneous criteria [6].

Each criterion is evaluated by measuring the degree of compliance or achievement level of a particular situation. The achievement levels should be stated using the same terminology as the corresponding criteria. It is important to use a consistent scale throughout all achievement levels and avoid mixing positive and negative scales [7]. The number of achievement levels may vary depending on the situation. Although in some cases the level of compliance may be determined dichotomously, whether or not a criterion is met is generally measured through a finite set of levels that discretize a continuum. A commonly used system for establishing discrete levels is based on Likert elements, usually with five achievement levels or points [8]. Rohrmann [9] states that category scaling enhances the usability of assessment instruments and that well-defined qualifiers facilitate unbiased judgments.

While objective scoring is difficult to achieve, especially when self-assessing, achievement level categories provide unambiguous scales to properly rate quality. When providing performance scores to students, a preferred strategy involves moderate leniency, so confidence can be gradually built. Instead of penalizing students for each individual mistake, a proper assessment perspective may involve viewing those instances in a larger context to better determine whether they are significant enough to prevent awarding the highest rating.

Rubrics are designed to homogenize evaluation processes. A common strategy towards this end is to complement rubrics with anchors, i.e., written descriptions, examples, or work samples that illustrate the various levels of achievement [10]. Descriptions of good practices should be integrated into formative rubrics and used on demand, so the user of the rubric can be guided throughout the process of determining which criterion to check, how to check it, and the importance of the criterion in the overall result. Good practices are particularly important in distance and self-paced training courses, where the instructor is not always available.

An important consideration when designing a rubric is the type of format. Static formats can be easily implemented but cannot provide feedback to the user, so they are only suitable for summative assessment. Formative assessment requires rubrics that can adapt. In this sense, formative rubrics must be dynamic [6].

Dynamic formats process information to provide feedback to the user, and can be adapted to specific cases and users with different levels of expertise. Two types of dynamic rubrics can be distinguished. Rubrics are adaptive if the instructor can design different rubrics for different stages of formation [11]. They are adaptable if users can vary the level of detail interactively to adapt the rubric to their learning rhythms [6]. Clearly, implementing these functionalities requires the use of electronic rubrics (e-rubrics). To this end, dedicated applications to manage rubrics are needed, since standard tools such as spreadsheets are not fully adequate [12]. In the next sections, strategies to design adaptable and adaptive rubrics are discussed.

2.1 Adaptable Rubrics

Rubric criteria must be arranged according to gradually increasing levels of detail, so every user has the opportunity to select the level that best adapts to his/her understanding, thus optimizing the formative action. According to Company et al. [6], an effective strategy to accomplish this functionality with e-rubric tools is to allow the user to “fold” and “unfold” the level of detail of the criteria interactively. A basic rubric with hierarchical levels of criteria is illustrated in Fig. 1. All rubric criteria are shown unfolded, as interactivity cannot be shown in a static image.

Fig. 1. Example of a rubric with criteria showing two levels of detail (unfolded view). Italicized numbers in column “weight” represent criteria weights

Dichotomous criteria can be defined when assessing simple aspects of a task, but also when itemizing the assessment into large numbers of criteria. Therefore, level decomposition indirectly favors the dichotomization of the assessment process.

Similarly, the role of weights as focal pointers is essential. Disagreements between the student’s and the instructor’s perceptions naturally reveal discrepancies in how the quality of the evaluated task is judged. Therefore, making the perceived importance of each criterion explicit helps students focus on “what counts” (those criteria that the instructor wants to prioritize). Additionally, adjusting weights throughout the training period is a method to shift focus from criteria that are already achieved to those that require a longer maturing time.
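To make the interplay between folding and weights more concrete, the sketch below shows one possible way to represent a hierarchical, weighted rubric in code. It is a minimal illustration under our own assumptions, not the implementation behind Fig. 1; the class, field, and criterion names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Criterion:
    """A rubric criterion; leaves hold a score, parents aggregate their children."""
    name: str
    weight: float                      # relative importance (e.g., 0.30 for 30%)
    score: Optional[float] = None      # achievement level in [0, 1]; None if unscored
    children: List["Criterion"] = field(default_factory=list)
    folded: bool = True                # sub-criteria stay hidden until unfolded

    def value(self) -> float:
        """Return the leaf score, or the weighted average of the children."""
        if not self.children:
            return self.score if self.score is not None else 0.0
        total = sum(c.weight for c in self.children)
        return sum(c.weight * c.value() for c in self.children) / total

    def show(self, indent: int = 0) -> None:
        """Print the criterion; descend only if the user has unfolded it."""
        print("  " * indent + f"{self.name} (weight {self.weight:.2f})")
        if not self.folded:
            for c in self.children:
                c.show(indent + 1)

# Example: one top-level criterion unfolded into two dichotomous sub-criteria
clarity = Criterion("Clarity", 0.15, children=[
    Criterion("Standard views are used", 0.5, score=1.0),
    Criterion("Annotations do not overlap", 0.5, score=0.0),
])
clarity.folded = False   # the student chose to see the detailed level
clarity.show()
print(f"Contribution to total score: {clarity.weight * clarity.value():.3f}")
```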

Finally, all levels must be described using the same terms as the main criterion, but each level is characterized by appropriate qualifiers for each type of attribute. The attributes are the characteristics underlying the criteria, such as, for example, frequency, intensity or probability. Other authors such as Rohrmann [9] established achievement levels based on different attributes.

2.2 Adaptive Rubrics

To benefit from the formative nature of rubrics and accommodate the student’s learning pace, general concepts are typically revealed gradually, as successive rubrics throughout a course. Alternatively, an unfold/fold strategy can be adopted where only low-level (unfolded) criteria are shown at the beginning of a course, and versions displaying higher-level criteria are introduced only to students who have already mastered the low-level criteria. Consequently, the unfold/fold strategy is useful both for the student, who can adapt the criteria to her optimal level of understanding, and for the instructor, who can make criteria less specific throughout the learning process as needed.

For a more accurate adjustment to a user’s level, an e-rubric tool must allow the configuration of gates, i.e. the different options that are available to the user based on his/her previous responses to particular criteria, so the rubric can automatically update with new and more complete content every time the user reaches a certain level of achievement in a previous (more basic) rubric. For example, a rubric can be configured so that only when a user obtains a minimum score of 8 points out of 10 in lesson 2, will the contents of lesson 3 be available. Another possibility is to make the system display messages that reinforce learning based on the results achieved. If the score does not reach a minimum threshold in certain questions, an automatic message can inform the student to review the particular lesson(s) related to those contents.
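As an illustration of how such gates might be configured, the following sketch mirrors the two behaviors just described (unlocking lesson 3 when lesson 2 reaches 8 out of 10 points, and issuing a review message below a minimum threshold). The function name, score scale, and review threshold are assumptions made for the example.

```python
def apply_gates(scores: dict, unlock_threshold: float = 8.0,
                review_threshold: float = 5.0) -> dict:
    """Decide which lessons to unlock and which review messages to display.

    `scores` maps lesson names to rubric scores out of 10 (hypothetical scale).
    """
    feedback = {"unlocked": [], "messages": []}

    # Gate: lesson 3 becomes available only after lesson 2 reaches the threshold
    if scores.get("lesson 2", 0.0) >= unlock_threshold:
        feedback["unlocked"].append("lesson 3")

    # Reinforcement: below a minimum score, suggest reviewing the related lesson
    for lesson, score in scores.items():
        if score < review_threshold:
            feedback["messages"].append(
                f"Your score in {lesson} is below {review_threshold}; "
                f"please review the related contents.")
    return feedback

print(apply_gates({"lesson 1": 9.0, "lesson 2": 8.5}))
print(apply_gates({"lesson 1": 4.0, "lesson 2": 6.0}))
```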

Gates provide active and customized feedback depending on user progress. The instructor has adaptive control over the gates from one stage to another, whereas users have adaptable control of the information needed to complete each stage. Therefore, adaptive rubrics must be coordinated with the lesson plans.

Figure 2 shows a sample course schedule with the plan for introducing lower-level criteria over a period of six weeks (following a bottom-up approach). These low-level criteria will later be replaced by more general criteria. Note that knowledge and/or procedures may become exclusionary in the last weeks (indicated with an “X”). In the next section, the use of go/no-go criteria is discussed as a means to implement this behavior.

Fig. 2. Example of schedule as a basis for an adaptive rubric system. “X” indicates exclusionary criteria.

3 Go/No-Go Criteria

Go/no-go criteria are defined as exclusionary conditions inside descriptors. They establish basic conditions that must be met. Otherwise, the evaluation process is interrupted and the final score will be zero, regardless of other criteria. These go/no-go criteria must be explicitly identified. For example: “If the deliverable document contains many spelling mistakes, evaluation does not continue”. Figure 3 shows a hard go/no-go criterion, placed at the beginning of the rubric to avoid unnecessary assessments.

Fig. 3. Example of rubric with a go/no-go criterion and a threshold parameter

An alternative form of go/no-go is to establish a threshold parameter for pass/fail. An example of such a soft go/no-go criterion embedded in the final score (considering that after 10 errors the grade becomes zero, regardless of other criteria) is shown in Fig. 3. As a result, catastrophic failures result in a no-go, while moderate failures reduce the final score but do not prevent assessing other rubric dimensions. This soft go/no-go is a recommended alternative for academic scoring: it highlights critical failures while avoiding unnecessarily punitive exam scores (so that maximum partial credit can still be awarded). By contrast, hard go/no-go criteria clearly send the message that some failures are unacceptable for fully trained students.
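The difference between the two forms can be summarized in a short scoring routine. The sketch below is only illustrative: the ten-error threshold follows the example given above, while the dimension names, weights, and the linear penalty are assumptions.

```python
def rubric_score(dim_scores: dict, weights: dict, hard: bool,
                 valid: bool = True, error_count: int = 0,
                 error_threshold: int = 10) -> float:
    """Combine weighted dimension scores (all in [0, 1]) with a go/no-go check."""
    base = sum(weights[d] * dim_scores[d] for d in weights)

    if hard:
        # Hard go/no-go: failing the exclusionary criterion zeroes the final score
        return base if valid else 0.0

    # Soft go/no-go: errors progressively reduce the score; at the threshold
    # (here, 10 errors) the grade becomes zero regardless of other criteria
    if error_count >= error_threshold:
        return 0.0
    return base * (1.0 - error_count / error_threshold)

weights = {"complete": 0.4, "consistent": 0.3, "concise": 0.3}   # hypothetical
scores = {"complete": 0.75, "consistent": 1.0, "concise": 0.5}
print(rubric_score(scores, weights, hard=True, valid=False))      # 0.0
print(rubric_score(scores, weights, hard=False, error_count=4))   # 0.45
```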

3.1 Experimental Evaluation of Go/No-Go Criteria

The goal of our study was to validate the hypothesis that neither the use of hard nor soft go/no-go criteria affects the ability of students to self-assess their work; in other words, that the strong warning message sent by this type of criterion does not prevent students from attending to the other criteria.

To this end, two pilot experiments were conducted (mid-term and final exam) to assess student understanding of CAD assemblies. These examples represent complex evaluation cases, where the following quality dimensions of CAD models/assemblies [5] were used:

1. Models are valid if they can be accessed successfully by suitable software with no errors or warnings.
2. Models are complete if all necessary product characteristics are provided for all design purposes.
3. Models are consistent if they do not crash during normal design exploration or common editing tasks.
4. Models are concise if they do not contain any extraneous (repetitive or fragmented) information or techniques.
5. Models are clear and coherent if they are understood at first glance.
6. Models are effective if they convey design intent.

Undergraduate students (beginning CAD users) at a Spanish university were introduced to a prototype system of assembly rubrics after being exposed to parts rubrics earlier in the semester. Detailed explanations of the assembly rubric dimensions were discussed and provided to the students prior to their exams. This material included thorough descriptions of the definition and significance of the quality dimensions as well as clarifications of the detailed criteria used to measure the degree of accomplishment of such dimensions. The five achievement levels were defined as: No/Never, Almost Never/Rarely, Sometimes, Almost Always/Mostly, Yes/Always [9] and quantified as 0, 0.25, 0.5, 0.75 and 1, respectively.
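For reference, this quantification of the achievement levels can be written as a simple lookup; the function name is ours, while the labels and values are those listed above.

```python
# Five achievement levels and their numeric values, as described above
LEVELS = {
    "No/Never": 0.0,
    "Almost Never/Rarely": 0.25,
    "Sometimes": 0.5,
    "Almost Always/Mostly": 0.75,
    "Yes/Always": 1.0,
}

def quantify(level: str) -> float:
    """Translate a rater's qualitative answer into its numeric score."""
    return LEVELS[level]

print(quantify("Almost Always/Mostly"))   # 0.75
```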

Completion of the rubrics was required, and a rubric was considered correct if it matched the evaluation (taken as ideal) of the primary instructor (Instructor 1). Instructor 1 was the professor of record for the course; Instructor 2 was a faculty member at a different institution whose sole responsibility was to assess the student work.

While the system was primarily developed to assess CAD model quality, the rubric itself can be assessed in terms of ease of understanding and use (which is an underlying research hypothesis). If a rubric is clearly understood, each rater (instructor and student) should produce similar assessments.

Experiment 1

For this experiment, students were required to assemble a fitness equipment pulley (Fig. 4, right) using four non-standard parts (previously modeled, as displayed in Fig. 4, left) and various standard parts.

Fig. 4. Assembly and non-standard parts (left) used in Experiment 1.

The original intent, as explained to students, was to treat validity as a hard go/no-go switch, so the assembly would fail the assessment if the linked files could not be located or used. Thus, weights were assigned as follows: valid (a hard go/no-go criterion that multiplies the overall score obtained from the remaining rubric dimensions), 0%; complete, 20%; consistent, 30%; concise, 20%; clear, 15%; and design intent, 15%.

Students were provided the solution after submitting their exams in order to self-assess their performance against an ideal solution. Although students were informed that Dimension 1 (validity) would be a hard go/no-go criterion (i.e., failure to submit a valid file would result in a non-passing grade for the exam), a soft go/no-go criterion was enforced by instructors (with up to half-credit being awarded to avoid unnecessarily punitive scoring).
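For illustration, the following sketch applies the Experiment 1 weights with validity acting as a multiplier on the remaining dimensions. The student scores are invented, and the way the half-credit cap is modeled is one possible reading of how the instructors enforced the soft criterion, not a description of their actual grading procedure.

```python
# Weights from Experiment 1; validity multiplies the weighted sum of the rest
WEIGHTS = {"complete": 0.20, "consistent": 0.30, "concise": 0.20,
           "clear": 0.15, "design_intent": 0.15}

def experiment1_score(validity: float, dims: dict, hard: bool) -> float:
    """Return a score in [0, 1]; achievement levels are 0, 0.25, 0.5, 0.75 or 1."""
    base = sum(WEIGHTS[d] * dims[d] for d in WEIGHTS)
    if hard:
        # Announced rule: an invalid file fails the whole exam
        return base if validity == 1.0 else 0.0
    # Enforced rule (one possible reading): validity scales the score, but the
    # multiplier never drops below 0.5, so up to half credit can still be awarded
    return base * max(validity, 0.5)

dims = {"complete": 0.75, "consistent": 1.0, "concise": 0.5,
        "clear": 0.75, "design_intent": 1.0}
print(experiment1_score(validity=0.0, dims=dims, hard=True))    # 0.0
print(experiment1_score(validity=0.0, dims=dims, hard=False))   # half credit kept
```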

A total of fifty students took the mid-term exam, but only forty-six submitted the self-assessment rubrics. Table 1 summarizes the results of the experiment (reported in full in [13]) and shows the inter-rater reliability scores for the mid-term exam (for the students and both instructors). As described by Otey [13], the experiment demonstrated stronger agreement between instructors than between either instructor and the students, for all dimensions. Agreement between instructors and students was obtained for the dimensions of validity, completeness, and consistency, whereas agreement was weak for conciseness, clarity, and design intent.

Table 1. Inter-rater reliability scores for midterm exam

Experiment 2

Following a procedure similar to that of the first experiment, the final exam required assembling a mechanism. More specifically, students were asked to assemble a mechanical filter using four non-standard parts (previously modeled) and assorted standard parts. The assembly is shown in Fig. 5.

Fig. 5. Two section views of the filter assembly used in Experiment 2

Fifty-one students submitted self-assessment e-rubrics using our system. The students were assessed on assembly sequence and the use of sub-assemblies. This time, however, the students were informed that Dimension 1 (validity) would be assessed as a soft go/no-go criterion. As an example, a validity score of 0.5 would result in the remaining criteria receiving half value.

The inter-rater reliability scores for the final exam (for the students and both instructors) are shown in Table 2. Once again, there is greater agreement between the instructors than between instructors and students. There is moderate to strong agreement for Dimension 1, both between the instructors and between instructors and students. There is strong agreement between instructors for Dimensions 1, 2, 4, and 5, and little agreement between instructors and students for any dimension other than validity. There appears to be no measurable increase in agreement between the mid-term and the final exam for any dimension other than validity, for either instructors or students. The lack of improvement could be attributed to time: with only three weeks between exams, many students may not have had enough time to grasp missed concepts and improve their performance.

Table 2. Inter-rater reliability scores for final exam

Comparing Tables 1 and 2, a slight increase in agreement can be observed for some dimensions over time, which we speculate could be due to the previous exposure to the rubric. However, there are no significant changes in inter-rater reliability. Thus, we can validate the hypothesis that switching from hard to soft go/no-go criteria does not affect the ability of students to self-assess their work, since the similarities between the raters’ assessments (instructors and students) imply that the rubric is clearly understood.

Ideally, it would be useful to determine whether the correlation for each dimension significantly improved or decreased. However, since the r-value is bound between 0 and 1, it is exceedingly difficult to draw meaningful conclusions on this matter. A linear relationship between the correlation values cannot be assumed, and even if the change in correlation values were significant, it is not clear that it would be consequential. Even with perfectly defined rubric dimensions, it is impossible to remove all subjectivity, which may cloud any definitive judgment; in such cases, only the professional expertise of the investigator can guide those determinations. Regardless of this lack of statistical certainty, a pronounced general pattern emerges that reflects a positive directional improvement for a majority of rubric dimensions (both between instructors and students, and between instructors).
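Although the exact reliability statistic is not detailed here, a per-dimension comparison of this kind can be sketched as follows; Pearson's r is used purely as an illustration, and the ratings are invented.

```python
# Illustrative inter-rater comparison for one rubric dimension.
# Pearson's r is used only as an example statistic; the ratings are invented.
from scipy.stats import pearsonr

instructor = [0.75, 1.00, 0.50, 0.75, 1.00, 0.25]   # one value per student
student    = [1.00, 1.00, 0.25, 0.75, 0.75, 0.50]

r, p = pearsonr(instructor, student)
print(f"Inter-rater correlation: r = {r:.2f} (p = {p:.3f})")
```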

Finally, it is worth investigating in the future whether the clearly increased agreement for Dimension 1 is due only to previous exposure, or to the fact that explicitly declaring the go/no-go criterion as soft frees students from the possible fear of acknowledging the failure, since it is no longer catastrophic.

4 Conclusions

Rubrics, particularly summative rubrics, are mainly used to standardize and facilitate evaluation processes. However, rubrics can also become formative tools to convey performance information to students. This paper revisited the concept of rubrics to further extend some aspects related to making formative rubrics more adaptable and adaptive: criteria dichotomization, weighted evaluation criteria, and go/no-go criteria.

Formative rubrics should be discussed as learning content is presented, not after teaching. They should also adapt to the learning rhythm of students. Depending on who controls the adaptation, two types of dynamic rubrics are distinguished: rubrics are adaptable if users can interactively vary the level of detail of the criteria, and they are adaptive if the instructor can design different rubrics for different stages of formation. E-rubrics allow incorporating dynamic tools for adaptable and adaptive rubrics. The possibility to unfold or fold the level of detail of the criteria, and to query the anchors associated with the levels of attainment, allows students to obtain an improved understanding of the matter when needed.

Additionally, adaptive rubrics allow instructors to introduce the evaluable concepts progressively. Therefore, adaptive rubrics should be coordinated with the teaching guide. Timetables are suggested to plan the pace at which lower-level criteria are introduced during the first weeks (following a bottom-up approach) and later replaced by more general criteria, whose knowledge may become exclusionary (by way of go/no-go criteria) at the end of the instructional period.

Go/no-go criteria (where a failure in one criterion is so critical that it prevents analyzing other aspects of the subject’s performance) are recommended, but they must be explicitly identified, and included as such, in the descriptor. Finally, go/no-go criteria can become soft by including a threshold parameter (e.g., after ten errors, the assigned grade becomes zero, regardless of whether other rubric criteria are satisfied).