Introduction

so difficult is the task, that it is tempting to conclude that maybe it would be better to try to avoid student modelling all together, to search for some magical “end run” around the need to understand the learner at all. Unfortunately, whether the tutoring system is an “old fashioned” present-and-test frame-based tutor, a deeply knowledgeable AI-based expert advisor, or a scaffolding environment situated in the learner’s world, it must adapt to the learner or be forever condemned to rigidity, inflexibility, and unresponsiveness. (Greer and McCalla 1993, p. viii)

This statement appears in the preface to a collection of papers on student modelling edited by Jim Greer and Gordon McCalla entitled, “Student Modelling: The Key to Individualized Knowledge-Based Instruction,” which stems from a 1991 NATO workshop by the same name. It captures the editors’ resolute stance that some form of student modelling is an essential ingredient of an adaptive tutoring system. Much of Jim’s work with his colleagues showed developers of intelligent tutoring systems (ITSs) that “the problem of student modelling” (p. v) is manageable, despite being enormously difficult. Promising results from user studies of the ITSs that they developed, most of which were driven by Bayesian student models, proved this challenge to be worth tackling (e.g., Zapata-Rivera and Greer 2004b). These results inspired the next generation of ITS developers, including the authors of this paper, to develop new approaches to student modelling, and/or to implement student modelling in tutors for students of various ages, across subject domains. [See Pavlik et al. 2013 for a review.]

Most ITSs address problem-solving domains. These tutors rely on student modelling to achieve macro- and micro-adaptation, according to the two-loop model of tutoring behavior that VanLehn (2006) proposed—that is, to choose appropriate tasks for a student (macro-adaptation) and to provide feedback and support, as needed, at each task step (micro-adaptation). Many student model guided ITSs have shown promising learning outcomes (e.g., Aleven et al. 2016a; Conati and Kardan 2013; Desmarais and Baker 2012; Mitrovic 2012; Pavlik et al. 2013; Shute 1995). Some ITSs, such as the Cognitive Tutors, have been providing effective instruction to thousands of students (e.g., Blessing et al. 2009; Koedinger and Corbett 2006). However, there is one genre of ITSs that has lagged behind problem-solving tutors with respect to its use of student modelling to drive adaptive instruction: tutorial dialogue systems (TDSs). TDSs engage students in natural-language conceptual discussions with an automated tutor. (For example, see Table 1 and Fig. 1.)

Table 1 A sample tutorial dialogue in Rimac
Fig. 1

Rimac tutoring system interface. Problem statement shown in upper left pane, worked example video in lower left pane, and dialogue excerpt in right pane

Student modelling’s low profile in tutorial dialogue systems, relative to problem-solving ITSs, is not surprising, given that TDSs add the challenge of natural-language understanding to two well-established student modelling problems: the inherent uncertainty of the model’s assessments and the fact that students’ understanding evolves as they interact with a tutoring system. The student model must constantly be updated to keep pace with this “moving target” (Greer and McCalla 1993). In addition, it is not straightforward to apply the student modelling approaches used in problem-solving ITSs to TDSs, due to characteristics intrinsic to tutorial dialogue. First, pairs of tutor-student dialogue turns do not always represent a step during problem solving. Instead, some dialogue turns may contribute relevant background knowledge and thus may require a finer-grained representation in order to track students’ knowledge. Second, conceptual discussions do not neatly map to the structured steps typical of problem solving, as illustrated in Tables 1 and 2. (See also Fig. 1.)

Table 2 Directed Line of Reasoning that underlies the reflective dialogue shown in Table 1

Despite these challenges, student modelling is just as important for TDSs as it is for any other tutoring system, for the reasons that Greer and McCalla (1993) stated: to thwart “rigidity, inflexibility, and unresponsiveness” (p. viii). This paper describes work that we have been doing to more tightly couple student modelling with automated tutorial dialogue than has been done in most TDSs, with the goal of making these tutors more adaptive and efficient. Our motivation stems largely from classroom-based studies that we conducted while developing Rimac, a prototype tutorial dialogue system for conceptual physics. Rimac aims to enhance high school students’ understanding of concepts associated with quantitative physics problems that they solve on paper (Footnote 1). (See Table 1 and Fig. 1.) Several studies examined the relation between particular tutoring strategies, student characteristics and learning, in order to derive adaptive tutoring policies (decision rules) to implement in Rimac (e.g., Jordan et al. 2015a; Jordan et al. 2015b; Jordan et al. 2018; Jordan et al. 2012; Katz and Albacete 2013; Katz et al. 2016; Katz et al. 2018). We consistently found significant pretest to posttest learning gains, regardless of which condition students were assigned to—that is, which version of a tutoring strategy or policy students experienced (e.g., a high vs. low frequency of restatement, different types of summarization, different ways of structuring remediation). In addition, several studies revealed interesting aptitude-treatment interactions (Jordan et al. 2015a, 2018; Katz et al. 2016). However, during informal interviews and on user surveys, many students complained that Rimac’s dialogues are “too long.” More concerning was feedback indicating that the tutor was insufficiently adaptive and inefficient—specifically, that Rimac’s dialogues often spend too much time on concepts that the student understands and too little time on concepts that the student has been struggling with. Other TDS developers have reported similar user feedback (e.g., Kopp et al. 2012).

We realized that the only way to address the problems of inefficiency and inadequate personalization in TDSs would be to model students’ understanding of the content addressed during discussions with the tutor and to use these models to guide the system in making adaptive decisions. In other words, we decided to take up the torch that developers of ITSs without natural-language interaction have long been carrying, by assigning student modelling a more prominent role in Rimac than it has had in most TDSs.

The remainder of this paper proceeds as follows. Section 2 illustrates the need to implement student modelling within tutorial dialogue systems. Section 3 describes Rimac and explains how the tutor uses a student model to drive adaptive scaffolding during dialogue. Section 4 summarizes classroom-based studies that addressed the following questions: (1) To what extent is the substantial effort required to incorporate student modelling in TDSs worthwhile, with respect to learning gains and efficiency? and (2) How important is it to dynamically update a student model to drive adaptive tutoring, as opposed to using a static student model that is initialized based on students’ pretest performance but not updated further (Albacete et al. 2017a; Jordan et al. 2017)? Section 5 discusses limitations of Rimac’s student model and outlines future work to address these limitations. Section 6 describes related work on adaptive instruction in tutorial dialogue systems, and Section 7 focuses on one limitation of TDSs that needs to be addressed in order to promote sustained use of these systems in the classroom. We conclude by tying our work on incorporating student modelling in Rimac to Jim’s vision for ITSs.

Illustrating the Problem: The Limited Adaptivity of Tutorial Dialogue Systems

The amount of temporary support or “scaffolding” provided during human tutoring is contingent upon the learner’s level of understanding or skill in carrying out a task (Belland 2014; van de Pol et al. 2010; Wood and Middleton 1975). For example, studies of parent-child interactions during problem-solving tasks (e.g., Pino-Pasternak et al. 2010; Pratt et al. 1992; Pratt et al. 1988; Pratt and Savoy-Levine 1998; van de Pol et al. 2010; van de Pol et al. 2015; Wood and Middleton 1975) have found that parents dynamically adjust the support that they provide to align with the child’s zone of proximal development (ZPD), defined as “the conceptual space or zone between what a child is capable of doing on his or her own and what he or she can achieve with assistance from an adult or more capable peer” (Reber et al. 2009; Vygotsky 1978). The hallmark of effective scaffolding in these studies is a high frequency of adherence to the Contingent Shift Principle (Wood and Middleton 1975): “If the child succeeds, when next intervening offer less help. If the child fails, when next intervening take over more control” (p. 133). These studies, and related research on scaffolding in other contexts—for example, teachers’ guidance of peer group interactions in the classroom (e.g., van de Pol et al. 2015)—indicate that tutorial dialogue systems should strive to emulate the contingent scaffolding of human tutoring.

Unlike human tutors, most tutorial dialogue systems tailor instruction to the student only to a limited extent. Many TDSs implement a popular framework called Knowledge Construction Dialogues (KCDs), which step all students through the same pre-scripted “directed line of reasoning” (DLR) (Hume et al. 1996), regardless of the student’s ability in the targeted content. (Table 2 presents an example of a DLR.) Individualized instruction is limited to the tutor’s deviations from the main path in the DLR, when the student answers incorrectly and the tutor launches a remedial sub-dialogue. Tutoring returns to the DLR’s main path when a remediation has completed (e.g., Ai and Litman 2011; Chi et al. 2014; Evens and Michael 2006; Lane and VanLehn 2005; Litman and Forbes-Riley 2006; Rosé et al. 2006; Ward and Litman 2011). Similarly, dialogues in the AutoTutor family of TDSs follow an approach that is based on observations of human tutoring called Expectation and Misconception-Tailored (EMT) dialogue (e.g., Graesser et al. 2017a). Each dialogue sets an agenda of expectations (anticipated, correct responses) that must be covered at some point during the dialogue. The tutor remediates when a student’s response reveals a misconception (flawed or missing knowledge).

Both of these approaches can cause frustration when students feel that the tutor is forcing them to engage in lengthy discussions about material that they already know and not addressing content that they need help with (e.g., Kopp et al. 2012; Jordan et al. 2018). Negative affective responses are worth heeding, in light of evidence that boredom and frustration with a tutoring system predict poor learning gains (Baker et al. 2010).

To illustrate the limited adaptivity of most TDSs, let’s consider two cases that apply the KCD approach. The first case is illustrated in the dialogue shown in Table 1 (Footnote 2). The dialogue script (DLR) shown at the top of Table 2 generated this dialogue. The DLR is also represented in the lower part of Table 2 as a directed graph that can produce alternate paths through the dialogue. Each node includes a question that the tutor asks the student as he or she progresses through the DLR. The graph shows that tutoring during this dialogue can take place at three levels of granularity. The primary (P) level requires the most inferencing on the student’s part (low granularity); the intermediate or secondary (S) level breaks down the steps between P-level nodes (moderate granularity); and the tertiary (T) level provides background knowledge needed to answer S- and P-level questions correctly (high granularity). The simulated student whose dialogue is included in Table 1 clearly needs scaffolding to understand the reasoning that leads to a correct response to the reflection question. This student answered the reflection question and the question in T2 (node P2) incorrectly, indicating that he or she does not understand that changes in the normal force govern a person’s perception that their weight changes as they ride in an elevator—not their actual weight, which is nearly constant. The remedial sub-dialogue provided by the DLR’s secondary (S) path (turns T3-T8) therefore seems a fitting response to this student’s apparently poor understanding.

Now consider the case of a more knowledgeable student who has a correct perception of the physical phenomenon addressed in this problem (i.e., the perception that one’s weight changes with motion), understands that weight is independent of motion, and understands that a change in the normal force (not weight) is what one perceives while riding an elevator. This student should be allowed to move on to a more challenging problem, provided that he or she answers the main reflection question (RQ) correctly. However, in TDSs that implement the KCD approach—including Rimac before we incorporated student modelling—the shortest path that this student could take through this dialogue would be RQ➔P1➔P2➔Recap, as shown in Table 3. Even a relatively short dialogue such as this would likely be a waste of this student’s time.

Table 3 The shortest path through the graph shown in Table 2

Adherence to the standard KCD approach can also be insufficiently adaptive for less knowledgeable students. KCDs tacitly assume that traversing the primary path through a DLR, temporarily shifting to secondary and lower paths only when necessary for remediation, is appropriately challenging for all students. However, this is often not the case. The primary path through a dialogue script typically requires more background knowledge and reasoning ability than many students bring to a conceptual problem. A tutoring system with a student modelling component could use available data (e.g., the student’s performance on a pretest, course exams, homework assignments) to predict that the student is not yet ready to answer particular P-level questions. The tutor could then immediately conduct the dialogue at a finer granularity level—at the S level and, as necessary, the T level, etc. In other words, adaptation could begin even before the tutor asks the first question in a dialogue script, based on the information represented in the student model. This more adaptive tutoring behavior would emulate what experienced human tutors do: ask questions that are within students’ capability to answer correctly, look for cues that students need some support (e.g., a delayed response), and provide that support in order to avert an incorrect answer (Fox 1991).

The central hypothesis motivating our work is that tutorial dialogue systems would be more effective and efficient if they could consult a student model to provide individualized, knowledge-based instruction, as most effective problem-solving ITSs do. “Knowledge-based” means that information about student characteristics is represented in a student model, such as the student’s understanding of curriculum elements (knowledge components), demographic information, affective traits such as interest in the subject matter, engagement, self-efficacy, etc. (Chi et al. 2011). The absence of such information about students forces designers of tutorial dialogue systems to make a “best guess” about how to structure a KCD—that is, what the main line of reasoning should be, what remedial or supplemental sub-dialogues to issue, and when—and then to hard code these guesses into dialogue scripts. The consequence is that students are often underexposed to material they don’t understand and overexposed to material they firmly grasp. The first problem renders these systems ineffective while the second makes them inefficient, as shown in the examples discussed in this section.

Research by Kopp et al. (2012) indicates that a high dosage of interactive tutoring is not always necessary. The authors compared two versions of AutoTutor. The standard (control) version presented six conceptual questions to students in a research methods course. The dialogues associated with these questions addressed all expectations (targeted KCs), remediating when students responded incorrectly. An alternative, experimental version engaged students in dialogues about three conceptual questions and then presented three additional questions but did not engage students in dialogues. Instead, each conceptual question was followed by a canned response and explanation. The authors found that students in the experimental condition learned as much as students in the control condition, but they did so more efficiently (i.e., in less time). A follow-up study showed that it was just as effective to present the three highly interactive dialogues before the three static (“question + canned response”) exercises as the reverse ordering.

These findings prompted us to design and develop a student modelling engine for Rimac that could increase learning efficiency in a more adaptive manner than the approach taken in Kopp et al.’s (2012) studies, which arbitrarily alternated between intensive dialogue and no dialogue. Specifically, the student model would enable the tutor to choose which reflection questions to discuss further after a student responds to a reflection question correctly, and at what level of granularity.

Student Modelling in Rimac

Overview of Rimac

Students’ failure to grasp basic scientific concepts and apply this knowledge to problem solving has been a persistent challenge, especially in physics education. Even students who are adept at solving quantitative problems often perform poorly on qualitative problems, and misconceptions can linger throughout introductory college-level physics courses (Halloun and Hestenes 1985; Mestre et al. 2009). Rimac provides a research platform to examine how to develop an adaptive tutoring system that can enhance students’ conceptual knowledge about physics.

Rimac’s conversations with the student are implemented as Knowledge Construction Dialogues (KCDs), as described in Section 2 and illustrated in Tables 1-3 and Fig. 1. Dialogues are authored using the TuTalk dialogue development toolkit, which allows domain experts to construct natural-language dialogues without programming (Jordan et al. 2006). Authors can focus instead on defining the tutorial content and its structure. This rule-based approach to dialogue, coupled with encouraging short answers at each student turn, increases input understanding and affords greater experimental control than do alternative approaches (e.g., Dzikovska et al. 2014).

Field trials conducted to test and refine Rimac typically involve having students complete an assignment during class or as homework. Since Rimac is not a problem-solving tutor, students solve quantitative physics problems on paper. They enter their answer to each problem in a box in the tutor’s interface and then have the option to watch a video that presents a brief (~4–5 min), narrated worked example of a correct solution. (See Fig. 1.) Because worked examples have consistently been shown to support learning (e.g., Atkinson et al. 2000; Cooper and Sweller 1987; McLaren et al. 2016; Sweller and Cooper 1985; van Gog et al. 2006), we use worked examples to provide students with feedback on problem solving. Rimac presents a series of conceptually focused reflection questions (RQs) about each just-solved problem, such as the RQs shown in Table 1 and Fig. 1. Students engage in a conversation about each RQ by typing responses to the tutor’s questions. Although Rimac’s dialogues supplement quantitative problems, the tutor’s dialogues could alternatively be presented independent of problem-solving exercises, as a tool to enhance students’ conceptual understanding, scientific reasoning and explanation skills.

Each step in Rimac’s dialogues is associated with a set of learning objectives or knowledge components (KCs). For example, referring to the dialogue and DLR shown in Tables 1 and 2, respectively, the tutor’s question in T1 (node P1) addresses the KC, The weight (or gravitational force) of an object is the same regardless of whether or not the object is accelerating; the tutor’s question in T6 (node S5) addresses the KC, For an object accelerating upward at a constant rate, the upward normal force must be larger than the downward gravitational force. Rimac’s student modelling component initializes its assessment of each KC that its dialogues address based on students’ responses to pretest questions that target these KCs. It then uses students’ responses to dialogue questions to dynamically update the student model’s assessment of the KCs associated with each question, as described in the next section and in Chounta et al. (2017a). [See also Albacete et al. 2019.]

Rimac currently does not support student initiative through question asking. The main reason is that a repair mechanism would be necessary if student initiative were encouraged, to enable the system to develop a shared understanding of the student’s initiative, which may include topic shifts. In our opinion, the technology for handling conversational repair and understanding needs to improve dramatically before student initiative can be supported. We consider this an important goal for future research, in light of abundant research that demonstrates the instructional benefit of question asking (e.g., Gavelek and Raphael 1985; King 1990, 1994; Palincsar 1998) and the substantially higher frequency of student questions asked during one-on-one human tutoring than in the classroom (Graesser and Person 1994).

How Rimac Produces a Student Model and Uses it to Guide Adaptive Tutoring

Overview

Rimac’s student model enables the tutor to emulate the contingent (adaptive) scaffolding provided by human tutors. Specifically, Rimac implements domain contingency and instructional contingency (Katz et al. 2018; van de Pol et al. 2010; Wood 2001, 2003). Domain contingency entails selecting appropriate content to focus on while a student performs a task, while instructional contingency entails addressing this content with an appropriate amount of support.

To achieve domain contingency in Rimac, we developed different versions of each dialogue, each version corresponding to a line of reasoning (LOR) at a different level of granularity. When embedded in dialogues, these alternative LORs can be represented as a directed graph, as shown in Table 2 (Albacete et al. 2019). To achieve instructional contingency, we developed different versions of the questions, hints, feedback, and explanations that are associated with each step of a DLR. Each version provides a different level of support (i.e., low, medium, or high support), chosen based on the student model’s assessment of how likely the student is to correctly answer the question asked at a dialogue step. For example, consider these two versions of the same core question: “What is the man’s acceleration?” versus “Given that the net force applied on the man is 2N, and thinking about Newton’s Second Law, what is the man’s acceleration?” The latter question provides more support than the former because it reminds the student what the value of the net force is and that net force and acceleration are mathematically related (Katz et al. 2018).

Rimac always aims for mastery by selecting the question at the lowest possible level of granularity that the student is likely to answer correctly—that is, the question bearing the least discussion about relevant background knowledge. The tutor consults the student model to make this choice. Rimac represents a student model as a regression equation, implemented using the Instructional Factors Analysis Model (IFM) proposed by Chi et al. (2011). The system starts with a default stereotypical model for low, medium, and high prior knowledge students, classifies the student into one of these groups based on the student’s overall pretest score, and then adjusts the default model based on the student’s responses to pretest items and dialogue questions. This section describes the process of initializing and updating a student model in Rimac and using the model to guide individualized instruction. (For more detail, see also Albacete et al. 2018, Chounta et al. 2017a, and Katz et al. 2018.)
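To make the regression-based formulation concrete, the minimal sketch below shows how a logistic-regression student model of this kind can estimate the probability of a correct response from the student's current proficiency on the KCs that a question targets. The KC names, weights, and bias are illustrative assumptions, not Rimac's actual parameters, and the simple weighted sum stands in for the full IFM feature set.

```python
import math

def predict_correct_probability(kc_weights, kc_proficiencies, bias=0.0):
    """Estimate P(correct) for a question as a logistic function of the
    student's proficiency on the KCs that the question targets.

    kc_weights:       illustrative per-KC regression coefficients
    kc_proficiencies: current estimates of the student's mastery (0..1)
    """
    logit = bias + sum(kc_weights[kc] * kc_proficiencies.get(kc, 0.0)
                       for kc in kc_weights)
    return 1.0 / (1.0 + math.exp(-logit))

# Hypothetical KCs loosely based on the elevator problem in Tables 1-2.
weights = {"weight_independent_of_motion": 1.8,
           "normal_force_perception": 2.1}
proficiency = {"weight_independent_of_motion": 0.7,
               "normal_force_perception": 0.3}

p = predict_correct_probability(weights, proficiency, bias=-1.0)
print(f"Predicted probability of a correct response: {p:.2f}")
```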

Initializing the Student Model and Customizing it for a Particular Student

We used stereotyping (Tsiriga and Virvou 2002) to initialize Rimac’s student model. Training data enabled us to specify three student “personas” that the system could use to classify students based on their pretest performance: low, medium, and high prior knowledge students. The training data consisted of 560 students’ pretests and dialogues over a four-year period (2011–2015). We used K-Means to cluster students in the training data set based on their pretest performance because we had previously found that students’ average pretest scores correlated positively with the student model’s predictions (Chounta et al. 2017a). Also, pretest data (and/or experts’ assessment data) is typically used to initialize student models because it enables initialization of KC-specific parameters, such as difficulty level and prior knowledge (Gong 2014).

The appropriate number of clusters in K-means is typically determined by plotting the within-groups sum of squares by number of clusters and choosing the point where the plot bends (Hothorn and Everitt 2014). The sum of squares plot shown at the left side of Fig. 2 indicated that three was the appropriate number of clusters for the training data. We validated that these clusters represent groups of “high,” “medium,” and “low” prior knowledge students based on students’ average pretest scores (Fig. 2, right). We then trained an instance of the student model for each persona (cluster). This yielded a better prediction accuracy score than did using the whole training data set, so we decided to use the three generic student models that resulted from this training step, as illustrated in Fig. 3: the Low Prior Knowledge persona, the Medium Prior Knowledge persona, and the High Prior Knowledge persona.
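The following sketch illustrates this clustering step under stated assumptions: it applies scikit-learn's KMeans to synthetic pretest scores (the real training data are not reproduced here), prints the within-cluster sum of squares used for the elbow inspection, and labels the three resulting clusters as low, medium, and high prior knowledge personas by ordering their centers.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic stand-in for the 560 students' average pretest scores (0..1).
rng = np.random.default_rng(0)
pretest_scores = np.clip(np.concatenate([
    rng.normal(0.35, 0.08, 200),   # lower-scoring students
    rng.normal(0.55, 0.08, 200),   # mid-scoring students
    rng.normal(0.80, 0.07, 160),   # higher-scoring students
]), 0, 1).reshape(-1, 1)

# Elbow method: inspect the within-cluster sum of squares (inertia) by k
# and look for the bend; here we simply print the values.
for k in range(1, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(pretest_scores)
    print(k, round(km.inertia_, 2))

# With k = 3, name the clusters low/medium/high by ordering their centers.
km3 = KMeans(n_clusters=3, n_init=10, random_state=0).fit(pretest_scores)
order = np.argsort(km3.cluster_centers_.ravel())
personas = {int(cluster): name
            for cluster, name in zip(order, ["low", "medium", "high"])}
print([personas[int(c)] for c in km3.predict([[0.30], [0.55], [0.85]])])
```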

Fig. 2

The within-groups sum of squares by number of clusters, used to determine the optimal number of clusters (left), and the results of the K-means clustering process (right)

Fig. 3

The student model initialization process for a new student

Each student who uses the tutor for the first time takes an online pretest, which enables the system to classify the student according to one of these personas. Rimac initializes a student’s model by using the generic student model that coincides with this persona. It then personalizes the student model by analyzing the student’s responses to test items that target particular knowledge components. We developed a program called the Multiple-choice Knowledge Assessment Tool (McKnowAT) to automatically assess students’ understanding of the KCs associated with a given multiple choice test item, based on the student’s selection (or non-selection) of the item’s options (Albacete et al. 2017b).
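As an illustration of this initialization step, the sketch below classifies a new student into one of the three personas from an overall pretest score, copies that persona's generic per-KC estimates, and then nudges individual KC estimates toward the student's performance on pretest items that target them. The cut points, generic values, and blending weight are hypothetical, and the item-level adjustment is only a stand-in for the McKnowAT analysis.

```python
def classify_persona(overall_pretest_score):
    """Map an overall pretest score (0..1) to a prior-knowledge persona.
    The cut points are illustrative, not Rimac's actual boundaries."""
    if overall_pretest_score >= 0.70:
        return "high"
    if overall_pretest_score >= 0.45:
        return "medium"
    return "low"

# Hypothetical generic (stereotype) models: default per-KC mastery estimates.
GENERIC_MODELS = {
    "low":    {"weight_independent_of_motion": 0.30, "normal_force_perception": 0.25},
    "medium": {"weight_independent_of_motion": 0.55, "normal_force_perception": 0.50},
    "high":   {"weight_independent_of_motion": 0.80, "normal_force_perception": 0.75},
}

def initialize_student_model(overall_score, kc_item_assessments):
    """Start from the persona's generic model, then blend each KC estimate with
    the student's own performance on pretest items that target that KC
    (a stand-in for the McKnowAT item-level assessment)."""
    model = dict(GENERIC_MODELS[classify_persona(overall_score)])
    for kc, item_score in kc_item_assessments.items():
        if kc in model:
            model[kc] = 0.5 * model[kc] + 0.5 * item_score  # illustrative blend
    return model

print(initialize_student_model(0.62, {"normal_force_perception": 0.0}))
```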

Updating the Student Model

As the student progresses through a dialogue, the student model is dynamically updated based on the student’s responses to the tutor’s questions. Each dialogue exchange (pair of tutor-student dialogue turns) is treated as a training instance, represented by the KCs that the exchange addresses and the status of the student’s response to the tutor’s question (i.e., correct or incorrect). For example, in the dialogue excerpt shown in Table 1, the student answers the question asked in turn T2 incorrectly. This turn maps to primary path node P2 (“What force are you actually perceiving whenever you feel heavier or lighter during an elevator’s motion?”). The main KC associated with this node is: When a person is aware of how heavy they feel, they are perceiving the magnitude of the normal force acting on them, not their actual weight. Consequently, the student model will downgrade its assessment of this KC, expressed as the probability that the student understands it.
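A minimal sketch of this update step follows, assuming a simple additive adjustment: each tutor-student exchange is treated as a training instance (the KCs it addresses plus the correctness of the response), and the estimates for those KCs are moved toward 1.0 or 0.0 accordingly. The additive rule and learning rate are illustrative simplifications of refitting the regression-based model.

```python
def update_student_model(model, exchange_kcs, response_correct, learning_rate=0.1):
    """Treat one tutor-student exchange as a training instance: move the
    estimate for each KC the exchange addresses toward 1.0 after a correct
    response and toward 0.0 after an incorrect one."""
    target = 1.0 if response_correct else 0.0
    for kc in exchange_kcs:
        current = model.get(kc, 0.5)
        model[kc] = current + learning_rate * (target - current)
    return model

# Turn T2 in Table 1: the student answers the question at node P2 incorrectly,
# so the estimate for the associated KC is downgraded.
model = {"normal_force_perception": 0.55}
update_student_model(model, ["normal_force_perception"], response_correct=False)
print(model)
```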

The student modelling system updates the student model after every exchange. In this way, the model maintains the most current image of the student’s knowledge state. In addition to being kept up to date, each student model is persistent—that is, carried over from one problem and dialogue to the next, one assignment to the next. In contrast, most TDSs that implement some form of student modelling create the student model anew during each dialogue.

Choosing the Next Question to Ask during a Dialogue

Few students follow exactly the same path through one of Rimac’s dialogue scripts, such as the DLR shown in Table 2. Some transitions from one question to the next are hard coded into the dialogue script; that is, no decision needs to be made about which question to ask next. (These predefined transitions are indicated by black arrows in Table 2.) Predefined transitions help to keep dialogue length manageable and to reduce cognitive load. They often correspond to knowledge that was discussed in a previous RQ and summarize those KCs so that the student can focus on the goal(s) of the current RQ. To illustrate, this is the case with nodes S5 and S6 shown in Table 2. These nodes address KCs that were discussed in detail in a previous RQ (i.e., the relationship between normal force and weight in an accelerating or decelerating elevator). Hence, when a student answers the question asked at S5 (“At the very beginning of the upward trip, how does the normal force compare to the weight?”) incorrectly, the tutor remediates by summarizing these previously discussed KCs before proceeding to the next question asked at S6 (“At the very end of the trip, how does the normal force compare to the weight?”), as shown in tutor turn T7 of Table 1. Predefined transitions also support dialogue coherency, because some questions need to be asked in sequence, as is the case with nodes S3 and S4, and S5 and S6. In addition, some predefined transitions reflect “bottom out” remediations; the tutor provides the correct answer and there is no lower level line of reasoning to discuss.

Other nodes in a dialogue’s DLR require a decision about which node to traverse to next. These decisions are based on the student model’s predictions about the likelihood that the student will answer the question asked at candidate “next step” nodes correctly. For example, if a student answers the RQ correctly, the tutor must first decide if the student needs to discuss this problem further, based on the student model’s assessment of the learner’s understanding of the most relevant KCs associated with this RQ, as discussed in Jordan et al. (2016) and Albacete et al. (2019). If so, the tutor’s next decision is to choose an appropriate level at which to conduct this discussion: Should the discussion start at the primary, secondary, or tertiary level? (Transition choice points are indicated by blue arrows in Table 2.)

When making these decisions, the tutor strives to balance challenge with potential success. Rimac uses logistic regression to predict the probability of a student answering the question asked at each candidate node correctly as a linear function of the student’s proficiency on the associated KCs. The tutor then uses the classification threshold to interpret the meaning of this probability. The “classification threshold” is the probability that determines which class will be chosen, such as correct vs. incorrect. It is shown as a dotted line in Fig. 4. In Rimac, this threshold was determined to be 0.5 (Chounta et al. 2017a). Hence, if the predicted probability of a student answering a question correctly is >= 0.5, it is interpreted as “the student is likely to answer the question correctly”; otherwise, it is interpreted as “the student is likely to answer the question incorrectly.”

Fig. 4

The Grey Area construct with respect to fitted probabilities (i.e., the probability that a student will answer a particular step correctly), as predicted by the student model for a random student and for the steps in a conceptual problem-solving task (RQ)

As a prediction gets closer to 0.5, there is increasing uncertainty: ultimately, a 50% chance that the student will answer the question correctly and a 50% chance that the student will answer it incorrectly. Consequently, this probability cannot be interpreted reliably. We refer to the region of high uncertainty between 0.4 and 0.6 as the “Grey Area” (Chounta et al. 2017a, 2017b), as shown in Fig. 4. The student model’s high degree of uncertainty in its predictions within the Grey Area might simply be due to insufficient evidence; that is, the system has not yet accrued enough data to assess the student’s understanding of the material that a question addresses. Alternatively, high uncertainty might reflect the student’s unsteady command over this material. Perhaps the student is on the brink of understanding but has not yet sufficiently mastered the KCs associated with a question—as evident, for example, in the student’s inconsistent responses to items on the pretest and in previous questions that address these KCs. With these considerations in mind, we proposed that predictions in the Grey Area might indicate that the student is in the ZPD for the targeted material. Correspondingly, the Grey Area might afford a computational model of the ZPD (Chounta et al. 2017a, 2017b).

Once the system has interpreted the student model’s output (i.e., its predictions of success at each candidate “next node”), the tutor chooses the question at the lowest level of granularity whose predicted probability of a correct response is at least 0.4—in other words, within or above the Grey Area. The student is expected either to be able to answer this question correctly (i.e., the probability of a correct response is >= 0.6) or to be within his or her ZPD for that question (i.e., the probability of a correct response is 0.4–0.6). An exception to this policy pertains to questions belonging to the expert (P) level LOR. For these questions, the tutor takes a more cautious approach, only asking them if it is quite certain that the student will answer correctly; that is, if the predicted probability of a correct response to the expert question is at least 0.6 (60%). The system examines each possible next question, starting with the P level and moving down. It checks the probability of each of these questions in sequence (P, then S, then T level, and further down if possible) until it finds a question that it can ask, according to the student model’s selection policies. For example, again referring to the DLR shown in Table 2, if the student answers the top-level RQ correctly and the tutor’s prediction of a correct response is 0.6 for nodes P1 and S1, the tutor will ask the question at P1 because this question is more challenging than the one at S1 (i.e., the question at P1 requires more inferencing). If the last question asked was at P1, the probability of a correct response at P2 is 0.4, and the probability of a correct response at S3 is 0.75, the tutor will traverse from P1 to S3—again with the aim of balancing challenge and potential success. P2, the expert-level question, will not be asked because its probability of being answered correctly is below 0.6. S3 will be asked instead, since its probability of a correct response is at least 0.4.
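The selection policy described above can be summarized in a short sketch, shown below under the assumption that candidate nodes are considered from the P level downward together with their predicted probabilities of a correct response. Node names mirror the example in Table 2; the function only illustrates the stated thresholds (>= 0.6 for P-level questions, >= 0.4 otherwise) and is not Rimac's actual implementation.

```python
def choose_next_question(candidates, predicted_probability):
    """Select the next question following the stated selection policy.

    candidates: candidate "next step" nodes ordered from the P (expert) level
                downward, e.g. [("P2", "P"), ("S3", "S"), ("T2", "T")]
    predicted_probability: maps a node id to the student model's estimate that
                the student will answer that node's question correctly
    """
    for node, level in candidates:
        p = predicted_probability[node]
        if level == "P":
            if p >= 0.6:      # ask expert-level questions only when fairly certain
                return node
        elif p >= 0.4:        # within or above the Grey Area
            return node
    # Assumed fallback: use the finest-grained candidate (asked with high support).
    return candidates[-1][0]

# Example from the text: after P1, P2 is predicted at 0.4 and S3 at 0.75,
# so the tutor traverses from P1 to S3.
print(choose_next_question([("P2", "P"), ("S3", "S")], {"P2": 0.4, "S3": 0.75}))
```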

By taking this approach to choosing the next question, Rimac treats the uncertainty of student modelling as a feature, not as a bug. It uses the model’s estimates of how likely a student is to succeed at the next step in a DLR to choose an appropriate level at which to address this step. Few predictions will approximate 1.0—that is, complete confidence that the student will answer the question asked at a candidate “next step” node correctly, as illustrated in Fig. 4. For example, slips sometimes happen, even when students have mastered the KCs associated with a question. The student model is nonetheless robust enough to adaptively choose the next step and to ask the question at an appropriate level of support, as described next.

Adapting the Level of Support Provided at each Dialogue Step

The Contingent Shift Principle that underlies effective scaffolding during human tutoring (e.g., Pratt and Savoy-Levine 1998; Wood 2001; Wood et al. 1976) is partly realized in Rimac when the tutor selects the next step in a dialogue script. The tutor’s predictions about the student’s probability of answering each candidate “next step” correctly allow it to determine whether to ask a question at the P level, S level, T level, etc., which often results in shifts across levels—for example, asking a question that requires a high amount of inferencing on the student’s part at one step (e.g., a P-level question), then shifting to a question that requires a moderate amount of inferencing at the next step (e.g., an S-level question).

Rimac’s job is not done when it adaptively chooses the next step in a line of reasoning, simulating the domain contingency observed in effective human tutoring and teaching (Brownfield 2016; Brownfield and Wilkinson 2018; Rodgers et al. 2016; van de Pol et al. 2010; Wood 2001). The tutor must also simulate instructional contingency by deciding how to ask the question at this step and how to implement other tutoring strategies (e.g., hints, explanations, feedback). As is the case with choosing the next step, adjusting the support provided in a question, hint, etc. is tantamount to emulating the Contingent Shift Principle, because the level of support that the tutor provides at one step might shift up or down at the next step, based on the tutor’s assessment of the student’s need for support in order to answer the question at these two steps correctly. For example, if the tutor decides to ask a question at the P level, the student will always receive low support, because the student model is fairly certain that the student will answer this question correctly (i.e., probability >= 0.6). Conversely, if the probability of a correct response to a question is < 0.4—that is, below the Grey Area—the tutor is fairly certain that the student will answer incorrectly and will therefore provide ample support. Since there is less certainty within the Grey Area [0.4–0.6] regarding how much support the student will need to answer the question correctly, we divide this region into three segments: lower third [0.40–0.47], middle third [0.47–0.53], and upper third [0.53–0.60], corresponding to high, medium, and low support, respectively. This policy for choosing how much support to provide when the probability of success lies within the Grey Area pertains only to S-level and T-level questions. P-level questions are always asked with low support.
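A sketch of this support-selection policy follows; it simply encodes the thresholds and Grey Area thirds described above (P-level questions always receive low support) and is not Rimac's actual code.

```python
def choose_support_level(level, p_correct):
    """Map the predicted probability of a correct response to a support level
    (low / medium / high), following the policy described above."""
    if level == "P":
        return "low"        # P-level questions are always asked with low support
    if p_correct >= 0.6:
        return "low"        # fairly certain the student will answer correctly
    if p_correct < 0.4:
        return "high"       # fairly certain the student will answer incorrectly
    # Within the Grey Area [0.4, 0.6): split into thirds.
    if p_correct < 0.47:
        return "high"
    if p_correct < 0.53:
        return "medium"
    return "low"

print(choose_support_level("S", 0.55))  # upper third of the Grey Area -> low support
```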

Rimac’s dialogue authors produce several variations of questions, feedback on students’ responses, hints, explanations, and other tutoring strategies for each dialogue step. These variations allow the tutor to choose an appropriate level of support (low, medium, or high), given the predicted probability that the student will answer the question at a given step correctly. We specified guidelines for dialogue authors to use to generate alternative forms of questions, hints, etc. and decision rules to guide the tutor in choosing among these alternatives. These authoring guidelines and decision rules operationalize what it means to provide a “high” level of support, versus a “moderate” level, versus a “low” level when implementing tutoring strategies (Katz et al. 2018).

Fortunately, several scaffolding researchers and developers of teacher professional development programs have also faced the challenging task of defining different levels of support (LOS)—for example, in order to measure the frequency of contingent shifts observed in various instructional contexts (e.g., Pratt et al. 1992; Pratt and Savoy-Levine 1998; van de Pol et al. 2014; Wood et al. 1978). This work has yielded many frameworks that specify different levels of teacher support, or different levels of cognitive complexity, depending on whether a framework differentiates levels from the teacher’s or student’s perspective, respectively. Table 4 provides an example of a framework that distinguishes levels of support according to how much control the teacher exerts during small group problem-solving tasks in the classroom (van de Pol et al. 2014). This framework has been used or adapted in several teacher training programs and scaffolding studies (Brownfield 2016; Brownfield and Wilkinson 2018; Rodgers 2017; Rodgers et al. 2016; San Martin 2018; van de Pol et al. 2019; van de Pol et al. 2014, 2015). However, the descriptions of each level of support (control) provided in this and most frameworks are not specific enough to develop authoring guidelines to generate alternative questions, hints, etc., and decision rules that an automated tutor can consult to choose among these alternatives. For example, referring to the sample LOS framework in Table 4, it is unclear how to produce a “broad and open question” at Level 1, versus a “more detailed but still open” question at Level 2, versus “a hint or suggestive question” at Level 4.

Table 4 Sample Level of Support framework

In order to more precisely operationalize “different levels of support,” we examined level definitions across several LOS frameworks, and transcripts from human physics tutoring sessions, with the aim of identifying factors that dialogue authors could use to adjust the level of support afforded by alternative forms of questions, hints, etc. (Katz et al. 2018). For example, some factors that render a question more (or less) abstract include:

  • Does the question refer to objects included in the problem statement (e.g., “What is the velocity of the bicycle?” vs. “What is the velocity?”)?

  • Does the question provide a hint and, if so, what type of hint—one that states a piece of information needed to answer the question correctly, or one that prompts the student to recall this information on his or her own—that is, a convey information hint versus a point to information hint, respectively (Hume et al. 1996)?

  • How much and what type of information should be included in the question—for example, a reference to the answer to the previous question; a list of possible answers to choose from [e.g., “how is the velocity varying (increasing, decreasing, constant, etc.)?” versus “how is the velocity varying?”]?

  • In a quantitative domain such as physics, does the question refer to a law or principle in equation form or sentence form—for example, “a = Δv/Δt” vs. “acceleration is defined as the change in velocity over the time interval”?

These factors were incorporated in authoring guidelines and decision rules for questions, as illustrated in Table 5.

Table 5 Rules for generating and selecting questions at different levels of support

Pilot Studies to Assess the Student Model Driven Version of Rimac

We conducted two classroom-based studies to gain an initial sense of whether incorporating student modelling within tutorial dialogue systems is worth the effort, as measured by learning gains, more efficient learning, or both. In both studies, we found an advantage for efficiency, measured by time on task, but not for effectiveness, measured by pretest to posttest gain scores. We summarize these studies in this section. [See Jordan et al. 2017 for more detail on Study 1 and Albacete et al. 2019, for more detail on Study 2.]

Study 1

Does a TDS that uses a “poor man’s student model” to decide when to decompose steps into lower-level sub-steps promote more efficient learning than a TDS that always decomposes steps into sub-steps?

We predicted that the answer to this question would be yes, because micro-level tutoring at the sub-step level can be time-consuming. To test our hypothesis, we compared two versions of Rimac. Dialogues in the control version behave as dialogues developed using the KCD framework typically do; they take a cautious approach to instruction. That is, even when a student answers the tutor’s question at a given step during the dialogue correctly, the tutor will address the step’s sub-steps, in case the student’s correct response stemmed from a lucky guess or from incomplete reasoning that nonetheless was “good enough.” For example, referring to the DLR shown in Table 2, a student in the control group would not be allowed to skip the dialogue associated with this RQ if he or she were to answer this RQ correctly. Similarly, if the student were to answer a P-level question correctly, the student would still be required to traverse this step’s associated S- and T-level nodes. For example, after the student answers the question at P1 correctly, the tutor would choose the following path leading to P2: S1➔T1➔S2➔P2.

In contrast, dialogues in the experimental version of Rimac use students’ pretest performance to decide whether to decompose a step into its corresponding sub-steps, provided that the step is decomposable. The student model bases this decision on the predicted probability that the student already understands the KCs associated with these sub-steps, measured as 0.8 or above for the top-level RQ and 0.5 or above for decomposable steps in the DLR. For example, if the student model does not allow the student to skip the DLR in Table 2 because the predicted probability that the student understands the main concepts associated with this RQ is less than 0.8, the student could still proceed through a relatively short path (e.g., RQ➔P1➔P2➔Recap), as long as the predicted probabilities of success at P1 and P2 are above 0.5.
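A minimal sketch of this pretest-based policy, using the thresholds stated above, is shown below; the KC names and the exact form of the probability inputs are assumptions for illustration.

```python
def can_skip_dialogue(rq_kc_probs):
    """Skip the whole reflection-question dialogue if the predicted probability
    that the student already understands each of the RQ's main KCs is >= 0.8."""
    return all(p >= 0.8 for p in rq_kc_probs.values())

def should_decompose_step(step_kc_probs):
    """Decompose a decomposable step into its sub-steps unless the student is
    predicted to understand the sub-steps' KCs (probability >= 0.5)."""
    return any(p < 0.5 for p in step_kc_probs.values())

print(can_skip_dialogue({"kc1": 0.85, "kc2": 0.90}))      # True: skip the RQ
print(should_decompose_step({"kc3": 0.45, "kc4": 0.70}))  # True: go into sub-steps
```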

Data from 72 students enrolled in physics classes at 3 high schools were included in the study (N = 35 Control; N = 37 Experimental). Students in both conditions took an online pretest in class. They then solved two quantitative problems on paper. For each problem, students had the option to watch a worked-example video in the tutor that provided feedback on problem solving but no conceptual information. After each problem, they engaged in dialogues that addressed the concepts associated with the just-solved problem (five RQs total, across the two problems). Finally, students took an online post-test in class and completed a user satisfaction survey.

Data analysis revealed significant pretest to posttest gain scores in both conditions. However, neither condition was more beneficial for learning than the other and no aptitude-treatment interactions were observed. Both high and low prior knowledge students learned a comparable amount, as measured by pretest to post-test gain score, regardless of which condition they were assigned to. The only significant difference between conditions was an efficiency advantage for high prior knowledge students in the experimental condition. The mean time for this group of students to complete the dialogue intervention was 34 min in the experimental condition versus 46 min in the control condition. More importantly, this gain in efficiency did not come at a cost to learning. High prior knowledge students across conditions showed comparable gain scores.

Study 2

Does a TDS whose student model is dynamically updated at each dialogue step promote higher learning and efficiency gains than a TDS that uses a static student model?

This follow-up to Study 1 compared two student model guided versions of Rimac. The control version was similar to the experimental version in the first study. The student model was initialized using students’ pretest scores and not updated further. As in Study 1, students could skip the discussion and move on to the next RQ if they answered the current RQ correctly and their pretest scores on all of the most relevant KCs were 0.8 or above. Otherwise, if the student could not skip the RQ, he or she would be assigned to a fixed path through the line of reasoning at the expert level (P) if the student’s scores on all relevant KCs were greater than 0.7; a fixed path at a medium level (S) if the student’s scores on all relevant KCs were greater than 0.4; and a fixed path at a novice level (T or lower) otherwise. Hence, the lowest scoring KC in the set drove decision making. We refer to this condition informally as the “poor man’s student model” because it lacks the complex mechanisms needed to dynamically update the student model and use the model to select steps, as described in Section 3.2.
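For illustration, the static path-assignment rule described above might be sketched as follows, with the lowest-scoring relevant KC driving the decision; skipping also assumes a correct answer to the RQ itself, and the function is not the actual implementation.

```python
def assign_static_path(relevant_kc_scores):
    """Assign a fixed path through the line of reasoning using the static,
    pretest-based student model. The lowest-scoring relevant KC drives the
    decision (skipping the RQ also requires a correct answer to the RQ)."""
    lowest = min(relevant_kc_scores.values())
    if lowest >= 0.8:
        return "skip"      # move on to the next RQ
    if lowest > 0.7:
        return "P"         # expert-level path
    if lowest > 0.4:
        return "S"         # medium-level path
    return "T"             # novice-level path (or lower)

print(assign_static_path({"kc1": 0.9, "kc2": 0.6}))  # -> "S"
```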

In the experimental version of the tutor, the student model’s assessment for each KC was updated dynamically after each dialogue step. As described in Section 3.2, the tutor referred to the predicted probability of a correct response at each candidate “next step” to select a dialogue step. It aimed for mastery by selecting the step with the highest probability of a correct response, within the path through the DLR that requires the most inferencing. We predicted that the experimental version of the tutor would outperform the control version on learning and efficiency gains. As with the first study, only the latter prediction was realized. However, this time the experimental version showed an efficiency advantage for both high and low prior knowledge students, classified according to a median split of students’ pretest scores.

Data from 73 students enrolled in physics classes at one high school were included in the study (N = 42 Control; N = 31 Experimental). The study protocol was similar to that followed in the first study, except that students solved a few more problems and engaged in more dialogues with the tutor: 5 quantitative problems with 3–5 reflection questions per problem.

As with the first study, we found significant pretest to posttest gains across conditions, but neither condition was more beneficial with respect to learning gains than the other. This might be due to the thoroughness of the control condition. If a student were to answer a question incorrectly, he or she would go through a remedial sub-dialogue that explicitly addresses all of the material that the path the student had been assigned to (e.g., P or S) expects the student to infer.

The lack of higher learning gains for the experimental condition might also be due to the shortness of the intervention. Perhaps the instructional benefit of dynamic updating only manifests in sufficiently long interventions. Consequently, when we add more content (problems and dialogues) to the tutor, we might observe greater learning gains. This content will include challenging problems that allow high incoming knowledge students to learn new material and better understand concepts that they have not fully mastered, as well as problems that give low-performing students more practice in acquiring and applying basic concepts.

As in Study 1, no aptitude-treatment interactions were observed. However, in Study 2 both high and low prior knowledge students learned more efficiently (i.e., took less time on task) in the experimental condition than in the control condition. Specifically, students in the dynamic student modelling group went through the dialogues about 27% faster than students in the static student modelling group, as illustrated in Fig. 5. Increased efficiency is an important outcome because it indicates that the tutor focuses on material that students need help with and doesn’t spend too much time on material that students have sufficiently mastered. This leaves time for more challenging tasks and, perhaps, helps to sustain student interest.

Fig. 5

Comparing time on task between conditions for low and high prior knowledge students

Assessing the Student Model’s Performance

The experimental condition from Pilot Study 2 provided data that we could use to assess the accuracy of the student model’s predictions. The 31 participants in the experimental condition answered a total of 2603 questions. Approximately 94% of students’ responses to these questions (2436) were predicted to lie outside of the Grey Area; that is, the probability of a student answering a question correctly was either below 0.4 or above 0.6. The remaining 6% of responses (167) were predicted to lie within the Grey Area; that is, the probability of a student answering a question correctly was between 0.4 and 0.6 and therefore difficult to interpret reliably. The student model’s accuracy outside of the Grey Area was 0.61; in other words, 61% of the model’s predictions were correct. As expected, the student model’s accuracy within the Grey Area, 55%, was lower than it was outside the Grey Area.
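As a sketch of how such an evaluation can be run, the code below partitions predictions into the Grey Area and the region outside it and reports accuracy in each, using a 0.5 classification threshold and toy data rather than the pilot-study dataset.

```python
import numpy as np

def grey_area_accuracy(pred_probs, actual_correct, low=0.4, high=0.6):
    """Split predictions into the Grey Area [low, high] and the region outside
    it, and report accuracy (classification threshold 0.5) in each region."""
    pred_probs = np.asarray(pred_probs)
    actual_correct = np.asarray(actual_correct)
    predicted_correct = pred_probs >= 0.5
    in_grey = (pred_probs >= low) & (pred_probs <= high)
    report = {}
    for name, mask in [("outside Grey Area", ~in_grey), ("inside Grey Area", in_grey)]:
        if mask.any():
            report[name] = float((predicted_correct[mask] == actual_correct[mask]).mean())
    return report

# Toy data, not the pilot-study dataset.
print(grey_area_accuracy([0.9, 0.7, 0.45, 0.55, 0.2], [1, 0, 1, 0, 0]))
```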

The confusion matrix of the student model’s predictions when used as a binary classifier (correct vs. incorrect response), with a classification threshold set to 0.5, is shown in Table 6, where the positive class of “1” signifies correctness. The model’s precision and recall (sensitivity) were 0.94 and 0.62, respectively. Hence, with respect to precision, 94% of the model’s predictions that a student’s response would be correct were in reality correct. With respect to recall, 62% of correct responses were predicted as such, although the model mistakenly predicted as correct many responses that were in fact incorrect (97% of them).

Table 6 Confusion matrix of the student model when used as a binary classifier

These results indicate that the model’s classifier rarely flags answers as incorrect; in other words, our model is able to predict correct answers with better accuracy than incorrect answers. Indeed, only 3% of incorrect responses were predicted as such. This skew towards predicting correct responses may indicate overfitting. This is a plausible explanation because the dataset was imbalanced: 63% of students’ responses (1543) were correct.

Another possible explanation for the student model’s poor performance in predicting incorrect responses is that the classification threshold (0.5; see Fig. 4) might not be appropriate for this dataset. This threshold was set using a training dataset. However, the Receiver Operating Characteristic (ROC) curve for the binary classifier shown in Fig. 6 indicates a higher optimal classification threshold for the pilot study dataset, 0.81; that is, 0.8 (rounded) is the point where the classifier achieves the best balance of sensitivity (recall) and specificity. A possible explanation for this discrepancy in classification thresholds is that students who participated in the pilot study were more knowledgeable than students whose data comprise the training dataset. This would also account for the imbalanced pilot study dataset, which favors correct responses.
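The sketch below shows one common way to derive such a threshold from an ROC curve, using scikit-learn and Youden's J statistic on synthetic data; the numbers are illustrative and will not reproduce the 0.81 value reported here.

```python
import numpy as np
from sklearn.metrics import roc_curve

# Synthetic stand-in for the pilot-study data: predicted probabilities of a
# correct response and the actual correctness labels.
rng = np.random.default_rng(1)
y_true = rng.binomial(1, 0.63, 500)                       # imbalanced toward correct
y_score = np.clip(0.55 + 0.3 * y_true + rng.normal(0, 0.15, 500), 0, 1)

fpr, tpr, thresholds = roc_curve(y_true, y_score)
# Youden's J statistic: the threshold that best balances sensitivity (TPR)
# and specificity (1 - FPR).
best = np.argmax(tpr - fpr)
print(f"Optimal classification threshold: {thresholds[best]:.2f}")
```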

Fig. 6

ROC curve of the binary classifier as used in the pilot study

Future Work to Improve the Student Model’s Performance

The accuracy of the student model’s predictions outside of the Grey Area (61%) indicates that the model’s performance is heading in the right direction but there is considerable room for improvement. We have two strands of modifications planned: changes that directly address how the student model is implemented and changes that are external to the student model but could impact its performance—for example, possible improvements in knowledge representation and in the tutor’s natural-language processing capability. This section provides an overview of planned work in each strand.

Planned Enhancements to Rimac’s Student Modelling Component

In Section 4.3, we noted that differences between the training dataset and the actual dataset may have led to inaccurate specification of the classification threshold used in the pilot study. The classification threshold also impacts the model’s Grey Area boundaries (i.e., its upper and lower limits). We plan to further examine the relationship between the classification threshold and student ability, as measured by pretest performance. If we find that students’ prior knowledge has as strong an impact on where the optimal classification threshold lies as we suspect, we will routinely use students’ pretest scores to customize the classification threshold and Grey Area boundaries for each student, thereby potentially increasing the student model’s ability to accommodate students’ needs.

We also plan to augment the student model with a “slip” parameter in order to handle situations where a student may provide a wrong answer despite their having the knowledge needed to answer correctly. This feature is fundamental to other student modelling approaches, such as Bayesian Knowledge Tracing, but it is not commonly implemented in logistic regression student models, such as the IFM that we use. Nonetheless, related research has shown that “slips” can be sufficiently modelled, and that doing so improved the predictive accuracy of the student model in problem-solving tutoring systems (MacLellan et al. 2015). We expect that a slip parameter will be especially important to have in place when Rimac has enough content to support longer interventions, with greater risk that students will forget material that they previously knew.

Additionally, there are different ways to represent the data used to train and update the model that could influence student model accuracy. In Rimac, the data used to train students’ initial student models consist of information about the question-response pairs students experience during their interactions with Rimac. The types of questions students answer vary in how much support they provide the student. However, the biggest consistent differences in the amount of support provided are between the initial reflection questions, which are asked in the same way to all students, and the questions within the dialogues that follow up on the reflection question, which provide adaptive levels of support, as described in Section 3.2.5. We plan to test whether distinguishing between reflection questions and dialogue questions during model training and updating improves student model accuracy. Similarly, we plan to test whether including pretest questions in the training data influences the accuracy of the student model. Some pretest questions are similar to reflection questions, so we will investigate whether accuracy is better when test questions are categorized separately or grouped with reflection questions during training and updating of student models.
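
One way such a distinction could be encoded is shown below; the record layout, field names, and example values are hypothetical, not Rimac’s actual data format.

```python
# A hypothetical training record in which question type is an explicit
# feature, so reflection, dialogue, and pretest questions can be grouped or
# separated during model training and updating experiments.
from dataclasses import dataclass

@dataclass
class ResponseRecord:
    student_id: str
    kc_ids: list          # knowledge components targeted by the question
    question_type: str    # "reflection" | "dialogue" | "pretest"
    correct: bool

record = ResponseRecord(
    student_id="S042",
    kc_ids=["newtons_2nd_law", "net_force"],
    question_type="reflection",
    correct=True,
)
```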

We also plan to test whether filtering out turns with responses that are unrecognized by the system is better for prediction accuracy than leaving them in and treating them as incorrect responses. Currently, the system classifies unrecognized student input as incorrect. The rationale for this policy is that it would be more harmful to skip a dialogue that could be potentially helpful to a student than to repeat information that the student already knows. However, this policy negatively affects the accuracy of the student model because the model then updates on mislabeled data (i.e., correct responses that go unrecognized and are therefore classified as incorrect). Filtering unrecognized input would be done both when a model is initially trained and when it is updated in real time. That is, if we filter out responses that the system could not recognize during training of the model, then at run time we will not update the student model when a response cannot be recognized. The refined training data will be used to fine-tune other parameters of the modelling algorithm, such as the learning rate—how fast the model adapts or “learns” from new data—and to improve the responsiveness of the model with respect to dynamic updates. Related ITS research has shown that using different modelling parameters—which consequently result in different learning curves for different subpopulations of the student population (i.e., fast learners and slow learners)—can provide more accurate metrics for student learning (Chang et al. 2006; Doroudi and Brunskill 2019) and, overall, more accurate predictions (Chounta and Carvalho 2019).
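
A schematic sketch of this policy appears below. The attribute and method names (`recognized`, `kc_ids`, `model.predict`, `model.weights`) are assumptions made for illustration; the update rule is a simple gradient-style step, not Rimac’s algorithm.

```python
# Proposed policy, sketched under assumed names: drop unrecognized responses
# both when training the model and when updating it online, and expose the
# learning rate as a tunable parameter.
def filter_recognized(records):
    return [r for r in records if r.recognized]

def online_update(model, record, learning_rate=0.05):
    if not record.recognized:
        return model                      # skip rather than treat as incorrect
    error = int(record.correct) - model.predict(record)
    for kc in record.kc_ids:
        model.weights[kc] += learning_rate * error   # simple gradient-style step
    return model
```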

Finally, we plan to explore whether building separate student models for different phases of interaction improves prediction accuracy. For example, we could build one model to predict the correctness of answers to reflection questions and a separate one to predict the correctness of dialogue questions.

Planned Enhancements to System Features that Impact the Student Model’s Performance

Two aspects of a tutorial dialogue system that are external to the student model but nonetheless impact its performance are knowledge representation—how the knowledge components that the tutor tracks are structured—and natural-language recognition capability. We describe some of our plans to improve these capabilities within Rimac.

Knowledge Representation

The complexity of a student model—the number of predictive factors relative to the number of observations in the training data—can lead to overfitting. Complexity grows with the number of KCs that the dialogues address, and since Rimac tracks a large number of KCs (~260), this complexity likely contributed to overfitting.

One way to reduce student model complexity is therefore to reduce the number of KCs. Representing KCs as a knowledge hierarchy instead of as a flat list would support conceptual grouping of KCs. For example, all KCs that address Newton’s Second Law could be represented by one abstract “super KC”. We can reduce the number of KCs included as predictor variables in the model if we infer the higher-level “super KCs” from their child KCs instead of including them in the model as separate predictor variables. Alternatively, the system could maintain separate models for different groupings of KCs, as noted previously with respect to reflection questions and dialogue questions. Other possible groupings could be by topic (mechanics, electricity, thermodynamics, etc.) or by type of knowledge (procedural, conceptual, metacognitive, etc.).
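
The following is a hypothetical illustration of such a rollup; the hierarchy, KC names, and aggregation rule (a simple mean of the children’s estimates) are invented for the example.

```python
# Collapsing fine-grained KCs into a "super KC" via a KC hierarchy, so fewer
# predictor variables enter the model and the super KC's mastery can be
# inferred from its children. Names and aggregation rule are illustrative.
KC_HIERARCHY = {
    "newtons_2nd_law": ["n2l_definition", "net_force", "acceleration_direction"],
}

def super_kc_estimate(p_know_by_kc, super_kc, hierarchy=KC_HIERARCHY):
    children = hierarchy[super_kc]
    # One simple aggregation: the mean of the children's mastery estimates.
    return sum(p_know_by_kc[kc] for kc in children) / len(children)
```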

Natural-Language Recognition

Although the NLU approach that Rimac uses for short dialogue responses is highly accurate (Rosé et al. 2001 measured an accuracy of 96.3%), it depends on creating both semantic and syntactic representations of typical student responses based on the current dialogue context. However, students sometimes respond in atypical or unexpected ways that often indicate their affective state—for example, profanity that stems from frustration (Zapata-Rivera et al. 2018). If a student expresses himself in an atypical way only occasionally, the effect is unlikely to be noticeable or harmful. But if he frequently expresses himself in an atypical way, the NLU component will not recognize these responses, which could result in a less positive experience for that student (Dzikovska et al. 2010). When a student answers correctly but is not understood, the TDS lengthens the dialogue and covers material the student may already know.

We plan to test improvements to our approach to dealing with unrecognized responses. Rimac already incorporates an algorithm that uses restatement, with or without verification from the student, when the best recognition score for a student response is below threshold but still well above zero (Jordan et al. 2012). Currently, a match between a student response and the best candidate concept is categorized as high, medium, low, or unanticipated. If the input matches a candidate concept with a score of 0.8 or greater, the match quality is set to high; a score of at least 0.7 but below 0.8 is medium; a score of at least 0.6 but below 0.7 is low; and a score below 0.6 means the input was not understood. When the match quality for a student’s input is high, that input is treated as understood. When the match quality is medium, the system says, “I understood you to say <phrase that represents the concept matched>” (i.e., the system revoices what it “heard”). When the match quality is low, the system asks, “Are you saying <phrase that represents the concept matched>?” If the student says yes, the system accepts that match. If the student says no without attempting to restate, the system marks the input as unanticipated. If the student says no and answers again, the system attempts to understand the new response instead. We plan to test the effect of lowering the values that define high, medium, and low after adding a nonsense detector and an off-topic detector. During field trials, we will check students’ perception of the increased use of this strategy (e.g., Is it confusing when it happens? Does it happen too frequently?).
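
The match-quality policy just described can be transcribed compactly as follows; the function name is ours, and exposing the cutoffs as parameters reflects the planned experiments with lowered values.

```python
# A transcription of the match-quality policy described above, with the
# cutoffs exposed as parameters so that lowered values can be tested once
# nonsense and off-topic detectors are in place.
def match_quality(score, high=0.8, medium=0.7, low=0.6):
    if score >= high:
        return "high"          # treat the input as understood
    if score >= medium:
        return "medium"        # revoice: "I understood you to say <concept>"
    if score >= low:
        return "low"           # verify: "Are you saying <concept>?"
    return "unanticipated"     # not understood
```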

We also plan to test the performance of deep learning approaches for dialogue as a possible means of improving recognition. Although deep learning models for dialogue are showing promise (e.g., Gunasekara et al. 2019), to our knowledge, none so far have been trained using computer-human interaction data; they have been trained only using human-human interaction data. Thus, the effect of the noisiness of our computer-human training data (i.e., misrecognitions of student responses) on performance is unknown.

Related Work

The sample of tutorial dialogue systems described in this section illustrates various approaches to providing adaptive instruction. We chose six frequently cited TDSs that influenced subsequent TDS development, especially of approaches that attempt to track information about a student for more than a single turn in order to adapt the dialogue. [Over 30 dialogue-based ITSs have been developed (Paladines and Ramírez 2020).] As VanLehn (1988) noted, it is important to bear in mind that any intelligent tutoring system’s student model can best be thought of as representing the tutor’s perception of the student’s cognitive state, not the student’s actual cognitive state. We concentrate on the modelling of student cognition rather than affect, because cognition has been our focus in developing Rimac. To highlight Rimac’s contributions, we point out differences relative to Rimac’s approach to adaptivity during dialogue. Except as relevant, we do not describe the various natural-language understanding and generation techniques or the full pedagogical approaches that TDSs have used. Additional information about these and other topics related to student modelling in conversational tutors can be found in several review articles—for example, Pavlik et al. (2013), Brawner and Graesser (2014), Bimba et al. (2017) and Alkhatlan and Kalita (2018). (See also papers on the tutorial dialogue systems cited and discussed in this section.)

To inform dialogue adaptations, some TDSs include a dedicated student model as part of their system architecture, while others rely on localized diagnosis and classification of the representations they derive during language understanding to model what the student currently knows (i.e., they don’t track changes in student knowledge). The particular approach(es) to student modelling and adaptivity that TDSs implement depend on various factors, including the developers’ research goals and constraints, underlying theories of learning and instruction, and research on human tutoring. Most of the TDSs discussed in this section focus on understanding the student’s recent natural-language contributions in order to drive adaptivity, whereas Rimac has not yet done so because it encourages short-answer responses, which allow for better input recognition. However, encouraging short answers is not feasible in some learning contexts, and a mixture of response types is ultimately preferable.

CIRCSIM-Tutor (Evens and Michael 2006) is the first fully developed tutorial dialogue system, and the first to implement a student modelling module that tracks student performance during the dialogue in order to make micro-adaptive decisions. We focus on CIRCSIM-Tutor version 3 (Zhou and Evens 1999) because it best illustrates how the multi-level structure of its student model would support dynamic planning of adaptive dialogue. Unfortunately, this version of the student model was not included in any CIRCSIM-Tutor evaluations so its potential to support student learning remains untested.

CIRCSIM-Tutor adopts an “overlay and buggy modelling” approach. Students enter their predictions about the relationships among several parameters during three domain phases in a Prediction Table. The tutor compares students’ predictions with an expert’s predictions in order to identify correct and incorrect (buggy) predictions. Similarly, during dialogue, the tutor compares students’ responses with expert responses in order to diagnose and classify each response according to its degree of correctness (e.g., correct, near hit, near miss). Each tutorial dialogue remediates one incorrect (buggy) prediction in the Prediction Table by dynamically building a hierarchical dialogue plan.

The student modelling component constructs a four-tiered performance model for each student: a “local” assessment for domain concepts, an assessment for the three system phases that the student makes predictions about in each problem, an assessment for each problem, and a global assessment across problems. Scores at each level are propagated upwards to compute scores at higher levels. Each level of this student model can provide different types of information to planning rules that implement macro- and micro-adaptive decisions. For example, information at the local (concept) level can contribute to decisions about whether to ask a follow-up question about a particular concept. Information at the phase level can help select tutoring method(s) to achieve particular tutoring goals—for example, determine what type of hint to provide. Information at the problem level can be used to design a lesson plan, while information at the global assessment level can support the tutor in choosing the next problem. However, CIRCSIM-Tutor’s implementation, like Rimac’s, focuses on micro-adaptive decision rules. Rimac also uses overall performance on concepts (KCs) to guide dynamic decision making—in particular, to decide at each dialogue step which node in a KCD’s finite state network to traverse to. However, in order to provide the high degree of experimental control necessary to address our research questions, Rimac does not currently dynamically alter its dialogue scripts.
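
An illustrative sketch, not CIRCSIM-Tutor’s implementation, of how scores might be propagated upward through such a tiered model (concept, phase, problem, global); the data layout and the use of simple means are assumptions made for the example.

```python
# Upward propagation through a tiered performance model, sketched with
# invented structure: concept scores -> phase scores -> problem scores -> global.
def propagate(concept_scores):
    """concept_scores: {problem: {phase: {concept: score}}}, scores in [0, 1]."""
    phase = {p: {ph: sum(cs.values()) / len(cs) for ph, cs in phases.items()}
             for p, phases in concept_scores.items()}
    problem = {p: sum(phs.values()) / len(phs) for p, phs in phase.items()}
    global_score = sum(problem.values()) / len(problem)
    return phase, problem, global_score
```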

Unlike CIRCSIM-Tutor’s student model, Rimac’s base model is learned from prior student dialogues, and its updates are weighted using prior student data rather than the same weightings for every concept. Concept-specific weightings have the potential to adjust for concepts of varying difficulty, but evaluating the benefits of doing so remains future work.

EER-Tutor

EER-Tutor (Weerasinghe et al. 2009) consults a hierarchy of errors that students make in the Entity-Relationship domain in order to diagnose students’ solutions to database design tasks. EER-Tutor provides adaptive feedback in the form of scripted dialogues that are linked to each error category. It builds two constraint-based student models: a short-term model of satisfied and violated constraints in the current solution and a long-term student model that records constraint histories. These models represent students’ problem-solving performance, not their performance during dialogue. Since the system does not use dialogue-based evidence of students’ domain knowledge and reasoning skills to update the student models (e.g., the types of errors made and the level of prompting the student needed in order to correct these errors), the tutor cannot use this information to guide adaptive decision making.

A small set of “adaptation rules” customizes the scripted dialogues that address errors (constraint violations) in the student’s solution. These rules represent all three aspects of scaffolding contingency, although the system developers do not claim to have specified them with this intention. For example, one rule targets temporal contingency; it tells the tutor how long to wait before intervening when a student seems to be idling. Another rule targets domain contingency by controlling how to choose an error to address when there is more than one error in a student’s submitted solution. EER-Tutor makes this decision both reactively, based on its error diagnosis, and proactively. It chooses the error type that the student has the highest probability of making in future solution attempts. Most of the remaining rules support instructional contingency. For example, one rule determines whether to issue a problem-independent question before asking a contextualized, problem-specific question.

Although EER-Tutor’s adaptation rules are domain-neutral, the dialogues that instantiate these rules are scripted. Similarly, the authoring guidelines that Rimac’s dialogue developers consult to alter the level of support provided by questions, hints, etc. apply across quantitative problem-solving domains. However, Rimac’s rules differ from EER-Tutor’s rules in two main ways: they are more extensive and are grounded in scaffolding theory. The EER-Tutor, but not the adaptation rules per se, is based on constraint-based learning theory. A study that compared EER-Tutor with adaptive dialogues and EER-Tutor with non-adaptive dialogues indicated that a limited set of rules to guide adaptive tutoring is better than none (Weerasinghe et al. 2010).

Weerasinghe et al. (2009) discuss the potential benefits of updating the student models during dialogue in future versions of EER-Tutor. For example, recording which types of errors a student makes and how much prompting the student needed in order to correct an error could be used to assess the error’s associated constraint(s) and to determine, more broadly, if the student’s reasoning skills have improved. Rimac’s dynamic updating of the student model during its dialogues is one of the critical features that supports micro-adaptive tutoring. It also supports macro-adaptive tutoring. Dynamic assessment of knowledge components informs Rimac’s decisions about which reflective dialogues a student could profitably skip.

The Geometry Explanation Tutor (Aleven et al. 2001) also searches a hierarchical ontology that contains complete and buggy explanations against which it classifies a student’s contributions. The classification serves as a proxy model of the student’s ability to produce complete and accurate explanations. The tutor responds adaptively by issuing a scripted feedback message that is associated with its corresponding response category.

As the developers acknowledge, the tutor does not customize its feedback because it lacks the data and functionality necessary to do so, such as a dialogue history and dynamic planning. In terms of scaffolding theory, it implements domain contingency by providing feedback that addresses students’ errors but does not implement instructional contingency as do most TDSs, including Rimac. Nonetheless, a classroom study indicated that domain contingency alone can be beneficial. Students who used the dialogue version of the Geometry Explanation Tutor produced higher quality explanations than students who used the menu-driven version, although neither group outperformed the other with respect to problem-solving performance (Aleven et al. 2004).

Like all cognitive tutors, the Geometry Explanation Tutor and its predecessor (the PACT Geometry Tutor; Aleven et al. 1999) choose appropriate problems for students to work on. However, since information about students’ explanation performance during dialogue is not used to update the system’s student model, macro-adaptation is limited to selecting problems that develop students’ geometry skills, not their explanation skills.

Beetle-2

Beetle-2 (Dzikovska et al. 2014), unlike the TDSs above but like Rimac, lacks a model of students’ task performance to refer to. However, like EER-Tutor, it responds to errors that students make during problem-solving tasks. It dynamically analyzes the dialogue state across multiple turns by comparing the student’s explanations, and other input, with benchmark responses. This analysis produces a “diagnosis structure” that represents correctly mentioned objects and relations in students’ responses and missing, irrelevant, and contradictory parts. This “diagnosis structure” serves as a model of students’ understanding of domain concepts and relations. Beetle-2’s tutorial planner then uses this model to select a generic tutoring tactic and works with its natural-language dialogue generator to instantiate this tactic in a contextualized manner—for example, by mentioning objects from the student’s problem-solving work. Instructional contingency is achieved by dynamically choosing more directive tactics, as necessary, to address the selected error in the diagnosis structure during a remedial dialogue.

As Beetle-2’s developers acknowledge, one limitation is its lack of a persistent student model that could guide adaptive task selection. The order of problems in the curriculum and the order of questions within each dialogue are pre-specified. Although macro-adaptivity in Rimac is currently limited to deciding whether to skip a reflective dialogue, its persistent student model could be used to personalize problem and dialogue selection in a future version of the tutor. Also, Rimac adapts its tactics both reactively and proactively according to a student’s cumulative performance, which also requires a persistent student model.

An evaluation of Beetle-2 that compared it with a no-training control suggested that its curriculum is effective. However, no significant differences were found when a dialogue version was compared with one that simply gave the correct answer whenever the student erred (Dzikovska et al. 2014). The Guru (Olney et al. 2012) and iSTART (e.g., Allen et al. 2015; McCarthy et al. 2020; McNamara et al. 2007) tutoring systems are similar to Beetle-2 in that they focus on adapting student-tutor interactions according to a precise understanding of the student’s recent contributions, not on building a persistent student model as in Rimac.

AutoTutor

AutoTutor (e.g., Graesser 2016; Graesser et al. 2017a) supports Expectation and Misconception-Tailored (EMT) dialogues and aims to model novice tutors (e.g., a peer tutor). As with Beetle-2 and Rimac, there is no student model of problem solving to consider; in this case, it is because no separate problem solving precedes a dialogue. A separate EMT frame is pre-built for each problem that the system covers (a labor-intensive process). Its slots include anticipated expectations and their components and anticipated misconceptions. Misconceptions are addressed didactically through scripted explanations, whereas discussions about an expectation in an EMT frame can take place across one or several turns until each expectation is adequately covered. Students’ turns are analyzed using a speech act classifier (Olney et al. 2003) and latent semantic analysis (LSA) by comparing the student’s contributions to the expectations and misconceptions encoded in the EMT frame.

AutoTutor achieves domain contingency by consulting the EMT frame to choose an expectation to discuss next that is incomplete or missing in the student’s response. Graesser et al. (2004) state that a “processing module” manages this decision when there is more than one unfulfilled expectation (or expectation component) but they do not specify this module’s decision rules or criteria. The tutor’s default protocol is to elicit an expectation or one of its components by issuing a series of increasingly directive tutoring tactics, stopping when the expectation has been satisfied: first pump (e.g., “tell me more”), then hint, then prompt; if all else fails, assert (state the expectation). Instructional contingency can take place by varying the entry point into this default sequence. For example, an early version of AutoTutor (Graesser et al. 2003) used fuzzy production rules that considered factors such as dialogue history and student ability (e.g., based on pretest performance) to decide where to start. With a high ability student, the tutor might skip the pump and start with a hint. In contrast, Rimac updates information on student ability throughout the dialogue and this information persists across discussions with the student. Rimac achieves domain and instructional contingency by consulting its student model to decide which node to traverse to next and how to address the selected node’s associated content (i.e., how much support to provide), respectively. For example, a hint will provide more or less support depending on the student’s ability.
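
A schematic sketch of such an escalating elicitation protocol appears below. It is not AutoTutor’s code: the ability threshold, the entry-point rule, and the function names are invented to illustrate the idea of varying where a student enters the pump–hint–prompt–assert sequence.

```python
# An illustrative escalating elicitation protocol whose entry point depends
# on an ability estimate; constants and names are assumptions, not AutoTutor's.
TACTICS = ["pump", "hint", "prompt", "assert"]

def next_tactic(ability, attempts_so_far):
    # Higher-ability students skip the earliest, least directive tactics.
    start = 1 if ability >= 0.7 else 0          # e.g., skip the pump
    index = min(start + attempts_so_far, len(TACTICS) - 1)
    return TACTICS[index]
```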

Students using AutoTutor have shown greater learning gains than students using a variety of simpler alternatives to a TDS (e.g., reading a textbook) and have shown learning gains similar to those of students tutored by human experts, when domain coverage is similar (VanLehn et al. 2007). However, no comparisons have directly tested the effectiveness of its form of adaptivity against that implemented in other TDSs.

Graesser et al. (2017a) describe several possible improvements. One is that assessment information could be attached to expectations and misconceptions in order to assess the student’s performance on particular expectations (and misconceptions) or on the problem as a whole. For example, an expectation could be scored based on the amount of support that the student needed in order to meet it—that is, how far along the assistance protocol (pump, hint, etc.) scaffolding had to proceed. In addition, by mapping expectations to theoretically grounded knowledge components, students’ performance on these KCs could be tracked across problems, thereby producing a persistent student model.

DeepTutor

DeepTutor (Rus et al. 2013a; Rus et al. 2015) explores macro-adaptivity in tutorial dialogue systems and takes a similar approach to AutoTutor for providing micro-adaptive feedback. DeepTutor uses a framework called a Learning Progression Matrix to model students’ level of proficiency in each course topic and across a sequence of increasingly difficult topics that the course covers. A learning progression matrix is a hierarchically structured construct. A course consists of topics, which are addressed through a series of lessons. Each lesson includes a series of tasks (problems, dialogues, and other activities). A task is accomplished through a series of solution steps and each step can be facilitated through a series of tutoring tactics (hints, pumps, prompts, etc.). Adaptivity can be applied by implementing alternative instructional strategies at each level. To date DeepTutor has focused on implementing macro-adaptivity at the course level and micro-adaptivity within each dialogue task.

Learning progressions model students’ journey towards mastery in a particular domain: “learners go through a pathway of knowledge states, where each state has distinct patterns of understanding and misconceptions” (Rus et al. 2013b, p. 4). The goal of macro-adaptive tutoring is to design a customized learning trajectory for a particular student—a set of tasks that will advance the student to higher states within a particular course topic and across the topics included in the course’s curriculum. An initial learning trajectory can be defined based on a student’s pretest score and then dynamically adjusted based on a student’s performance on problems, short multiple-choice tests, and dialogue contributions.

DeepTutor selects dialogue tasks to accommodate the student’s current state for a topic, as recorded in the student’s learning progression matrix. Consequently, DeepTutor’s dialogues focus on the limited set of expectations and misconceptions that are associated with the student’s current state, not on the full suite of expectations that an AutoTutor dialogue typically addresses. This departure from AutoTutor’s approach has the potential to support more targeted, customized learning conversations. Micro-adaptive scaffolding within each dialogue takes place by choosing which expectation, expectation component, or misconception to address in the next tutor step (i.e., domain contingency) and the “best” tactic to use to address this step (i.e., instructional contingency). The first decision is made based on the tutor’s evaluation of the student’s responses during dialogue, while the second decision is guided by a complex set of tactical decision rules, such as the “complex mechanism” that controls hinting (Rus et al. 2014a, p. 316).

DeepTutor potentially offers more support for macro-level adaptivity than Rimac currently does in that it encodes a curriculum and tracks progress through the curriculum (reminiscent of CIRCSIM-Tutor’s levels). Macro-adaptivity in Rimac is currently limited to allowing a student to skip dialogues that address content that the student model indicates the student has sufficiently mastered. However, it remains for future work to determine how learning progressions could also support the type of proactive and reactive adaptations that Rimac implements. For example, would it be necessary to recognize that a student may be between the defined cell levels of the learning progression matrix?

A small-scale evaluation of an early version of DeepTutor (N = 30) that compared students who used a micro-adaptive version with students who used a macro- and micro-adaptive version found significantly higher pretest to posttest learning gains in the fully adaptive condition (Rus et al. 2014b). In this version, students’ proficiency levels were estimated based on overall pretest scores and no updates were made to proficiency levels during dialogue. In contrast, Rimac dynamically updates its student model during each dialogue.

Summary of Contributions

Few TDSs besides Rimac maintain a persistent student model—one that is dynamically maintained across tutoring tasks and sessions. A persistent student model is necessary to support the outer, macro-adaptive loop of VanLehn’s two-loop framework (VanLehn 2006). Otherwise, system developers must specify the order of tasks (problems or dialogues). In addition, most TDSs make micro-adaptive decisions reactively, in response to students’ performance. However, few TDSs besides Rimac also carry out proactive decision-making in order to challenge the student without overwhelming him, gently nudging the student towards mastery. Towards that end, Rimac performs micro-adaptive tutoring by predicting students’ success in answering a set of candidate “next step” questions, choosing one (domain contingency), and then determining how much help to provide and through which tutoring strategies (instructional contingency). Hence, Rimac stands apart from other TDSs by dynamically building and maintaining a persistent student model that supports reactive and proactive decision making, in order to emulate the contingent scaffolding of human tutoring (Katz et al. 2018). The classroom studies described previously (Section 4) indicate that this combined approach to decision making supports more efficient tutoring than does reactive decision making alone. They also indicate that IFM effectively implements this approach.
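
The sketch below is a schematic rendering of this combined reactive and proactive decision, not Rimac’s actual implementation; the selection criterion (preferring the question whose predicted success is closest to a target challenge level) and all names are assumptions made for illustration.

```python
# A schematic sketch of proactive question selection plus support selection:
# predict success on each candidate "next step" question, pick one (domain
# contingency), then choose a level of support (instructional contingency).
# `student_model.p_correct` and the target value are assumed for the example.
def choose_next_step(student_model, candidates, target=0.7):
    # Proactive: prefer the question whose predicted success is closest to a
    # desirable challenge level, stretching the student without overwhelming him.
    scored = [(abs(student_model.p_correct(q) - target), q) for q in candidates]
    _, question = min(scored, key=lambda pair: pair[0])
    p = student_model.p_correct(question)
    support = "minimal hint" if p >= target else "directive hint"
    return question, support
```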

Towards Integrating Tutorial Dialogue Systems in the Classroom

Despite strong evidence that adaptive tutorial dialogue systems support learning (e.g., Albacete et al. 2017a; Albacete et al. 2019; Dzikovska et al. 2014; Evens and Michael 2006; Forbes-Riley and Litman 2011; Graesser 2016; Graesser et al. 2017b; Weerasinghe et al. 2009; Weerasinghe et al. 2011), they have not yet become an integral part of classroom instruction. Addressing this disconnect between research and practice is especially important for blended learning classrooms, which combine online learning with traditional methods of instruction (lectures, textbook readings, problem-solving assignments, etc.). Many challenges remain to be met before TDSs see sustained use, including improved language understanding, improved student modelling performance, and identification of effective ways to sustain student engagement with TDSs and ITSs in general (e.g., Graesser et al. 2014; Jackson et al. 2012; Kopp et al. 2012). This section focuses on one limitation of TDSs that prevents their large-scale adoption: their lack of a companion open learner model (OLM). OLMs display students’ progress and have the potential to maintain student model accuracy and student engagement (Bull 2020; Bull and Kay 2016). Our choice of this focal point is motivated by Jim’s interest in using OLMs to perform similar roles in ITSs (e.g., Zapata-Rivera and Greer 2004a, 2004b). We show that dialogue data adds some wrinkles to the fabric of OLM design and discuss why incorporating some types of OLMs in TDSs may not be feasible long term, such as during an entire course.

Using OLMs to Display Student Progress

Considering several ITSs and adaptive hypermedia systems, Bull (2020) remarked that “there are open learner models [OLMs] about!” As mentioned, an open learner model allows students, teachers, and other education stakeholders (e.g., parents, peers, school administrators) to inspect the contents of a tutor’s student model in order to track students’ progress. An OLM also allows users to interact with and perhaps edit the student model in order to improve its accuracy. However, when it comes to TDSs, there are no “open learner models about”. To our knowledge, all TDSs lack a companion OLM, even though the reverse holds true: several OLMs include natural-language interaction with a chatbot—for example, STyLE-OLM (Dimitrova 2003; Dimitrova and Brna 2016), NLDtutor (Suleman et al. 2016) and CALMsystem (Kerly and Bull 2008; Kerly et al. 2008; Kerly et al. 2007). This section focuses on inspectable OLMs as a means of tracking student progress. Inspectable OLMs allow users to examine the student model and receive explanations for its assessment values but not change them. The next section addresses student model maintenance, whether automatic or carried out manually through OLMs that give users varying degrees of control over the model’s content.

Several studies have found benefits from allowing students to inspect and/or interact with their student model through an OLM. Germane to this festschrift, Jim Greer’s work with Diego Zapata-Rivera showed that providing various forms of guidance to students as they interact with an OLM can increase the frequency of students’ reflection on their work (e.g., Zapata-Rivera and Greer 2002, 2003; Zapata-Rivera and Greer 2004a). Reflection is an important metacognitive activity that can promote learning (e.g., Long and Aleven 2017). Other observed benefits of student interaction with OLMs include learning gains (Brusilovsky et al. 2015; Chen and Chan 2008; Duan et al. 2010; Girard 2011; Hsiao and Brusilovsky 2017; Long and Aleven 2017; Shahrour and Bull 2009), improved self-assessment accuracy among students who do not already demonstrate this skill (Kerly et al. 2008; Long and Aleven 2017; Suleman et al. 2016), confidence gains (Al-Shanfari et al. 2017), and heightened motivation to use the tutoring system (e.g., Bull and Pain 1995; Long and Aleven 2017; Thomson and Mitrovic 2010).

OLMs have also proven to be useful for teachers. In particular, “teacher dashboards” combine the behavioral data of learning analytics with the system-inferred assessment data of OLMs. They support teachers in planning adaptive instruction for individual students, small groups, or an entire class (e.g., Bull and McKay 2004; Girard and Johnson 2008; Grigoriadou et al. 2010; Pérez-Marín and Pascual-Nieto 2010; Riofrío-Luzcando et al. 2019; Xhakaj et al. 2017a, 2017b; Yacef 2005; Zapata-Rivera and Greer 2001; Zapata-Rivera et al. 2007). Teacher dashboards have also been developed to facilitate real-time classroom orchestration, a term used to describe how a teacher monitors and guides her classroom (e.g., Holstein et al. 2018a; Holstein et al. 2018b; Holstein et al. 2018c; Holstein et al. 2019). Recent studies indicate that teachers’ use of learning analytics dashboards to support classroom orchestration can promote learning gains (Holstein et al. 2017b; Holstein et al. 2018c). [See Bull 2020 for a thorough review of key findings on the benefits and limitations of OLMs for teachers, students, and other users.]

Several questions need to be addressed in order to develop OLMs for TDSs that could result in similar benefits. As with the design of any score report, the first critical step is to conduct an audience analysis, in order to find out what questions various users would bring to an OLM or teacher dashboard for a TDS (Zapata-Rivera and Katz 2014). We focus on teachers and expect considerable overlap between teachers’ questions and students’ questions (Zapata-Rivera 2019). Teachers will likely want to be able to inspect visualizations of a student’s performance on particular concepts, such as skill meters and bar graphs, as a student would when inspecting their personalized OLM. However, teachers will likely also want to view aggregate assessments for their class as a whole. For example, in order to plan a lesson for an upcoming class, a teacher might want to see average scores on selected knowledge components, a summary list of KCs with the lowest average scores, common misconceptions, etc. (Aleven et al. 2016b; Xhakaj et al. 2016). Teachers will also likely want to inspect behavioral data, such as: How many problems and/or dialogues did a student complete for a particular assignment? What was the average completion rate for a given class? How many reflection questions or questions asked during a particular dialogue did a student leave blank or respond to with gibberish, possibly indicating disengagement?

Other, less predictable questions that teachers might bring to a dashboard for a TDS could be identified through user-centered design sessions (e.g., Aleven et al. 2016b; Epp et al. 2019; Holstein et al. 2018a; Holstein et al. 2017a, 2019; Holstein et al. 2010; Xhakaj et al. 2016). For example, to what extent do teachers want to “drill down” in order to receive explanations about the reasoning behind the student model’s competency assessments? These explanations would present the evidence that the student model used to infer that a student has a good or poor understanding of a particular concept and how much weight each piece of evidence contributed to this inference. Explanations such as these are available in tutoring systems that incorporate an inspectable or interactive OLM, and in independent OLMs (i.e., OLMs that are not associated with a particular tutoring system) (e.g., Bull 2016; Bull and Kay 2016; Bull and Pain 1995; Dimitrova 2003; Dimitrova and Brna 2016; Ginon et al. 2016; Kerly et al. 2008; Suleman et al. 2016; Tchetagni et al. 2007).

Providing similar explanations in TDSs raises interesting design challenges. The main evidence that a TDS uses to infer students’ competency on knowledge components is dialogue data. Displaying tutor-student exchanges during dialogue (e.g., a tutor’s question followed by the student’s response) is considerably more complex than, say, showing a list of quiz scores or test items that a student missed. For example, while the tutor explains a low score on a concept, how many incorrectly answered questions that target that concept can the OLM display without causing clutter and information overload? How can a dialogue exchange be sufficiently contextualized so that it makes sense? Would it suffice to show one exchange back? Or should contextualization of dialogue-based evidence be personalized—for example, by allowing teachers (and other users) to ask to “see more” or “see less”, as necessary? Some OLMs visualize student progress over time (e.g., Ferreira et al. 2017; Rueda et al. 2007; Rueda et al. 2003). Accommodating teachers’ potential desire to track student progress in understanding key concepts across a series of dialogues will likely raise even more challenging design issues.

As with any ITS, the student model in a TDS will have more confidence in some of its assessments than others. In addition to presenting dialogue-based evidence and explaining how competency ratings were calculated, an OLM for a TDS could represent model (un)certainty by adapting its display in various ways (Epp and Bull 2015). However, it is important to test carefully to ensure that teachers can interpret uncertainty representations correctly. For example, Zapata-Rivera et al. (2016) found that teachers tend to have difficulty interpreting error bars in score reports, although a brief video tutorial sufficed to resolve this problem.

Using OLMs to Maintain Student Model Accuracy

Students’ cognitive state is a moving target, constantly changing in response to their interactions with a tutoring system and various external learning resources—for example, other online materials, teachers, peers, family, friends, and textbooks. As with any ITS, a TDS’s student model will quickly become outdated without some means in place to update it. Student model maintenance can take place automatically and/or manually through user interaction with an OLM (Bull 2016, 2020). For example, the Next-TELL OLM automatically integrates assessment data that is transferred from multiple sources through its application program interface (API) (e.g., Bull et al. 2013; Bull and Wasson 2016). If a teacher assigns weights to various data resources, as in EI-OSM (Zapata-Rivera et al. 2007), this can help students understand discrepancies between the OLM’s representations and their expectations. However, a drawback of integrating multiple data sources in a student model (and OLM) is that this can increase student model inaccuracy, due to varying formats, degrees of reliability, specificity, etc. As Bull (2016) noted, “With the variety of learning data now available, we have to accept that there is greater room for imprecision in the learner model” (p. 19).
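
As a hypothetical sketch of such weighted integration, in the spirit of teacher-assigned source weights in EI-OSM (the source names, weights, and aggregation rule below are invented):

```python
# A hypothetical weighted combination of evidence from multiple sources into a
# single competency estimate for one KC; weights reflect each source's
# credibility as a teacher might assign them.
SOURCE_WEIGHTS = {"tds_dialogue": 1.0, "quiz": 0.8, "homework": 0.5}

def combined_estimate(evidence):
    """evidence: list of (source, score-in-[0,1]) pairs for one KC."""
    total = sum(SOURCE_WEIGHTS[src] for src, _ in evidence)
    return sum(SOURCE_WEIGHTS[src] * score for src, score in evidence) / total
```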

A student model can become outdated and inaccurate for several other reasons, including the system’s inability to automatically import data from online and standalone learning systems; learning that takes place offline, through instructional activities such as reading a textbook and interacting with teachers, peers, etc.; and “noise” from student guesses and memory lapses (slips). In the context of TDSs, inaccuracy also stems from the system’s inability to understand some of the student’s dialogue input. Consequently, there is a strong need for teachers and students to be able to interact with a (future) TDS’s OLM in order to manually maintain its student model.

As described in the SMILI☺ framework (Bull 2020; Bull and Kay 2016), the various types of OLMs referred to in the literature differ mainly according to how much control they allow students to exert over the student model’s content. At one end of the spectrum stand inspectable-only OLMs, which give the system full control over model updating; students can only examine representations of the model’s competency ratings and perhaps explanations of the reasoning that underlies those ratings. At the other end stand editable OLMs, which allow students to change the model directly until new evidence comes along that may override the student’s self-assessment. Editable OLMs risk introducing more inaccuracy due to students’ erroneous ratings, while inspectable-only OLMs prevent obvious errors in the model from being corrected—for example, a momentary slip or, in the context of a (future) OLM for a TDS, unrecognized student dialogue input.

For these and other reasons, various alternative approaches to interactive student model maintenance stand between these two goal posts (inspectable-only and editable OLMs) (Bull 2020), for example: OLMs that allow students to add information to their model but not replace its content (learner adds information OLMs); OLMs that allow users to challenge the model and prove that they know more (or less) than the system thinks they do, typically by answering a challenge question (persuadable OLMs); and OLMs that encourage negotiation, allowing the student and the system to justify their position until they either reach agreement or must defer to a policy, typically set by teachers, about what to do if agreement can’t be reached (negotiable OLMs). In VisMod, Zapata-Rivera and Greer (2004b) gave teachers similar status as the system. Teachers could veto students’ proposed changes to the student model (Bull 2020).

Several exploratory studies suggest that interactive model maintenance through an OLM does indeed improve the model’s consistency and accuracy, provided that the system maintains some degree of control (e.g., Bull and Pain 1995; Dimitrova 2003; Dimitrova et al. 2001; Suleman et al. 2016). Other potential benefits include increased student agency over learning (Bull and Kay 2007, 2016), metacognitive behaviors such as planning and self-monitoring (Bull and Kay 2013), learning gains (Kerly and Bull 2008), and increased motivation when students perceive their interaction with an OLM as a break from the system’s main task (Thomson and Mitrovic 2010). However, as Bull (2020) stated, an important goal for future research is to determine which types of OLMs are best suited to different types of students, at different stages of learning, and with different types of tasks (p. 19). Research addressing this issue indicates that we cannot always trust our intuitions about which mappings are best—for example, that mature students will be able to edit their student model appropriately: Britland (2010) found that university students were unable to detect and correct errors that were planted in an editable OLM.

The importance of considering the nature of a tutoring system’s tasks and activities is especially relevant to OLM design for tutorial dialogue systems. The last thing that a student who has just completed a lengthy automated dialogue probably wants to do is convince the tutor to alter their student model, especially if this would entail a detailed, Toulmin style negotiation activity (e.g., Van Labeke et al. 2007; Zapata-Rivera et al. 2007) or yet another dialogue (e.g., Kerly et al. 2008; Suleman et al. 2016). Consequently, another challenge for future research in OLM design for TDSs is to discover how to reap the benefits of interactive model maintenance—in particular, negotiable OLMs—without compromising student motivation.

Another challenge that has received little attention to date is to find ways to reduce the teacher workload required for student model and OLM development. This challenge pertains to all types of OLMs (independent, inspectable-only, or interactive), and to automated and manual student model maintenance. Teachers who have participated in OLM research and development have been “assigned” a multitude of tasks such as defining conditional probabilities (Zapata-Rivera and Greer 2004b); specifying weights for evidence based on credibility, relevance, and other factors (Zapata-Rivera et al. 2007); specifying reasons why a value should be changed so that students can choose these reasons from a menu, and setting thresholds to determine when evidence is sufficient to change assessment values (Ginon et al. 2016); engaging in negotiation discussions with individual students about their student model (Zapata-Rivera and Greer 2004b); defining performance expectations at various points during a course so that students can determine if they are on track (Bull and Mabbott 2006); including supplemental feedback (Bull et al. 2015); associating tasks and test items with knowledge components, etc. It may be feasible for teachers to perform these roles short term, as teachers have done during brief usability studies such as those cited in the preceding paragraph. However, we are skeptical that teachers could sustain these roles throughout a course—not only in TDSs, but in ITSs in general. For example, Zapata-Rivera et al. (2007) reported that teachers found EI-OSM useful, but teachers also cautioned that they have limited time to calibrate the system’s evidence parameters. As the authors eloquently stated, “Teachers wanted control over a system that can perform most tasks autonomously” (p. 298).

Feasibility of teacher involvement with student model and OLM development will likely depend on several contextual factors such as class size and course load. It is hard to envision teachers in large schools, with large classes, interacting with individual students about their OLM, approving (or vetoing) students’ requested changes to the system’s assessment values, mapping test items to KCs for every quiz and assignment included in a learning management system so that these data can be integrated in a tutor’s student model and OLM, and so on. Hence, important tasks to add to the research agenda include specifying the factors that constrain the feasibility of sustained use of ITSs and finding ways to automate or semi-automate as many teacher roles as possible. For example, perhaps a tutoring system could use large datasets to automatically learn or recalibrate evidence weightings.

Last but certainly not least, an important goal for future research is to determine whether integrating OLMs in TDSs yields benefits similar to those found when combining these technologies in other types of adaptive tutoring systems, such as learning gains, improved self-assessment ability, and increased motivation to use the tutor. Table 7 summarizes the research and design questions raised in this section, to set a roadmap for future work.

Table 7 An initial research agenda to inform the development of OLMs for TDSs

Conclusion

Long before computers that could help people learn put a glint in educators’ eyes, a wise man observed that “Necessity knows no law except to conquer.” Whether or not Jim was familiar with this maxim, he argued for it, in effect, when he urged tutoring system developers to cast aside commonplace rules like “favor simplicity over complexity” and to instead take on the complex, difficult task of student modelling in order to meet a greater necessity: making these systems adaptive to the learner. Many effective problem-solving ITSs that incorporate a student model support his position.

In this paper, we argued that it is time for more developers of tutorial dialogue systems to follow suit and incorporate a student modelling component that can guide the tutor in providing adaptive support during complex conceptual discussions. We described how we do this in Rimac and summarized studies that found that: (1) an experimental version of Rimac that links a “poor man’s student model” (i.e., a static, pretest-initialized student model) with automated dialogue promotes more efficient learning than a control version that lacks a student model, among high prior knowledge students, and (2) a dynamically updated student model promotes more efficient learning than the static, “poor man’s student model”, among both high and low prior knowledge students. Although we have not yet observed an advantage of student model driven dialogue with respect to learning gains, we expect that this will follow from improving the performance of the student model, including more content to support long-term use of the tutor, and incorporating OLMs that provide tools to allow students and teachers to interact with and maintain the student model. Future generations of students who learn more effectively and efficiently from student model endowed dialogue tutors will unwittingly have Jim to thank.