
1 Introduction and Background

Tutoring on a one-on-one basis from an expert human tutor who is also a subject matter expert represents the ideal arrangement for learning – improving student outcomes by between one and two standard deviations [8]. However, this is not feasible for the vast majority of instruction. One way to attempt to attain similar performance is through the use of Intelligent Tutoring Systems (ITS) – computer systems which can take expert-created content and tutor it with built-in instructional expertise. Systems such as the Generalized Intelligent Framework for Tutoring (GIFT) allow for the creation and configuration of this type of tutoring system, marrying content from the expert and instruction from a configured system [7].

Human tutors, as opposed to computer tutors, are not statically defined and unchanging – they learn over time. They are able to select key content that focuses on desired learning objectives, and to improve their selections over time based on observed effectiveness [1] – they do what they observe to work. ITS systems should mimic this functionality by tracking which content sequences teach effectively and improving content selection and ordering over time. Further, this feedback should be presented in after action review – immediate feedback on student actions, delivered as soon as the student takes them.

The instructional literature indicates that after action review feedback should be focused on a relatively small set of immediate learning goals. A brief review of the literature indicates that after-scenario feedback should be:

  • Focused - feedback about and at the level of the task step

  • Corrective - concerning how to perform specific tasks and steps better (not just feedback regarding accuracy)

  • Limited - identify the performance failures with greatest impact and concentrate feedback on them

  • Mastery-Focused - feedback should represent deep knowledge or concepts required to master the domain

This short list of feedback items is tied both to Ericsson’s theory of deliberate practice [3] and to Shute’s formative feedback guide [6]. The overall literature presents a picture of beneficial feedback as short but focused and immediately following an event, at least for the novice learners who comprise the bulk of students. In the design of adaptive intelligent tutoring systems, the above features should be taken into account, but the question becomes how to do so automatically.

One of the primary problems with selecting and improving content over time is that the data is typically sparse; no two learners are alike, and class sizes are typically fewer than 100 people. Learners can be clustered into categories, but experience indicates that there are typically fewer than 5 categories for a desired metric [5], with additional metrics resulting in a larger number of categories and sparser data. Further, the content can, and usually should, change after each class, creating a “moving target” problem for models. The typical reinforcement learning solution, in which models are learned once and not updated, is not appropriate in an arena of both changing content and changing student populations – what the machine learning literature calls a “fluid fitness landscape.”

One approach to this problem involves the creation of artificial students based on the observed population. These artificial students are drawn from a bell-curve distribution of deviations from the observed learners, together with predictions of how they would experience the content. While the simulated student population does not represent the total population, the approximation is sufficient to generate data, and that data in turn is sufficient to learn instructional policies that can be put into practice. This is especially relevant in relatively sparse selection domains, where the instructional policy chooses among relatively few pieces of content (e.g. 7 content objects mapping to 3 learning objectives).
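As a rough illustration of this idea, the sketch below draws a hypothetical population of simulated students from a bell curve fitted to observed learner abilities and estimates how they would fare on a puzzle. The function names, the logistic success model, and all numeric values are illustrative assumptions, not the system's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_population(observed_abilities, n_students=1000):
    """Draw simulated students from a bell curve fitted to the observed class.

    observed_abilities: ability estimates from real learners (placeholder values below).
    """
    mu, sigma = np.mean(observed_abilities), np.std(observed_abilities)
    return rng.normal(mu, sigma, size=n_students)

def p_solve(ability, difficulty):
    """Toy model: probability a simulated student solves a puzzle of the given
    difficulty, logistic in (ability - difficulty)."""
    return 1.0 / (1.0 + np.exp(-(ability - difficulty)))

# Hypothetical example: expected solve rate for a puzzle of difficulty 0.4
observed = np.array([0.2, 0.5, -0.1, 0.8, 0.3])   # placeholder ability estimates
students = simulate_population(observed)
print("Expected solve rate:", p_solve(students, 0.4).mean())
```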

This paper describes a “closed loop” system for testing the above design: creating a population of simulated students, approximating the actions that they would take, and using those actions to develop a policy for remedial content selection. Further, this paper describes a study which uses this technological approach to compare the effect of the developed policies on learners, and reports on the findings.

2 System Design

This system design consists of a few key pieces of experimental apparatus. It is fundamentally built on the architecture of the Generalized Intelligent Framework for Tutoring (GIFT) program – a system of interchangeable software modules for building intelligent tutoring systems. The GIFT modules are the Domain Module, the Learner Module, the Pedagogical Module, and the simulation Gateway. Off-the-shelf versions of these modules were used and configured for the domain of interest. The simulation Gateway was configured to interoperate with the learning environment; the Domain Module was configured with a repository of puzzles and per-puzzle feedback. The Pedagogical Module was configured to operate based on an adaptive learning policy, with updates to the Domain Module to select the most appropriate feedback. The adaptive learning policy modifications represent the new technical capability presented in this paper, and are described in greater detail in Sect. 2.2.

2.1 Learning Environment – Physics Playground

Newton’s Playground is a “serious game” – a game designed to teach as well as entertain [9]. While playing the game, learners draw different types of physics objects, such as a lever, weight, or structural beam. Each of these objects is then animated according to the basic laws of physics and interacts with the rest of the environment – falling, swinging, lifting, etc. as appropriate.

The goal of each level is to get a red ball from one point on the screen to another using a system of drawn objects. An example puzzle may intend to teach the concept of pendulum physics by having a learner draw an anchor, string, and weight in order to have the pendulum collide with the ball and launch it to the appropriate part of the screen.

This work uses a prototype version of Newton’s Playground built for GIFT, where puzzles are instrumented with measures including the completion time for each puzzle. Nine puzzles based on three physics concepts (Impulse, Conservation of Momentum, Conservation of Energy) were constructed using this technology. A sample image of interactions within the environment is shown below in Fig. 1.

Fig. 1. Physics Playground environment sample interaction; drawing a pendulum to hit the ball uphill to the balloon. (Color figure online)

Fig. 2. Learning algorithm implemented into the Domain Module, informed by offline EDM tool processes.

2.2 Adaptive After Action Review (A-AAR)

The basis of the A-AAR is a policy created by the Educational Data Mining (EDM) Tool. The EDM Tool takes learner performance data as input and outputs a training model. The input need only conform to a minimal ontology consisting of a learner ID, a training item, and at least one measure, and at least one of these measures must be specified as a “training goal.” An example of a “measure” is a GIFT report of At/Below Expectation; an example of a system training objective is to guide learners towards “At Expectation” on all associated concepts. Table 1 shows the first few columns of input to the data mining tool from data acquired for this project. The EDM Tool augments the data by inferring latent variables: for each row of data, it infers a Learner Competency Level (associated with User #123456) and a Scenario Difficulty Level (associated with the Playground and Puzzle associated with the scenario). These values are not directly observed, but can be inferred computationally, which the EDM Tool does using Gibbs sampling [4] (represented in the last three columns). The EDM Tool models the process as a Partially Observable Markov Decision Process (POMDP) which serves to maximize the reward (learning) over the observed states, actions, state transitions, possible observations, and probabilistic models of the phenomenon. For replication purposes, the exact parameter settings and details of the modeling algorithm can be found in prior work [2]. It is sufficient to note that the pedagogical policy is created by an EDM Tool which infers problem difficulties from learning objective measures associated with the collected data (Table 2).

Table 1. Input to EDM tool
Table 2. Output of EDM tool (input to pedagogical policy); sample. Below/At/Above Expectation values are encoded as 0/1/2; “Next” is the next node recommended by the policy.
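To make the minimal ontology concrete, the sketch below shows one way an input row for the EDM Tool could be represented. The field names and example values are assumptions for illustration, patterned on the description above rather than copied from Table 1.

```python
from dataclasses import dataclass

@dataclass
class EDMRecord:
    """One row of input to the EDM Tool: the minimal ontology is a learner ID,
    a training item, and at least one measure, one of which is the training goal."""
    learner_id: str          # e.g. "123456"
    training_item: str       # puzzle/scenario name (hypothetical)
    measure: str             # e.g. "AtExpectation" / "BelowExpectation"
    is_training_goal: bool   # whether this measure is the designated training goal

# Hypothetical example row
row = EDMRecord(learner_id="123456",
                training_item="Impulse 1",
                measure="BelowExpectation",
                is_training_goal=True)
```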

2.3 Runtime Integration with GIFT

Previous work sought to integrate Physics Playground with GIFT through traditional GIFT-integration approaches [9]. This combined system is referred to in the literature as “Newtonian Talk”, and the development of its system of hints and feedback is beyond the scope of the current work. In this work, the physics environment was used “off the shelf”, with built-in measures of student assessment and feedback based upon that assessment. The EDM Tool and policy updates are applied on top of this existing experience.

The EDM Tool produces policy updates which serve to update the model of domain content. This model has two functions: first, to provide customized feedback, chosen from among the pre-authored feedback based on the policy’s evidence of which feedback is effective; and second, to provide a recommended puzzle to complete at the next stage. These are presented to the user in the manner of Fig. 3, in the top left and lower right, respectively. In the upper right of Fig. 3, users classified as “novice” can play a video of an expert solving the puzzle (presented via an in-window link to YouTube), and users classified as “expert” can play a video of their own performance. The feedback screens are integrated into GIFT as shown in Fig. 2, with a customized version of GIFT capable of accepting the policy inputs; this version of GIFT is available upon request from the lead author.
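A minimal sketch of how such a policy might be consumed at runtime is shown below. The dictionary-based policy format, the state labels, and the feedback identifiers are illustrative assumptions, not GIFT's actual data structures.

```python
# Hypothetical policy table produced offline by the EDM Tool:
# learner state -> (pre-authored feedback id, recommended next puzzle).
# All entries below are invented for illustration.
policy = {
    "BelowExpectation": ("feedback_step_hint", "Impulse 1"),
    "AtExpectation":    ("feedback_strategy",  "Momentum 1"),
    "AboveExpectation": ("feedback_challenge", "Energy 2"),
}

def select_remediation(learner_state: str):
    """Return (feedback id, next puzzle) for the learner's assessed state."""
    return policy.get(learner_state, ("feedback_default", "Tutorial"))

feedback_id, next_puzzle = select_remediation("BelowExpectation")
print(feedback_id, next_puzzle)
```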

Fig. 3. After-Action Review screen presented to the user.

3 Studies and Results

Three different studies were conducted as a part of this project. The first study was conducted to gather initial data from which to learn adaptive tutoring policies, and to serve as a non-adaptive control condition for future work; in short, without input data, there is nothing for the EDM Tool to generate. The second study used simulated students, with adaptive policy components driving their experiences and simulated learning outcomes; its goal was to determine whether the developed policies would have a measurable effect on learning. The third study implemented the developed policies with real individuals, testing the various policies for effectiveness.

3.1 Study 1 - Human Subjects Control, Non-adaptive Policy Baseline

Data was collected on 42 participants, each running through an introduction to Newton’s Playground lesson followed by 9 puzzles in Newton’s Playground, presented in random order. After the data from each puzzle was recorded, a random policy selected the next recommended lesson, and the recommendation was displayed to the participant. The 9 puzzles consisted of three puzzles on Energy, three on Momentum, and three on Impulse. Completion time for each puzzle was recorded. Puzzles took on average about a minute to solve, depending on the puzzle (89, 67, and 82 s respectively for the control conditions of the three puzzles). We found that participants who did not solve a puzzle within that time were likely to take much longer to solve the puzzle altogether (floundering behavior). The system timed out after 5 min. To avoid contaminating mean results with these outliers, we thresholded performance at 120 s for data analysis, assigning a 120 s completion time to any participant who took longer. As a sanity check on this thresholding process, we computed median times: for the control conditions they were 99, 64, and 78 s for the three puzzles, and for the experimental conditions 63, 25, and 33 s respectively; the difference between conditions was thus larger for the median than for the mean. Data was stored in the GIFT Learner Management System for later analysis. More complete results from this control condition are presented for comparison in Sect. 3.3.
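The thresholding and sanity check described above amount to a simple transformation of the raw completion times; a minimal sketch with invented sample data follows.

```python
import numpy as np

CAP_SECONDS = 120   # analysis threshold; the system itself timed out at 300 s

def summarize_times(raw_times_s):
    """Cap completion times at 120 s and report the mean and median.

    raw_times_s: completion times in seconds (placeholder values below).
    """
    capped = np.minimum(np.asarray(raw_times_s, dtype=float), CAP_SECONDS)
    return capped.mean(), np.median(capped)

# Hypothetical completion times for one puzzle
mean_t, median_t = summarize_times([45, 67, 130, 300, 88, 52])
print(f"mean={mean_t:.1f}s median={median_t:.1f}s")
```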

3.2 Study 2 - Simulated Student Models, Simulated Outcomes

The parameters for the model discussed in the previous sections were found by running the EDM module across the initial wave of students in Study 1. Two types of parameters were learned from the recorded results. The first type consisted of parameters identified through metadata in Newton’s Playground: the states (which corresponded to the names of the measures recorded in the LMS), the actions (9 actions, 1 for each puzzle), and the observations (students were recorded as Pass/Fail on the puzzles, with Pass decomposed into AboveExpectation and AtExpectation). The second type consisted of parameters approximated through Gibbs sampling of the data [4]: the transition probabilities and the observation probabilities for the underlying POMDP model (in other words, the probability that taking each puzzle advances student capability, and the probability of observing a given completion time given student capability, respectively). The observation function was found by fitting the item difficulty parameter to the results, based in turn on solving for the student state after each puzzle.
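The shape of the learned model can be sketched as follows; the state names, array shapes, and probability values are placeholders, since the actual parameters were estimated by Gibbs sampling over the Study 1 data [2, 4].

```python
import numpy as np

# Hypothetical POMDP components (real values come from the EDM Tool).
states = ["Novice", "Intermediate", "Expert"]           # latent learner competency
actions = [f"Puzzle {i}" for i in range(1, 10)]         # 9 actions, one per puzzle
observations = ["Fail", "AtExpectation", "AboveExpectation"]

n_s, n_a, n_o = len(states), len(actions), len(observations)

# T[a, s, s']: probability that taking puzzle a moves a student from state s to s'
T = np.full((n_a, n_s, n_s), 1.0 / n_s)    # placeholder uniform transitions

# O[a, s', o]: probability of each observation after puzzle a, in new state s'
O = np.full((n_a, n_s, n_o), 1.0 / n_o)    # placeholder uniform observations

# R[a, s]: reward (learning gain proxy) for taking puzzle a in state s
R = np.zeros((n_a, n_s))
```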

After the model parameters were determined, a POMDP policy was generated that mapped each state to an action. We simulated 10,000 students to validate the policy. The simulation included 10 steps; before the first step, student ability was sampled from a start distribution. Each simulated student iteratively took a puzzle (selected from a tutorial or the 9 available puzzles), an observation was received, and the policy then selected the next puzzle. Figure 4 shows an aggregate comparison (averaged over the 10,000 students) of this policy to a non-adaptive policy which selected random puzzles.
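A simplified version of this validation loop might look like the sketch below, which reuses the placeholder T and O arrays from the previous sketch. The belief update and the random baseline are written from the description above and are not the project's actual simulation code.

```python
import numpy as np

rng = np.random.default_rng(1)

def simulate(policy_fn, T, O, start_dist, n_students=10000, n_steps=10):
    """Roll out simulated students: sample ability from the start distribution,
    then alternate puzzle selection, latent state transition, and observation."""
    final_states = np.empty(n_students, dtype=int)
    for i in range(n_students):
        state = rng.choice(len(start_dist), p=start_dist)
        belief = np.asarray(start_dist, dtype=float)
        for _ in range(n_steps):
            a = policy_fn(belief)                           # pick the next puzzle
            state = rng.choice(T.shape[2], p=T[a, state])   # latent transition
            obs = rng.choice(O.shape[2], p=O[a, state])     # noisy observation
            belief = O[a, :, obs] * (T[a].T @ belief)       # Bayesian belief update
            belief /= belief.sum()
        final_states[i] = state
    return final_states

def random_policy(belief):
    """Non-adaptive baseline: pick one of the 9 puzzles at random."""
    return rng.integers(0, 9)

# Example usage with the placeholder arrays from the previous sketch:
# results = simulate(random_policy, T, O, start_dist=[0.6, 0.3, 0.1])
```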

Fig. 4. Simulated student results of differing policies. Greater information on the construction of simulated students is available in the following work [2].

3.3 Study 3 - Live Students, Real Outcomes

After the initial data collection, a new study was run using the same pool of 9 Newton’s Playground puzzles used in the control condition and in the simulated study. This study made the following important changes to the overall protocol: use of GIFT Cloud, a live system on Amazon Mechanical Turk, to run the subjects; inclusion of Adaptive AAR screens between puzzles (corresponding to the AAR described in previous sections); and selection of lessons using an adaptive policy rather than at random. Data from the previous data collection was used as a control condition. To provide adequate differentiation between the puzzles selected in the adaptive and control conditions, three puzzles were designated as “test” puzzles (“Impulse 3”, “Momentum 2”, and “Energy 3”, in that order), and the remaining six puzzles comprised the pool of “training” puzzles for adaptation. Of these, four were selected for training, in an adaptive sequence determined by the policy. In summary, each participant in the adaptive condition encountered four puzzles, selected by the adaptive policy, and was then tested on the three further puzzles that were withheld from the training pool (Fig. 5).

Fig. 5. Control (above) and Experimental (below) conditions. All puzzles have feedback. After 4 puzzles, the experimental condition included a test phase, which is compared against individuals in the control condition who experienced an equivalent number of puzzles.

These individuals are compared against “fair” individuals from the random policies of Study 1. A “fair” individual is one who had encountered at least the same number of puzzles (4, 5, or 6 for the three respective test puzzles, not including the tutorial) prior to encountering the test puzzle. Correspondingly, the number of fair control individuals shrinks as criteria are added. These results are presented in Table 3, with statistical significance marked by an asterisk.
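The “fair” comparison filter can be expressed as a simple predicate over the control logs; the record format below is an assumption for illustration, not the study's actual data layout.

```python
# Hypothetical control-condition log: (participant_id, puzzles played before the
# test puzzle, excluding the tutorial). Values are invented.
control_log = [("p01", 4), ("p02", 6), ("p03", 3), ("p04", 5)]

def fair_controls(log, min_prior_puzzles):
    """Keep control participants who saw at least as many puzzles as the
    adaptive group before the given test puzzle."""
    return [pid for pid, n in log if n >= min_prior_puzzles]

# The three test puzzles require at least 4, 5, and 6 prior puzzles respectively
for test_idx, required in enumerate([4, 5, 6], start=1):
    print(f"Test puzzle {test_idx}: fair controls = {fair_controls(control_log, required)}")
```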

Table 3. Completion times of puzzles within the Control (Study 1) and Adaptive (Study 3) conditions, using policies generated with simulated students (Study #2).

4 Discussion

For ITSs to be effective, they should closely mirror the functions of the human tutors that they are modeled after. This includes continuous analysis of the effectiveness of their actions and the modification of instructional policies accordingly. This paper presents work performed within the context of GIFT, an intelligent tutoring system with interchangeable parts. A new ‘part’ was designed, developed, prototyped, and tested on live students. This part consisted of an adaptive policy, constructed on top of an existing model of the domain of instruction, which also provided a review/remediation screen.

The adaptive policy was developed from a relatively small sample of real students (<30, on average), but through a relatively large sample of simulated students (1000, exactly). Naturally, a reasonable reader would ask whether a policy developed upon simulated learners would be applicable to live learners, and a follow-up study was run on such learners. This was done with a relatively small sample size, but relatively large effects were observed with statistical significance. This comparison somewhat represents the effect of the total intelligent tutoring system, as a policy-driven ordering of content and a policy-driven AAR page were both introduced. The (large) effect size of intelligent tutoring systems is typically observed to be within the range reported above, and only small samples are required to find statistically significant large effects.

The large effect size observed is due to two factors. The first is that the control comparison is against a policy of random actions, but among the same 9 puzzles; although the control policy is ‘random’, it does give feedback relative to the learners’ errors, so the observed improvements are in addition to the basic improvements from error-sensitive feedback. The second factor is that the distribution of times to solve is bimodal, reflected in relatively high variance values – learners either solved the puzzles within 120 s or were considered to have failed and were given feedback/remediation. This bimodal, but clipped, distribution is a fair comparison for real-world application, as the alternative is to allow students unlimited time to flounder and then compare the total floundering times in the control and experimental groups. The differences were significant on a one-tailed (we believe the policies should help) test of differences assuming unequal variances.
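For reference, a one-tailed test of differences assuming unequal variances (Welch's t-test) can be run as in the sketch below; the completion-time arrays are invented placeholders, not the study data, and the alternative argument requires SciPy 1.6 or newer.

```python
import numpy as np
from scipy import stats

# Placeholder completion times in seconds, capped at 120 s as in the analysis
control_times  = np.array([99, 120, 64, 120, 78, 110, 95], dtype=float)
adaptive_times = np.array([63, 25, 33, 70, 41, 55], dtype=float)

# One-tailed Welch's t-test: hypothesis is that adaptive times are lower
t_stat, p_value = stats.ttest_ind(adaptive_times, control_times,
                                  equal_var=False, alternative="less")
print(f"t = {t_stat:.2f}, one-tailed p = {p_value:.4f}")
```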

5 Future Work

The models presented in this work are portable, since they were made for a system built on the idea of interchangeable parts. The next step for this work is to integrate and release it as open source, as is the practice of the GIFT program. This allows other researchers to benefit from this work at no cost and with no change to their baseline code or models – a powerful advantage. This work is anticipated to be publicly available as part of the GIFT open source package in late 2018, or upon request at time of publication. Another next step is to test the non-confounded learning items instructionally to determine the cause of the learning effect; i.e., is it the intelligent ordering or the AAR screen that causes the learning effect? Finally, the important work in this vein of research is to test the software with a larger sample and in different domains. The authors invite collaboration in doing so.