1 Introduction

Online programming courses have emerged as a popular way to introduce students to programming [1]. These courses offer several advantages: they are easily accessible and present students with engaging challenges. Unfortunately, the large number of students enrolled in these courses makes it infeasible to provide individual support to each of them. Automatic systems capable of providing adaptive support could enhance the students' experience and improve their success rate [2].

To build these automatic systems, we need models capable of detecting students who are likely to fail [3, 4, 5]. These models could exploit the large datasets that students generate when completing programming tasks [6, 7]. Students usually submit several partial solutions before solving a task, creating a programming trajectory for each exercise [8, 9]. These programming trajectories can be analyzed by machine learning systems to find general patterns [10].

In this study we present a supervised machine learning model that predicts a student's future programming performance. The model takes the programming trajectory followed by the student and estimates the probability of the student successfully completing the next exercise. The model has been validated using two different datasets obtained from two different online programming environments: Robomission [11] and the Hour of Code challenge from Code.org [12].

Our results indicate that this model can accurately predict whether a student will be able to successfully complete a programming exercise. The information provided by the model can be used to rank students in terms of their performance. Using this ranking, one can automatically select the group of students that would benefit most from an intervention.

2 Methods

2.1 Data

In this study we worked with two different datasets. The first dataset is a set of programming trajectories submitted by students while completing one exercise in the Hour of Code challenge [13]. Additionally, for each student the dataset contains information about whether the student successfully completed the next task. The exercises and their solutions are shown in Fig. 1. Piech et al. [8] describe this dataset in more detail. The second dataset comprises 85 programming tasks from the Robomission programming platform. Effenberger [14] gives a thorough description of this dataset.

Fig. 1. Hour of Code exercise 18 (left) and exercise 19 (right) and example solutions. To solve the exercise the student must program the squirrel to reach the acorn.

2.2 Proposed Model

Our goal is to build a supervised machine learning algorithm capable of predicting whether a student will successfully complete the next exercise. To this end we use the programming trajectories followed by the students, T = {ψ0, ψ1, …, ψn}, where ψ0 is the state before the student starts to work, ψi are the code snapshots submitted by the student, and ψn is the last snapshot.

The training phase is straightforward: all the programming trajectories present in the training dataset are assembled into a tree. Different branches of the tree contain information about different programming trajectories. Figure 2 describes the process to integrate a new trajectory {ψ0, ψ1, ψ5} into a tree. For each code snapshot present in the trajectory we check if there is a branch in the tree with matching snapshots. If there is such a branch, we follow it while the partial solutions match. As soon as we find a partial solution (ψ5 in this case) that is not present in the branch, a new branch is created.
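
As an illustration of this assembly step, the sketch below builds the tree as a trie keyed on code snapshots. It is a minimal sketch under our own assumptions: the class, method, and variable names (`TrajectoryTree`, `insert`, `passed_next_exercise`) are ours, not the paper's implementation, and snapshots are assumed to be hashable (e.g., canonicalized code strings).

```python
from collections import defaultdict

class TrajectoryTree:
    """Prefix tree over programming trajectories (hypothetical sketch)."""

    def __init__(self):
        # children maps (node_id, snapshot) -> child node_id
        self.children = {}
        # counts maps node_id -> [successes, totals] for the students
        # whose trajectory ended at that node
        self.counts = defaultdict(lambda: [0, 0])
        self.next_id = 1  # node 0 is the root, i.e., the empty state psi_0

    def insert(self, trajectory, passed_next_exercise):
        """Follow the branch matching the snapshots, forking where they diverge."""
        node = 0
        for snapshot in trajectory:
            key = (node, snapshot)
            if key not in self.children:      # no matching branch: create one
                self.children[key] = self.next_id
                self.next_id += 1
            node = self.children[key]
        # store the outcome at the node where the trajectory ends
        self.counts[node][0] += int(passed_next_exercise)
        self.counts[node][1] += 1
```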

Fig. 2. Steps followed to integrate a new trajectory {ψ0, ψ1, ψ5} into the tree. Two different leaves of the final tree present the same partial solution.

Once we have processed all the student trajectories to generate the tree, we store in each node the relevant parameters of the students whose programming trajectories ended in that node; in this study, we stored the proportion of students that successfully completed the next exercise. After assembling the tree, we can estimate the probability that a new student with trajectory Ti will successfully complete the next exercise. To classify the student, we simply compare this probability against a chosen threshold.
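
The prediction and classification steps could then look as follows. This builds on the hypothetical `TrajectoryTree` above, and the default value returned for trajectories that fall outside the tree is our assumption, since the paper does not specify how unseen trajectories are handled.

```python
def predict_success_probability(tree, trajectory, default=0.5):
    """Walk the tree along the student's trajectory and read the success
    rate stored at the node where the trajectory ends."""
    node = 0
    for snapshot in trajectory:
        key = (node, snapshot)
        if key not in tree.children:
            return default  # unseen trajectory: no stored statistics (our fallback)
        node = tree.children[key]
    successes, total = tree.counts[node]
    return successes / total if total else default

def classify_at_risk(tree, trajectory, threshold=0.5):
    """Flag a student as likely to fail the next exercise when the
    estimated success probability falls below the chosen threshold."""
    return predict_success_probability(tree, trajectory) < threshold
```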

We selected the Receiver Operating Characteristic (ROC) curve [15] and the area under the curve (AUC) to measure the performance of the classifier, computing both with a 10-fold cross-validation [16] stratified over students. We compare our model's optimal performance with the results of a simple baseline model, which assumes that a student's performance on the predicted task will be the same as their performance on the input task.
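
As a sketch of this evaluation, assuming the `TrajectoryTree` and `predict_success_probability` helpers above, scikit-learn can compute the pooled out-of-fold ROC curve and AUC. Note that we stratify the folds on the outcome label here, a simplification of the paper's stratification over students, and all function and variable names are ours.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import roc_curve, roc_auc_score

def cross_validated_roc(trajectories, passed_next, n_splits=10, seed=0):
    """Pool out-of-fold probability estimates and compute the ROC curve
    for detecting failing students (positive class = fail)."""
    y = np.asarray(passed_next, dtype=int)
    scores = np.empty(len(y))
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for train_idx, test_idx in skf.split(np.zeros((len(y), 1)), y):
        tree = TrajectoryTree()
        for i in train_idx:
            tree.insert(trajectories[i], y[i])
        for i in test_idx:
            scores[i] = predict_success_probability(tree, trajectories[i])
    # score failure as 1 - p(success) so that higher scores mean higher risk
    fpr, tpr, thresholds = roc_curve(1 - y, 1 - scores)
    auc = roc_auc_score(1 - y, 1 - scores)
    return fpr, tpr, auc
```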

3 Results

We start by examining whether our model successfully detects students who fail the next exercise in the Hour of Code challenge. The left side of Fig. 3 shows that the ROC curve lies systematically above the identity line (y = x). The area under the curve (AUC) of our model in this case is 0.77, with a 95% confidence interval of (0.77–0.79). Both the AUC and its confidence interval lie above 0.5, indicating that our model performs better than a random classifier. Figure 3 also contains the main results for the baseline model and the optimal threshold. We can see that the baseline model lies much closer to the bottom left corner of the figure than the optimal threshold.

Fig. 3. Left: ROC curve obtained when classifying failing students in the Hour of Code exercise. The cyan region represents the 95% confidence interval. Right: AUC values for all the Robomission tasks vs. the number of students that completed each task. The line represents the loess regression of the data points.

The right side of Fig. 3 shows the AUC obtained for each task in the Robomission dataset versus the number of students that attempted the task. We performed a loess regression [16] to look for a correlation between the AUC and the number of students. The graph shows no such correlation. However, the variability of the AUC values does depend on the number of students: when fewer than 500 students attempted a task, the AUC values show high variability, while above 500 students the variability decreases markedly.
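
For reference, a loess (lowess) smooth of per-task AUC against student counts can be reproduced with statsmodels. This is a sketch under our own assumptions: the function name, input arrays, and smoothing fraction below are ours, as the paper does not report its loess parameters.

```python
import matplotlib.pyplot as plt
import numpy as np
import statsmodels.api as sm

def plot_auc_vs_students(n_students, auc_values, frac=0.6):
    """Scatter per-task AUC against the number of students and overlay a
    lowess smooth, mirroring the right panel of Fig. 3. The smoothing
    fraction `frac` is a guess; the paper does not report it."""
    n_students = np.asarray(n_students, dtype=float)
    auc_values = np.asarray(auc_values, dtype=float)
    # lowess returns the smoothed curve as sorted (x, y_fit) pairs
    smoothed = sm.nonparametric.lowess(auc_values, n_students, frac=frac)
    plt.scatter(n_students, auc_values, alpha=0.6, label="tasks")
    plt.plot(smoothed[:, 0], smoothed[:, 1], color="black", label="loess fit")
    plt.xlabel("Number of students")
    plt.ylabel("AUC")
    plt.legend()
    plt.show()
```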

4 Conclusions

In this study we present a machine learning algorithm able to predict the future performance of novice programmers using their programming trajectories in just one exercise. The output of the model can be used to rank students according to their predicted performance. The data used by the model can be easily obtained in online programming environments.

We have validated our model using two different datasets from two online learning platforms. Our results indicate that the model can classify students with reasonable accuracy. We have also found that the average performance of our model seems to be independent of the number of students attempting the task.