
1 Introduction

Bayesian Knowledge Tracing (BKT) is an established approach to modeling skill acquisition of students working with intelligent tutoring systems. However, BKT is far from an ideal solution, and multiple improvements and extensions have been suggested over the years. One such extension is Deep Knowledge Tracing (DKT). The first DKT model [6] adopted the Recurrent Neural Network (RNN) architecture from the deep learning community. Recent publications on DKT discuss various modifications of the RNN architecture to better reflect student learning theories, as well as explore new deep learning models. Our work is inspired by both directions and proposes using the Transformer architecture to model students’ knowledge state.

The Transformer model was first proposed by the Google Brain team [9] to generate better neural translations. It soon became the dominant model in many Natural Language Processing (NLP) problems [2, 7]. The main advantage of the Transformer over RNN is its ability to learn long-range dependencies [2].

We modified the Transformer architecture so that it does not directly learn a representation of each question. Instead, it learns a representation of the underlying W-matrix that relates knowledge components to question items, including cases where multiple knowledge components are associated with a single item. This modification allows the model to learn item-specific representations for frequently encountered questions while representing rare questions through their underlying skill representations. Further, we allow the attention weight between question items to decay as students work through questions or problems, which effectively models forgetting.

Fig. 1. Transformer architecture

2 Related Work

The original Deep Knowledge Tracing (DKT) model [6] used an RNN-based architecture and claimed to outperform BKT by a large margin. However, later work [3, 10] showed that RNN-based DKT is not superior to BKT models across a pool of datasets once data preprocessing errors are taken into account.

Recent work on DKT follows two general patterns. First, a group of studies [4, 11] tried to adjust the RNN structure so that it is consistent with the students’ learning process. For example, the Dynamic Key-Value Memory Network [11] explicitly maintains representations of knowledge components and knowledge states. Further, Nagatani and colleagues [4] directly modeled students’ forgetting behavior in an RNN but achieved only limited success.

Another group of DKT papers seeks to leverage recently developed Transformer models [1, 5, 8]. Pandey and Karypis [5] used a self-attention model that is a simplified version of the Transformer. Ralla and colleagues [8] used a Transformer encoder to pre-train on students’ interactions. Choi et al. [1] experimented with different ways of rewiring the components of the original Transformer. All of these works showed promising results and motivated this study.

3 Methods

Figure 1 shows a simplified version of our adapted Transformer model. The main inputs to the model are a sequence of tuples \(x_i = (q_i, c_i)\) and timestamps \(t_i\). Here, \(q_i\) represents the question item a student is trying to answer, and \(c_i \in \{0, 1\}\) represents whether the response is correct. The goal of the model is to produce a sequence of correctness estimates for \(c_{i+1}\), i.e., whether the student will correctly solve the next question \(q_{i+1}\). Formally, the Transformer model predicts \(P(c_{i+1} = 1 \mid x_0, \ldots, x_i, t_0, \ldots, t_i, q_{i+1})\).
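For concreteness, the following is a minimal sketch of this input format in Python; the `Interaction` container and its field names are our own illustration, not part of the original implementation.

```python
from typing import List, NamedTuple

class Interaction(NamedTuple):
    """One student interaction x_i = (q_i, c_i) together with its timestamp t_i."""
    question_id: int   # q_i: the question item the student attempted
    correct: int       # c_i in {0, 1}: whether the response was correct
    timestamp: float   # t_i: when the attempt occurred (e.g., seconds since start)

# A student's history is a chronologically ordered sequence of interactions.
# Given the history up to step i and the next question q_{i+1}, the model
# estimates P(c_{i+1} = 1).
history: List[Interaction] = [
    Interaction(question_id=12, correct=1, timestamp=0.0),
    Interaction(question_id=7, correct=0, timestamp=35.0),
]
```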

Interaction Embedding Layer. A student interaction, \(x_i\), is first encoded as an index, \(d_i\), and passed to the interaction embedding layer:

$$\begin{aligned} e_i= softmax(W_{d_i.}) S \end{aligned}$$
(1)

\(e_i\) is the vector representation of the student interaction \(x_i\), and \(W_{d_i.}\) contains the weights relating \(d_i\) to all latent skills. Each column of S is the vector representation of a latent skill, so \(e_i\) is a weighted sum of the underlying latent skill vectors.
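A minimal PyTorch-style sketch of Eq. (1) is given below. The indexing scheme \(d_i = 2 q_i + c_i\) and the initialization are our assumptions (the paper does not specify them), and S is stored with latent skills as rows rather than columns for convenience.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class InteractionEmbedding(nn.Module):
    """Eq. (1): e_i = softmax(W_{d_i.}) S, a skill-weighted interaction embedding."""

    def __init__(self, num_interactions: int, num_skills: int, dim: int):
        super().__init__()
        # W: one row of latent-skill weights per interaction index d_i.
        self.W = nn.Parameter(0.01 * torch.randn(num_interactions, num_skills))
        # S: each row here plays the role of one latent-skill vector.
        self.S = nn.Parameter(0.01 * torch.randn(num_skills, dim))

    def forward(self, d: torch.Tensor) -> torch.Tensor:
        # d: (batch, seq_len) interaction indices, e.g. d_i = 2 * q_i + c_i.
        weights = F.softmax(self.W[d], dim=-1)   # (batch, seq_len, num_skills)
        return weights @ self.S                  # (batch, seq_len, dim)
```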

Transformer Block – Masked Attention. The outputs of the interaction embedding layer, \(e_i\), are passed directly to a Transformer block:

$$\begin{aligned} q_i = Qe_i, k_i = Ke_i, v_i = Ve_i \end{aligned}$$
(2)
$$\begin{aligned} A_{ij} = \frac{q_ik_j + b(\varDelta t_{i-j})}{\sqrt{d_k}} , \forall {j \le i} \end{aligned}$$
(3)
$$\begin{aligned} h_i = \sum _{j \le i}softmax(A_{ij})v_j \end{aligned}$$
(4)

The masked attention layer first extracts a query \(q_i\), key \(k_i\), and value \(v_i\) from the input \(e_i\). It then assigns an attention weight, \(A_{ij}\), to each past interaction \(e_j\) based on two components: 1) \(q_ik_j\), the query-key agreement between \(e_i\) and \(e_j\), which can be interpreted as the degree of latent-skill overlap between interactions \(e_i\) and \(e_j\); and 2) a time-gap bias, \(b(\varDelta t_{i-j})\), which adjusts the attention weight according to the time gap between interactions \(e_i\) and \(e_j\). The hidden representation \(h_i\) is a weighted sum of the value representations of the past interactions \(e_j\). A Transformer block also has a feedforward layer, layer normalization, and residual connections; we refer readers to the original paper [9] for more detail.
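The sketch below implements Eqs. (2)-(4) as a single-head masked attention layer. The discretization of \(\varDelta t\) into log-scaled buckets with one learned bias per bucket is our assumption; the paper only specifies that the bias is a function \(b(\varDelta t_{i-j})\).

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class TimeBiasedMaskedAttention(nn.Module):
    """Eqs. (2)-(4): causal attention with a time-gap bias b(Δt)."""

    def __init__(self, dim: int, num_buckets: int = 32):
        super().__init__()
        self.Q = nn.Linear(dim, dim, bias=False)
        self.K = nn.Linear(dim, dim, bias=False)
        self.V = nn.Linear(dim, dim, bias=False)
        # One learned scalar bias per log-scaled time-gap bucket (assumed form of b).
        self.bias = nn.Embedding(num_buckets, 1)
        self.num_buckets = num_buckets

    def forward(self, e: torch.Tensor, t: torch.Tensor) -> torch.Tensor:
        # e: (batch, seq, dim) interaction embeddings; t: (batch, seq) timestamps.
        q, k, v = self.Q(e), self.K(e), self.V(e)
        scores = q @ k.transpose(-2, -1)                          # q_i · k_j
        dt = (t.unsqueeze(-1) - t.unsqueeze(-2)).clamp(min=0)     # Δt_{i-j}
        buckets = dt.float().log1p().long().clamp(max=self.num_buckets - 1)
        scores = (scores + self.bias(buckets).squeeze(-1)) / math.sqrt(e.size(-1))
        # Eq. (3) is defined only for j <= i: mask out future positions.
        seq_len = e.size(1)
        causal = torch.tril(torch.ones(seq_len, seq_len, device=e.device)).bool()
        scores = scores.masked_fill(~causal, float("-inf"))
        return F.softmax(scores, dim=-1) @ v                      # h_i, Eq. (4)
```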

Linear Layer + Loss. The output of the stack of Transformer blocks is fed to a linear layer before computing the final loss. \(e_{i+1}\) and \(e^*_{i+1}\) are the results of applying the interaction embedding layer to \((q_{i + 1}, 1)\) and \((q_{i + 1}, 0)\), respectively.

$$\begin{aligned} p_{i+1} = \frac{exp(h_{i}e_{i+1})}{exp(h_{i}e_{i+1}) + exp(h_ie^*_{i+1})} \end{aligned}$$
(5)
$$\begin{aligned} Loss = -\sum _i{c_{i+1}log(p_{i+1}) + (1 - c_{i+1})log(1 - p_{i+1})} \end{aligned}$$
(6)

4 Experiments

We ran 5-fold student-stratified cross-validation on three datasets that are frequently used in the literature. Table 1 lists descriptive statistics for the datasets.

ASSISTments 2017. Data from the ASSISTments online tutoring system.

STAT F2011. This data is from a college-level engineering statics course.

KDD, A. This data is challenge set A, the Algebra I 2008–2009 data set from the KDD 2010 Educational Data Mining Challenge.
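One way to obtain such student-level folds is to group all of a student's interactions into the same fold, e.g. with scikit-learn's GroupKFold. The sketch below follows that assumption; the authors' exact splitting code is not given.

```python
import numpy as np
from sklearn.model_selection import GroupKFold

def student_folds(student_ids: np.ndarray, n_splits: int = 5):
    """Yield (train, test) row indices so that no student appears in both."""
    gkf = GroupKFold(n_splits=n_splits)
    placeholder = np.zeros(len(student_ids))  # features are irrelevant to the split
    for train_idx, test_idx in gkf.split(placeholder, groups=student_ids):
        yield train_idx, test_idx
```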

Table 1. Dataset overview and student-stratified 5-fold cross validation
Table 2. AUC under different architectures

5 Results and Discussion

Table 1 summarizes our findings and compares them to the state-of-the-art Deep Knowledge Tracing model results in the literature, as well as to the Bayesian Knowledge Tracing (BKT) model. Our adapted Transformer model is superior to BKT on all datasets and outperforms the state-of-the-art DKT models from the literature by 9.81% and 11.02% on the ASSISTments 2017 and STAT F2011 datasets, respectively. This remarkable gain is not due to the structure of the original Transformer model alone: the self-attention model of Pandey and Karypis [5] is roughly equivalent to a one-layer Transformer, and their reported AUC scores on ASSISTments 2017 and STAT F2011 are about 10% worse than those of our adapted Transformer.

To further illustrate this point, we repeated the experiment with the original Transformer with and without the modified components, as shown in Table 2. The original Transformer gains a clear performance boost when the interaction-skill mapping and the time bias are added to its structure.

To conclude, our adapted Transformer architecture generated promising results on frequently used public datasets. For future work, we intend to explore how to efficiently incorporate more feature information into the Transformer architecture, as well as how to represent hierarchical relations between skills in the interaction embedding layer.