1 Introduction

It is essential to equip instructor training with informative dialogic feedback on classroom activities, which allows teachers to adjust and refine their teaching instructions [1, 4, 10, 17, 24]. Prior research has demonstrated that pedagogical teaching styles and instructions may significantly influence students’ engagement and academic achievement [18, 22, 26]. Traditionally, providing such feedback is logistically complex and expensive, as it heavily relies on human annotation [3, 14, 20, 21], which makes it impractical in real-world education scenarios. Thus, in this work, we focus on building an automatic, AI-driven solution to this fundamental class activity detection (CAD) problem. More specifically, we aim to automatically annotate classroom audio recordings by recognizing different speakers’ roles, i.e., student or teacher. CAD solutions produce basic information about the quantities and distributions of classroom conversations, which is one of the essential steps toward deep classroom analysis [16].

A large spectrum of models has been developed to solve the CAD problem [2, 6, 8, 22]. Owens et al. proposed a machine learning algorithm that captures distinctive patterns in different instructional techniques and classifies classroom sound into different class activities [22]. Cosbey et al. targeted the same classroom sound classification problem as in [22] and adopted deep recurrent neural networks to extract meaningful features from audio frames [6]. Wang et al. conducted CAD using the LENA system [11] and identified three discourse activities: teacher lecturing, class discussion, and student group work [30].

However, CAD in real-world scenarios remains extremely difficult because of three challenges: (1) conversational turn-taking overlap: classroom conversations usually contain frequent talk exchanges between teachers and students, which leads to a number of inextricable speech overlaps; (2) vocal variability and uniqueness: every person’s voice is different and unique, which challenges the generalization ability of any CAD solution; and (3) classroom noise: both online and offline classrooms are dynamic, complex, and noisy in reality. To address these challenges, we develop a Siamese neural framework that precisely detects teacher and student activities from classroom audio recordings. The contributions of this work are summarized as follows: (1) it presents pioneering research on the CAD problem and proposes a novel Siamese neural framework to tackle it; and (2) we comprehensively evaluate our framework with different realizations and their benefits on both online and offline real-world, large-scale classroom datasets.

2 The Siamese Neural Framework

In this section, we describe our end-to-end Siamese neural framework for the CAD problem in detail. Our framework consists of three key components: (1) a feature extraction module that extracts window-level raw embeddings from a pre-trained large-scale audio encoding neural network; (2) a representation learning module that extracts semantic representations from each classroom audio segment; and (3) an attentional prediction module that predicts the activity type for each window. The overall framework architecture is shown in Fig. 1.

Fig. 1. The overview of our Siamese neural framework. VAD is short for voice activity detection.

Feature Extraction. We first utilize a well-studied voice activity detection (VAD) system to segment audio streams into utterances and filter out the noisy and silent ones [23, 25, 27]. Then we transform each segment into frames of pre-defined width and step, and extract 40-dimensional log-mel-filterbank energies from each frame. After that, we obtain windows by applying non-overlapping sliding windows of a fixed length to these frames. Once we create these audio windows from both teachers’ vocal sample segments and classroom recording segments, we extract each window’s corresponding low-dimensional dense vocal representation from a pre-trained acoustic neural network.
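As a minimal sketch of this step (not the authors’ implementation), the frame- and window-level processing could look as follows; the 25 ms frame width, 10 ms step, 40-frame window length, and the use of librosa for the log-mel computation are all illustrative assumptions.

```python
import numpy as np
import librosa  # assumed toolkit; the paper does not specify one


def extract_windows(segment, sr, n_mels=40, frame_len=0.025, frame_step=0.010,
                    frames_per_window=40):
    """Turn one VAD segment into non-overlapping windows of log-mel frames."""
    # Frame the segment and compute 40-dimensional log-mel-filterbank energies.
    mel = librosa.feature.melspectrogram(
        y=segment, sr=sr, n_mels=n_mels,
        n_fft=int(frame_len * sr), hop_length=int(frame_step * sr))
    log_mel = librosa.power_to_db(mel).T            # shape: (num_frames, 40)

    # Group frames into fixed-length, non-overlapping windows; drop the remainder.
    # Each resulting window would then be fed to the pre-trained acoustic
    # network (not shown) to obtain its dense vocal representation.
    num_windows = log_mel.shape[0] // frames_per_window
    usable = log_mel[:num_windows * frames_per_window]
    return usable.reshape(num_windows, frames_per_window, n_mels)
```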

Representation Learning. We learn a refined vocal representation for each window by utilizing the contextual dependencies within each segment (either from teachers’ vocal samples or classroom recordings). In our framework, any existing sequential modeling function, such as a long short-term memory (LSTM) network or gated recurrent unit (GRU), can be used [7, 13, 29]. By considering the contextual windows across the entire segment, we are able to smoothly model the changes of tone and pitch in the audio stream and reduce the noise and outliers from the raw feature extraction component.
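A minimal PyTorch sketch of this module, assuming a bidirectional LSTM as the sequential function and a 256-dimensional raw embedding (both sizes are assumptions for illustration):

```python
import torch
import torch.nn as nn


class SegmentEncoder(nn.Module):
    """Refines window-level raw embeddings using context within one segment."""

    def __init__(self, raw_dim=256, hidden_dim=128):
        super().__init__()
        # Any sequential model (LSTM, GRU, Transformer, ...) could be used here.
        self.rnn = nn.LSTM(raw_dim, hidden_dim,
                           batch_first=True, bidirectional=True)

    def forward(self, raw_windows):
        # raw_windows: (batch, num_windows, raw_dim)
        refined, _ = self.rnn(raw_windows)   # (batch, num_windows, 2 * hidden_dim)
        return refined
```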

Attentional Prediction. We design an attentional prediction module for the window-level class activity detection task. It is inspired by the intuition that all the audio windows spoken by the teacher share common attributes that are very different from those shared by students’ audio windows. Thus, we use the teacher’s vocal samples as an aggregated query and compute an attention score with each individual window from the classroom recordings. The higher the attention score, the more likely the audio window is spoken by the teacher. Based on this idea, we first add a mean pooling layer to aggregate all the teacher’s vocal sample representations, which yields a robust and representative query embedding of the teacher’s voice signals. The obtained vector is used as a voice biometrics query to compute attention scores with each individual window representation. In order to effectively train our framework, we design a cross-entropy loss function as the optimization objective. We use a mini-batch stochastic gradient descent algorithm to minimize the objective and update our model parameters.
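The following sketch illustrates this step under stated assumptions: dot-product scoring between the mean-pooled teacher query and each classroom window, with a sigmoid and binary cross-entropy loss; the exact scoring function and loss parameterization are not specified in the text.

```python
import torch
import torch.nn.functional as F


def teacher_attention_probs(teacher_reprs, classroom_reprs):
    """teacher_reprs: (T, d) refined windows from the teacher's vocal samples.
       classroom_reprs: (N, d) refined windows from the classroom recording.
       Returns per-window probabilities that the teacher is speaking."""
    # Mean-pool the teacher windows into a single voice-biometrics query.
    query = teacher_reprs.mean(dim=0)          # (d,)
    # Score each classroom window against the query (dot product; an assumption).
    logits = classroom_reprs @ query           # (N,)
    return torch.sigmoid(logits)


def cad_loss(probs, labels):
    """Cross-entropy against window-level teacher (1) / student (0) labels."""
    return F.binary_cross_entropy(probs, labels.float())
```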

3 Experiments

We evaluate our framework with two real-world K-12 education datasets: (1) the online dataset, which includes 400 classroom recordings and 300 distinct teachers from a third-party online education platform; and (2) the offline dataset, which includes 100 recordings and 36 distinct teachers from physical offline classrooms. We randomly select 100 and 10 recordings from the online and offline datasets, respectively, as our test sets; the corresponding prediction results are denoted as “Main”. Moreover, in order to evaluate the model’s generalization ability to new teachers, we further remove from the above test sets any recording whose teacher appears in the training set; these prediction results are denoted as “Generalization”. We use the area under the curve (AUC) score to evaluate model performance [9].
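For reference, the window-level AUC score can be computed with scikit-learn; the labels and scores below are made up purely to show the call.

```python
from sklearn.metrics import roc_auc_score


def evaluate_auc(y_true, y_score):
    """y_true: 1 if a window is spoken by the teacher, 0 otherwise.
       y_score: predicted teacher probabilities from the attentional module."""
    return roc_auc_score(y_true, y_score)


# Toy example with hypothetical labels and scores.
print(evaluate_auc([1, 0, 1, 0], [0.9, 0.2, 0.7, 0.4]))
```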

We choose the following approaches as our baselines: (1) Average: vocal representations from the feature extraction component are directly used for attentional prediction; (2) DNN/GRU/LSTM: a single-layer fully connected neural network, a bidirectional GRU, or a bidirectional LSTM is used in the representation learning component [5, 12, 19]; we use 128 neurons and ReLU as the activation function; and (3) Transformer: a Transformer is used in the representation learning component [28]; we use 2 layers with 4 heads per layer and set the dimension of each head to 16.
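A sketch of how the Transformer baseline could be configured from these hyperparameters (2 layers, 4 heads of dimension 16, i.e., a 64-dimensional model); the feed-forward size is an assumption not given in the text.

```python
import torch.nn as nn

d_model = 4 * 16  # 4 heads x head dimension 16, as described above
encoder_layer = nn.TransformerEncoderLayer(
    d_model=d_model, nhead=4, dim_feedforward=256, batch_first=True)
transformer_baseline = nn.TransformerEncoder(encoder_layer, num_layers=2)
```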

Experimental Results: The results are shown in Table 1. For the main task, we find that (1) Average performs much worse than any other method, which suggests that the fine-tuned representation learning plays an important role in the final prediction; (2) compared to GRU, LSTM, and Transformer, DNN achieves a lower detection accuracy, which is expected since it is not able to capture the contextual information of windows within each segment; (3) the performance of all methods on the online dataset is generally better than on the offline dataset; we argue that this is because the noise level of offline recordings is much higher than that of online recordings [16]; and (4) GRU and Transformer achieve comparable performance, which is consistent with previous findings [15]. For the generalization task, we have similar observations. The high accuracy achieved by Transformer and LSTM demonstrates the generalization ability of the proposed framework.

Table 1. Experimental results on the online and offline datasets.

4 Conclusion

We present a Siamese neural framework to tackle the CAD problem. Experiments demonstrate both the detection performance and the generalization ability of our framework. In the future, we would like to design models that combine both audio and video data to generate more comprehensive classroom activity feedback.