1 Introduction

Nowadays, e-learning has become an important means of acquiring knowledge. A large number of lecture videos are posted on the Internet every day, and most of them are unstructured. Users who want to find a specific piece of knowledge usually have to browse the entire video, which is time-consuming. It is therefore essential to automatically extract a representative summary of a lecture video.

Detecting slide transitions is a critical issue in lecture video summarization. Lecture videos may capture the projected slides and the speaker with a pan-tilt-zoom (PTZ) camera, record the computer screen directly, or even switch between the PTZ camera and the screen recorder. During a lecture, the slide content may remain unchanged for a long time or change quickly, and users have to watch the whole video to find out. Moreover, an appearance change in the video frame does not necessarily indicate a slide transition: it is hard to distinguish real slide transitions from disturbances such as camera motion, camera switches, and people movement.

Unfortunately, previous approaches usually extract slide transition keyframes by measuring the visual difference between adjacent frames. Various features, such as histograms [13], SIFT [2], and wavelets [18], have been used to describe appearance similarity. These methods fail on frames that contain people movement or camera motion, which are common sources of noise in several types of lecture video. Recently, an iterative approach was proposed to localize the projection screen and detect slide transitions [9]. However, it targets a specific type of lecture video that captures the projection screen with a single PTZ camera; camera switches are not allowed, so its application range is limited.

In this paper, we present an automatic approach to detect slide transitions in lecture videos by inferring sparse time-varying graphs. We first partition the video into small segments by feature detection and matching. Inspired by the storyline summarization approach [8], we regard each segment as a node and construct a sparse time-varying graph that models the transition from one segment (slide) to another. After inferring the adjacency matrices of the graph through a global optimization, we analyze them to generate the slide transition keyframes robustly.

For evaluation, we collect a variety of lecture videos and compare our system with general video summarization approaches and a slide progression detection method. Experimental results show that our system handles several types of lecture videos and achieves the best performance.

The remainder of this paper is organized as follows. Section 2 reviews related work on lecture video summarization. Section 3 describes the system overview and the general pipeline of each part. Section 4 presents the core sparse time-varying graphs. Section 5 reports the experimental results and verifies the superiority of our system. Section 6 concludes this paper.

2 Related Work

In this section, we mainly review the literature on lecture video summarization.

Many previous approaches generate the summary by measuring the appearance similarity between adjacent frames. To compute the similarity, features such as color histograms, corner points, and edge information are extracted. For instance, some algorithms [10, 13] leveraged color histograms to summarize lecture videos. Jeong et al. [6] detected forward and backward slide changes with a recursive pruning algorithm. However, on one hand, appearance difference is not always effective when the lecture video contains camera motion; on the other hand, choosing the similarity threshold is non-trivial.

To address these problems, several shot boundary detection approaches have been proposed. For example, Zhao and Cai [21] presented a shot boundary detection algorithm based on fuzzy theory; they segmented videos into six classes to detect shot boundaries and trained on camera movement features to avoid interference. Porter [16] designed a shot segmentation algorithm based on a two-component frame differencing metric. Other classification methods introduced Support Vector Machines [20] and Neural Networks [12] to recognize shot boundaries. Recently, Li et al. [9] tracked feature trajectories to find slide progression frames.

Some work employs additional data to obtain a video summary, such as text embedded in lecture videos [15, 19], audio signals [5, 17], and electronic slides [1]. Wang et al. [19] reconstructed high-resolution video text to detect and analyze textual information for matching video clips. Ngo et al. [15] employed a foreground/background segmentation algorithm to obtain the projected electronic slides, and then detected and analyzed text captions to find slide transitions; in their experiments the camera is fixed and remains stationary, which is not universally applicable. Fan et al. [1] matched the original electronic slides to presentation videos with a hidden Markov model; since their input requires the original electronic slides, the method is not suitable for most lecture video summarization scenarios. In contrast, our method automatically detects slide transitions without additional data.

Our work is inspired by, but different from, the sparse time-varying graphs used to reconstruct storyline graphs of videos and photos [7, 8]. The approach in [8] used a set of photo streams to construct a storyline summary that represents the narrative structure of activities by inferring sparse time-varying graphs; the storyline can be further used for sequential image prediction. [7] proposed a scalable approach to jointly summarize a set of associated videos and images. Temporal graph analysis was also considered in [14], where the temporal graph is used for scene modeling and detection. In our system, we introduce temporal graphs to infer the real slide changes, whereas the graph in [8] is designed to discover the common structure of an event captured by many people.

3 Problem Settings

The input to our system is a lecture video represented by a set of frames \(\mathcal {I}=\{\varvec{I}^1,\cdots ,\varvec{I}^N\}\), where N is the number of frames. The original video is first partitioned into segments of similar appearance through feature matching. Subsequently, by representing each segment as a node, a sparse graph is built at each moment to indicate the transition between segments. After inferring the adjacency matrices of the sparse time-varying graphs, we detect slide transitions by analyzing their structure.

Video segments. SIFT features are first extracted to represent low-level visual information. With the help of feature matching, we grow a video segment by matching the SIFT features of each subsequent frame against the first frame of the current segment until the ratio of feature matches drops below a threshold m. For instance, if frames \(\#2\) to \(\#10\) are similar to frame \(\#1\), segment \(\#1\) consists of frames \(\#1\) to \(\#10\); segment \(\#2\) then starts from frame \(\#11\) and ends at the frame whose match ratio with respect to frame \(\#11\) is below the threshold m. To avoid parameter tuning, we estimate the threshold m adaptively. Specifically, we first compute the ratio of feature matches across the whole video and build a histogram. By fitting a Gaussian distribution \(N(\mu _r, \sigma _r)\) around the peak value with a maximum likelihood estimator (MLE), we naturally use \(m=\mu _r+3\sigma _r\) as the threshold instead of setting it empirically. Finally, frame \(\varvec{I}^i\) is described as a binary segment indicator vector \(\varvec{x}^i \in \mathfrak {R}^D\) with a single nonzero element, where D is the number of video segments. A sketch of this segmentation step is given below.
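The following Python sketch illustrates the greedy segmentation under stated assumptions: it uses OpenCV's SIFT with a brute-force matcher and the standard 0.75 Lowe ratio test, and it takes the threshold m as an explicit parameter rather than estimating it adaptively from the match-ratio histogram as described above. All helper names are illustrative and not part of the original system.

```python
import cv2  # OpenCV >= 4.4 with SIFT support

def match_ratio(des_ref, des_cur, matcher):
    """Fraction of the reference frame's SIFT descriptors that find a good
    match in the current frame (Lowe ratio test)."""
    if des_ref is None or des_cur is None or len(des_cur) < 2:
        return 0.0
    pairs = matcher.knnMatch(des_ref, des_cur, k=2)
    good = [p for p in pairs if len(p) == 2 and p[0].distance < 0.75 * p[1].distance]
    return len(good) / max(len(des_ref), 1)

def partition_video(gray_frames, m):
    """Greedy segmentation: a frame stays in the current segment while its
    match ratio to the segment's first frame is at least m; otherwise it
    starts a new segment. Returns a list of (start, end) frame indices."""
    sift = cv2.SIFT_create()
    matcher = cv2.BFMatcher()
    descs = [sift.detectAndCompute(f, None)[1] for f in gray_frames]

    segments, start = [], 0
    for i in range(1, len(gray_frames)):
        if match_ratio(descs[start], descs[i], matcher) < m:
            segments.append((start, i - 1))  # close the current segment
            start = i                        # frame i begins a new segment
    segments.append((start, len(gray_frames) - 1))
    return segments
```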

Definition of graphs. The time-varying graph is defined as \(\mathcal {G}^i=(\mathcal {V}, \mathcal {E}^i), i \in \{1, \cdots , N-1\}\), where each node in the vertex set \(\mathcal {V}\) corresponds to a video segment (i.e. \(|\mathcal {V}|=D\)). The edge set \(\mathcal {E}^i\) is encouraged to be sparse and time-varying. On one hand, sparsity avoids an unnecessarily complex graph structure, and the nonzero elements indicate strong relationships between nodes. On the other hand, \(\mathcal {E}^i\) varies smoothly with the content change over time, which is exactly the property exploited for slide transition detection. The slide transition detection problem thus becomes a graph inference problem, i.e., how to obtain a set of time-specific adjacency matrices \(\varvec{A}^i, i \in \{1,\cdots , N-1\}\) of the edge sets \(\mathcal {E}^i\), which is detailed in Sect. 4.

Slide transition detection. After obtaining the sparse adjacency matrices, we analyze them to produce the slide transition keyframes. For instance, while a slide stays unchanged, the nonzero elements of matrix \(\varvec{A}^i\) always appear on the diagonal, which means that each node switches to itself. During a slide transition period, however, matrix \(\varvec{A}^i\) differs from the previous ones and is not necessarily diagonal. Therefore, we detect slide transitions by analyzing the nonzero elements of the adjacency matrices. Due to the large number of frames, we adopt a coarse-to-fine two-step strategy to locate the transition frames efficiently. Specifically, the video frames are uniformly split into time intervals (e.g., every 10 frames), for which a coarse-level adjacency matrix is first estimated. Once a slide transition is found in a certain interval, a fine-level adjacency matrix is then calculated at each frame to locate the exact time point. More importantly, this framework neglects disturbances such as camera motion and people movement and reveals the real slide changes. A sketch of the coarse pass is shown below.
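As an illustration only, the coarse pass over precomputed interval-level adjacency matrices could look like the sketch below; the off-diagonal tolerance `tol` is a hypothetical parameter, not taken from the paper.

```python
import numpy as np

def flag_transition_intervals(coarse_A, tol=1e-3):
    """Coarse pass: flag intervals whose adjacency matrix has noticeable
    off-diagonal mass, i.e. some node transitions to a different segment.

    coarse_A : list of (D, D) adjacency matrices, one per time interval
               (e.g. one matrix for every 10 frames).
    Returns the indices of intervals that warrant a fine, per-frame pass.
    """
    flagged = []
    for t, A in enumerate(coarse_A):
        off_diagonal = A - np.diag(np.diag(A))
        if np.abs(off_diagonal).max() > tol:  # non-diagonal structure found
            flagged.append(t)
    return flagged
```

A fine pass would then re-estimate \(\varvec{A}^i\) at every frame inside each flagged interval and report the frame whose matrix departs from the diagonal as the transition keyframe.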

4 Sparse Time-Varying Graphs

In this section, we first describe the graph modeling principles, and then present the optimization framework for solving such sparse time-varying graphs.

4.1 Graph Modeling

The inference of the time-varying graph is formulated as a maximum likelihood estimation problem under a first-order Markov assumption between consecutive frames. We first describe the graph model in this subsection.

Given a lecture video \(\mathcal {I}=\{\varvec{I}^1,\cdots ,\varvec{I}^N\}\), after temporal segmentation we rewrite it as \(\mathcal {I}=\{\varvec{x}^1,\cdots ,\varvec{x}^N\}\). Under the Markov assumption, the likelihood of the sequence is defined as

$$\begin{aligned} f(\mathcal {I}) = f(\varvec{x}^1) \prod _{i=1}^{N-1} f(\varvec{x}^{i+1} | \varvec{x}^{i}), \end{aligned}$$
(1)

where \(f(\varvec{x}^{i+1} | \varvec{x}^{i})\) is the transition model describing the conditional transfer likelihood from frame \(\varvec{I}^{i}\) to frame \(\varvec{I}^{i+1}\).

For scalability, we reasonably assume that different dimensions \(x^{i+1}_{m}\) and \(x^{i+1}_{n}\), where \(m, n \in \{1,\cdots ,D\}\) and \(m\ne n\), are conditionally independent given \(\varvec{x}^{i}\). Therefore, the transition likelihood factorizes over each individual dimension

$$\begin{aligned} f(\varvec{x}^{i+1} | \varvec{x}^{i})=\prod _{d=1}^D f(x^{i+1}_d | \varvec{x}^{i}). \end{aligned}$$
(2)

Naturally, we use a linear dynamic model to simplify the transition model

$$\begin{aligned} \varvec{x}^{i+1} = \varvec{A}^i\varvec{x}^{i}+\varvec{\zeta }, \end{aligned}$$
(3)

where \(\varvec{\zeta } \sim N(0, \sigma ^2 \varvec{I})\) is a Gaussian noise vector with zero mean and variance \(\sigma ^2\).

Combining Eqs. (2) and (3), the transition likelihood is further expressed as the probability density function of a Gaussian distribution

$$\begin{aligned} f(x^{i+1}_d | \varvec{x}^{i})\sim N(\varvec{A}^i_{d\cdot } \varvec{x}^{i}, \sigma ^2), \end{aligned}$$
(4)

where \(\varvec{A}^i_{d\cdot }\) denotes the d-th row of matrix \(\varvec{A}^i\). Taking the logarithm of Eq. (1), the log-likelihood of the sequence is finally computed as

$$\begin{aligned} \log f(\mathcal {I})=\log f(\varvec{x}^1)-\sum _{i=1}^{N-1} \sum _{d=1}^D \left\{ \frac{1}{2} \log (2 \pi \sigma ^2 )+\frac{1}{2\sigma ^2}(x^{i+1}_{d}-\varvec{A}^i_{d\cdot } \varvec{x}^{i})^2\right\} . \end{aligned}$$
(5)

4.2 Global Optimization

The expected adjacency matrices in the graph should satisfy the following criteria: (1) they should be close to the MLE solution in Eq. (5); (2) they should have only a few nonzero elements; (3) the matrices of neighboring frames should be temporally coherent. In this subsection we show how to formulate and solve for the adjacency matrices in an optimization-based framework.

Firstly, the maximum likelihood estimator in Eq. (5) yields \(\varvec{x}^{i+1}=\varvec{A}^i \varvec{x}^{i}\). Because there is only a single observation at time i, this estimator suffers from high variance. Fortunately, since nearby frames should have similar appearance, we impose the transition model on the neighboring observation pairs \((\varvec{x}^{i+k}, \varvec{x}^{i-k+1}), k\in N^+\), to gather redundant constraints in the data term. Therefore, the first criterion is formulated as

$$\begin{aligned} \min _{\varvec{A}^i} \sum _{k=1}^K w^i_k ||\varvec{x}^{i+k} - \varvec{A}^i \varvec{x}^{i-k+1}||^2_2. \end{aligned}$$
(6)

The weight coefficient \(w^i_k\) indicates to what degree a neighboring frame pair should obey the same adjacency matrix, and is defined as

$$\begin{aligned} w^i_k = \exp \left\{ -\frac{(2k-1)^2}{2\sigma _t^2}\right\} \exp \left\{ -\frac{||\varvec{x}^{i+k}-\varvec{x}^{i-k+1}-\varvec{x}^{i+k+1}+\varvec{x}^{i-k}||^2_2}{2\sigma _f^2}\right\} . \end{aligned}$$
(7)

The former factor in \(w^i_k\) is the temporal weighting of the observation pair \((\varvec{x}^{i+k}, \varvec{x}^{i-k+1})\): as the temporal distance grows, the relation between observations gradually weakens, so pairs closer to time i contribute more to the estimation of \(\varvec{A}^i\). The latter factor in \(w^i_k\) weights the segment difference: if the difference between \((\varvec{x}^{i+k}-\varvec{x}^{i-k+1})\) and \((\varvec{x}^{i+k+1}-\varvec{x}^{i-k})\) is large, we regard the pair as noisy and assign it a low weight. In addition, \(\sigma _t=2\) and \(\sigma _f=0.2\) are standard deviations controlling the Gaussian kernels, and \(K=\min (\min (N-i-1, i-1), 2\sigma _t)\) is the size of the neighborhood set. This data term is the key reason why the method is able to discard disturbances from camera motion and people movement. A possible implementation of Eq. (7) is sketched below.
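For concreteness, a direct transcription of Eq. (7) might look as follows; the array `x` stacks the segment indicator vectors, indices are 0-based, and the caller is assumed to keep \(k \le K\) so that all four indices stay in range.

```python
import numpy as np

def pair_weight(x, i, k, sigma_t=2.0, sigma_f=0.2):
    """Weight w^i_k of the observation pair (x^{i+k}, x^{i-k+1}) in Eq. (7).

    x : (N, D) array of binary segment indicator vectors.
    The first factor decays with temporal distance; the second factor
    down-weights pairs whose segment difference looks noisy.
    """
    temporal = np.exp(-((2 * k - 1) ** 2) / (2 * sigma_t ** 2))
    diff = x[i + k] - x[i - k + 1] - x[i + k + 1] + x[i - k]
    content = np.exp(-np.dot(diff, diff) / (2 * sigma_f ** 2))
    return temporal * content
```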

Secondly, the graph should have only a few strong connections. This is achieved by an \(\ell _1\) regularizer that controls the sparsity of the adjacency matrix; it not only avoids over-fitting but also removes weak links between nodes.

Thirdly, adjacent frames should have similar matrices. We minimize the temporal difference \(||\varvec{A}^i-\varvec{A}^{i-1}||^2_2\) to maintain temporal coherence.

Finally, we obtain the complete optimization formula for the adjacency matrices

$$\begin{aligned} \min _{\{\varvec{A}^i\}} \sum _{i=1}^{N-1} \sum _{k=1}^K w^i_k ||\varvec{x}^{i+k} - \varvec{A}^i \varvec{x}^{i-k+1}||^2_2 + \lambda \sum _{i=1}^{N-1} ||\varvec{A}^i||_1 + \alpha \sum _{i=1}^{N-1} ||\varvec{A}^{i+1}-\varvec{A}^{i}||^2_2, \end{aligned}$$
(8)

where \(\lambda \) and \(\alpha \) are the weights of the sparsity term and the smoothness term, respectively (\(\lambda =0.05\) and \(\alpha =0.01\) in our system). This optimization formula differs significantly from that in [8]: on one hand, in [8] each \(\varvec{A}^i\) captures the common codeword transition probabilities of many photo streams, while we only have a single video sequence; on the other hand, they solve each \(\varvec{A}^i\) independently, while we employ a global optimization.

The global optimization in Eq. (8) can be solved with standard tools such as coordinate descent [3]. Note that the graph inference reduces to a weighted \(\ell _1\)-regularized least squares problem when all variables except \(\varvec{A}^i\) are fixed. Furthermore, thanks to the assumption that the dimensions of \(\varvec{x}^i\) are conditionally independent, neighborhood selection [11] can be applied to obtain each row of \(\varvec{A}^i\) separately. As a result, we iteratively solve a weighted lasso problem for each of the D rows of the adjacency matrix. A simplified per-frame version of this step is sketched below.
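As a rough sketch under simplifying assumptions (each \(\varvec{A}^i\) is solved independently and the smoothness term is omitted, so this is not the full coordinate-descent scheme of Eq. (8)), the per-row weighted lasso can be written with scikit-learn. Note that scikit-learn scales its \(\ell _1\) penalty by the number of samples, so `lam` is not numerically identical to \(\lambda \).

```python
import numpy as np
from sklearn.linear_model import Lasso

def infer_adjacency(x, i, K, weights, lam=0.05):
    """Estimate A^i row by row via weighted lasso (neighborhood selection).

    x       : (N, D) array of segment indicator vectors (0-based indexing)
    weights : length-K array of the w^i_k coefficients from Eq. (7)
    Each row d regresses the d-th entry of x^{i+k} on x^{i-k+1}, k = 1..K.
    """
    X_in = np.stack([x[i - k + 1] for k in range(1, K + 1)])   # inputs
    X_out = np.stack([x[i + k] for k in range(1, K + 1)])      # targets
    D = x.shape[1]
    A = np.zeros((D, D))
    for d in range(D):
        model = Lasso(alpha=lam, fit_intercept=False)
        model.fit(X_in, X_out[:, d], sample_weight=weights)
        A[d] = model.coef_
    return A
```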

5 Results and Discussion

As shown in Fig. 1, we collected three types of lecture video from Yale University Courses and YouTube to verify the effectiveness and superiority of our system. Type-A videos are recorded with multiple cameras, which allows complex camera motion and sudden camera switches, such as the switch from the slide to the speaker. Type-B videos are also recorded with multiple cameras but show the slide and the speaker simultaneously. Type-C videos capture the speaker and the on-stage screen with a still camera. Both camera movement and people movement can affect detection accuracy. Each lecture video is temporally down-sampled to 1 fps, and its spatial resolution is \(640\times 360\). Video lengths range from roughly 10 min to 45 min. The parameters are fixed as stated above.

Fig. 1. Three types of lecture videos in our experiments. (a) Type-A video presents complex camera motion and sudden camera switch. (b) Type-B video presents the speaker and computer screen in two regions simultaneously. (c) Type-C video presents the on-stage screen by a single camera.

A typical detection result for a Type-A lecture video is shown in Fig. 2, where slide progressions are automatically detected and marked on the timeline. The result shows that our method effectively picks out the slide content changes.

Fig. 2. Slide transition detection result by our approach for a Type-A lecture video. Ticks marked on the timeline indicate the detected transition frames shown below the timeline.

Fig. 3. Slide transition detection result by the SPD approach [9] for a Type-A lecture video. Note that both the camera switch and people movement are mistaken for slide transitions.

We also perform a quantitative evaluation to demonstrate the superiority of our system. After manually labeling the ground-truth slide changes, we use the \(F_1\) score as the evaluation metric

$$\begin{aligned} Precision = \frac{S_c}{S_t}, \quad Recall = \frac{S_c}{S_a}, \quad F_1 = \frac{2 \times Precision \times Recall}{Precision + Recall}, \end{aligned}$$
(9)

where \(S_c\) is the number of slide transitions correctly detected, \(S_t\) is the total number of detected slide transitions, and \(S_a\) is the total number of actual slide transitions.
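For completeness, the metric computation is straightforward; the small helper below is illustrative and not part of the original system.

```python
def precision_recall_f1(s_c, s_t, s_a):
    """Metrics from Eq. (9): s_c correctly detected, s_t total detected,
    s_a actual slide transitions (guarding against empty denominators)."""
    precision = s_c / s_t if s_t else 0.0
    recall = s_c / s_a if s_a else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1
```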

We compare our system with a video summarization approach using Singular Value Decomposition (SVD) [4], a shot boundary detection method using Frame Transition Parameters (FTP) [12], and the recent slide progression detection (SPD) method that analyzes feature trajectories [9]. Table 1 shows the average Precision, Recall, and \(F_1\) score of the different approaches over all types of lecture video, while the detailed performance on each type is shown in Fig. 4.

Table 1. Average performance of different methods on three types of lecture video.
Fig. 4. Detailed performance of different methods on three types of lecture video.

As shown in Table 1 and Fig. 4, our system significantly outperforms these approaches in detecting slide transitions. It improves the \(F_1\) score by \(16.6\%\) on average compared with the feature-trajectory-based SPD approach, and by up to \(46.8\%\) on Type-A lecture videos, where camera switches and complex motion are present. The SPD approach produces many false positives on Type-A videos due to the sudden camera switches and frequent people movement, as also evidenced by Fig. 3. We also find that the general SVD approach achieves a higher precision but misses most of the slide changes, while the shot boundary detection method using FTP achieves the worst performance for detecting slide transitions.

6 Conclusion

In this paper, we presented a sparse time-varying graph optimization approach to automatically detect slide transitions in lecture videos. By formulating the sparsity and time-varying characteristics into a global optimization framework, we solve for the adjacency matrices of the graphs and then detect the slide changes. Experimental results show that our system successfully summarizes lecture videos with keyframes and achieves the best performance. Besides general feature matching, other specific information in lecture videos, such as the text title at the top of the screen, could be used to further improve performance, which will be our future work.