Abstract
Due to the ever-increasing number of digital lecture libraries and lecture video portals, retrieving lecture videos has become a significant and demanding task in recent years. Accordingly, the literature presents different techniques for video retrieval that consider video contents as well as signal data. Here, we propose a lecture video retrieval system using multimodal features and probability extended nearest neighbor (PENN) classification. Two modalities are utilized for feature extraction. The first is textual information, which is extracted from the lecture video using optical character recognition. The second modality, used to preserve video content, is the local vector pattern. These two modal features are extracted, and the retrieval of videos is performed using the proposed PENN classifier, an extension of the extended nearest neighbor classifier that assigns different weightages to the first-level and second-level neighbors. The performance of the proposed video retrieval is evaluated using precision, recall, and F-measure, which are computed by matching the retrieved videos against manually classified videos. The experimentation shows that the average precision of the proposed PENN+VQ is 78.3%, which is higher than that of the existing methods.
1 Introduction
Current video search and retrieval systems such as Google, YouTube, and Bing retrieve videos based on available textual metadata such as title, genre, person, and user-given tags, which may not always be available or relevant to the video content [8]. In general, such metadata has to be generated by a human to ensure high quality; however, this generation step consumes time and cost. Moreover, metadata provided by humans is brief, high level, and subjective in nature. Hence, beyond the existing techniques, upcoming video retrieval systems concentrate on automatically generating metadata using video analysis technologies, so that more effective content-based metadata can be created [6], [16], [19]. Generally, video retrieval methods are classified as text-based or content-based. Text-based methods take text as input and apply traditional textual search methodologies to search for textual information linked within a video, while content-based methods take images or videos as input and search for similar visual contents within a video [7].
Content-based video retrieval is commonly used in various image and video applications, such as video editing, composition, surveillance, object manipulation, scene composition, and health informatics [15]. The first step of video retrieval is partitioning a video sequence into shots. A shot is an image sequence that captures continuous action from a single operation of a single camera. Key frames can be used to represent video features, so retrieval can be performed based on the visual features of key frames, and queries may be directed at key frames using retrieval algorithms. After extracting the key frames, the next step is to extract the features. The features are generally extracted offline, so computation time is not a critical factor; nevertheless, feature computation can still take a long time [9]. The most common methods adopt low-level visual features such as color, texture, shape, and motion to measure the similarity between videos [5]. The last step matches the features with the query image to obtain the desired videos. Following these key steps, different methods for video retrieval have been presented in the literature, with a wide range of applications depending on the videos taken for retrieval.
Yang and Meinel [16] proposed character and speech recognition methods for retrieval. This method exploits textual content and improved retrieval performance, but considering both sources of information increases the computational overhead. Chen et al. [2] proposed a latent variable-based technique for retrieval. It performs well even when only noisy text is available, but its major drawback is that visual features are not considered. Cooper [3] also proposed character and speech recognition methods for retrieval. It combines both modalities to improve performance, but its weak indexing method limits the strength of retrieval. Yang et al. [18] proposed a weighted discrete cosine transform-based method for retrieval. Thanks to time-based text occurrence information, it improves retrieval performance, but it suffers from a text detector that is not robust to noisy pixels. Yang et al. [19] proposed character and speech recognition methods for retrieval. It builds search indices to improve search performance in video retrieval, but it requires manual annotation of videos. Yang et al. [17] proposed a video segmenter and geometry-based optical character recognition (OCR). This technique improved retrieval through video indexing; however, its dictionary-based learning requires more training sequences. Che et al. [1] proposed character recognition-based retrieval, which has the advantage of exploiting logical correlation among slides, but it fails to account for the textual characteristics and capturing characteristics of the slides.
In this paper, a lecture video retrieval system is developed using multimodal features and probability extended nearest neighbor (PENN) classification. The proposed content-based lecture video retrieval system utilizes textual information and content features for the retrieval purpose. At first, the input videos are read and frames are extracted from them. Then, key frames are identified from the input frames. Once the key frames are identified, two levels of information are extracted from them. The first level of information is the textual content, which is extracted using OCR methods [10], [12]. The second set of information is based on the visual content and is extracted based on the texture strength. Texture consistency is effectively estimated using a local pattern descriptor [4], which is one of the recent and effective techniques for describing image texture. These two sets of information are extracted from every video and stored in the indexed database. When a query frame or text information is given as input, the proposed system extracts these two levels of information from the input and matches them against the database using the proposed PENN, a method modified from extended nearest neighbor (ENN) classification [13] through probability modeling.
The major contributions made in the paper are given as follows:
A lecture video retrieval system is developed by combining the local vector pattern (LVP) and OCR with a classifier.
A new classifier is developed by modifying the ENN classifier to include a membership degree.
The paper is organized as follows: Section 2 presents the motivation behind the approach. Section 3 explains the proposed video retrieval technique, and Section 4 presents the experimentation of the proposed technique. Finally, the conclusion is given in Section 5.
2 Motivation Behind the Approach
2.1 Problem Definition
Let us assume that the lecture video database D contains N videos belonging to various categories. The aim here is to retrieve k similar videos by inputting the query Q, which may be a video VQ or a text TQ. The input database can be represented as follows:
where Vi is a video containing M frames. Every video is composed of a set of frames, represented as follows:
A frame is a two-dimensional array containing the pixel information gx,y; it is a group of pixels with dimension m×n. It can be represented as
Finally, the objective of retrieving the similar videos VR from the input database for the input query can be indicated as follows:
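In set notation, the problem formulation described above can be sketched as follows (a reconstruction from the definitions in the text; the operator name Retrieve is illustrative rather than part of the original formulation):

```latex
D = \{V_1, V_2, \ldots, V_N\}, \qquad
V_i = \{F_1, F_2, \ldots, F_M\}, \qquad
F_j = \big[\, g_{x,y} \,\big]_{m \times n},
\qquad
V_R = \mathrm{Retrieve}(D, Q, k), \quad Q \in \{V_Q, T_Q\}.
```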
2.2 Challenges
Due to the ever-increasing amount of information stored as video, finding videos that match the user’s intent in a database is indispensable in the current world. This presents the challenge of searching for and finding suitable, intent-matching videos from the user query, which may be frames or videos.
In today’s world, lecture videos play a major role for students in understanding and clarifying the algorithms or concepts presented by eminent professors. This poses the additional challenge of identifying the most suitable lecture video for their query input, which may be an image frame or a text string.
Converting the core content of a video into textual information raises further practical challenges. The textual characteristics of the contents presented in each lecture video differ completely in line spacing, font, and size. In addition, the capturing characteristics, such as illumination and intensity, also differ completely. These practical challenges need to be considered.
In Ref. [16], content-based lecture video retrieval was developed using character recognition and speech processing. Considering both sources of information adds computational overhead and, in most cases, both sources carry the same information in two different formats. Thus, retrieving the suitable video from a single source of information is an important research issue to be solved.
3 Proposed Methodology: Multimodal Features and PENN for Content-Based Lecture Video Retrieval
This section presents the proposed methodology for lecture video retrieval using multimodal features and PENN classification. The input for the proposed technique is the lecture video database containing different subjects. The feature library is then constructed from the input videos after extracting keywords and texture content. The constructed feature library is utilized with the PENN classifier for finding the neighbor videos of the input query, which may be video or text. The PENN classifier finds the probability of belonging for every video based on distance matching with the query, and it uses two levels of neighbors for the probability computation. Retrieval is then performed based on the neighbor videos found by the PENN classifier. Figure 1 shows the block diagram of the proposed video retrieval technique using the PENN classifier.
3.1 Extraction of Key Frames
This step extracts the key frames from the input videos to find the feature information. Key frame extraction is important because an input video may have a large number of frames, and extracting feature information from every frame is difficult and computationally complex. Therefore, the right selection of key frames, and the extraction of features only from those key frames, yields better retrieval efficiency as well as effectiveness. To meet this objective, each frame is subtracted from its previous frame, and frames with a large difference are taken as key frames. A frame that differs substantially from its previous frame carries significant new information; for a presentation video, it may correspond to the next slide. The input video is thus reduced to only the important frames, which are known as key frames:
where KFl denotes the key frames and L denotes the number of key frames. The number of key frames L should be smaller than the number of frames M in the input video Vi. Figure 2 shows the visualization of key frames from four categories of videos. Figure 2A is a video related to data mining, Figure 2B is related to image processing, Figure 2C is related to networking, and Figure 2D is related to soft computing.
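As a rough illustration of this frame-differencing step, the following Python sketch keeps a frame as a key frame when its mean absolute difference from the previous frame exceeds a threshold; the use of OpenCV and the threshold value are assumptions for illustration, not details specified in the paper.

```python
import cv2
import numpy as np

def extract_key_frames(video_path, diff_threshold=30.0):
    """Keep frames that differ strongly from their predecessor (key frames)."""
    cap = cv2.VideoCapture(video_path)
    key_frames, prev_gray = [], None
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        if prev_gray is None:
            key_frames.append(frame)            # always keep the first frame
        elif np.mean(cv2.absdiff(gray, prev_gray)) > diff_threshold:
            key_frames.append(frame)            # large change, e.g. a new slide
        prev_gray = gray
    cap.release()
    return key_frames
```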
3.2 Construction of the Feature Library
An important phase of the video retrieval scheme is the construction of the feature library, which is then used to find the relevant videos. In this paper, we utilize two modalities for feature extraction. The first modality is textual information, which is obtained from the lecture video using OCR [10], [11], [12], the most common method for finding textual contents in image or video data; the OCR methods given in Refs. [10], [11], [12] are applied directly to the video to find the keywords present in the frames. The second modality utilized in this paper is visual content, which is extracted using a texture descriptor.
3.2.1 OCR on Key Frames
Once we identify the key frames from the input video, OCR is applied to the key frames to extract the keywords present in them. The reason for using OCR text as a feature is that the text in the lecture slides is closely related to the lecture topic and can thus provide important information for the retrieval task. The literature presents various algorithms for OCR; this paper utilizes the popular algorithm given in Refs. [10], [11], [12], which is based on the benchmark OCR framework called Tesseract. The extraction of keywords present in the lecture videos proceeds in five important steps:
Step 1. Line finding: It directly reads the key frames and the lines are extracted using two main processes, called blob filtering and line construction. In blob filtering, the size of the characters is identified by finding median heights, which are then utilized for safely filtering out blobs. In the second process, line creation is performed by merging the blobs that overlap by at least half horizontally.
Step 2. Baseline fitting: A quadratic spline is utilized here to fit the baselines more accurately after the text lines are found. Here, the blobs are partitioned into groups with a reasonably continuous displacement from the original straight baseline, and the baseline is fitted to the groups.
Step 3. Fixed pitch detection and chopping: Here, characters are segmented by checking the pitch of the text. The determination of the pitch information is carried out using Tesseract, which is then utilized to chop the words into characters for the word recognition step.
Step 4. Segmentation and search: When the result of chopping a word is not good enough, the associator performs an A* (best-first) search on the segmentation graph of possible combinations of the chopped pieces to find candidate characters and select the optimal result from the search.
Step 5. Shape classification: Once the characters are segmented, word recognition is performed by extracting features using polygonal approximation [12]. A classifier is initially trained on features from different sets of words, and the words are then classified using this trained classifier.
After performing the above steps, the words are recognized from the key frames. Then, stop words such as “an,” “the,” “he,” “she,” “can,” and so on are removed from the recognized text to obtain the important keywords of the key frames:
where OCR(KFl) denotes the OCR applied to the key frames, Wp denotes the extracted keywords, and Nw is the total number of keywords. Figure 3 shows a sample set of keywords extracted from the videos using OCR.
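A minimal sketch of this keyword extraction step, assuming the Tesseract engine is accessed through the pytesseract wrapper and using a small illustrative stop-word list (the paper prescribes neither):

```python
import re
import pytesseract  # Python wrapper around the Tesseract OCR engine of Refs. [10]-[12]

STOP_WORDS = {"a", "an", "the", "he", "she", "can", "is", "of", "and", "to", "in"}

def extract_keywords(key_frame_image):
    """Run OCR on a key frame and return the recognized words minus stop words."""
    text = pytesseract.image_to_string(key_frame_image)
    words = re.findall(r"[a-zA-Z]+", text.lower())
    return [w for w in words if w not in STOP_WORDS]
```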
3.2.2 LVP on Key Frames
To extract the visual features, a texture descriptor called LVP [4] is utilized here to find the important contents for effective retrieval. Texture features are selected because texture plays a major role in computer recognition tasks: texture features are easy to understand, model, and process, and ultimately help simulate the human visual learning process using computer technologies. Here, the key frames are given directly to the LVP operator, which produces a texture histogram as the feature content. The LVP of the key frame in the δ direction of the vector at reference pixel r is given below:
where LVPd(•) refers to the LVP at neighborhood distance d and δ is the index angle.
where KF(δ,d) is the intensity of the pixel located at distance d and angle δ from the reference pixel r. Once the texture image is obtained, the texture vector is computed as the histogram of the texture image. The texture histogram of the key frame is represented as follows:
where Rq is the count of the qth bin and 255 is the total number of bins. Figure 4 shows the visualization of the LVP of four videos from four different categories.
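The full LVP operator of Ref. [4] is defined in a high-order derivative space; the sketch below computes only a simplified first-order local-difference pattern and its histogram, to illustrate how a texture histogram is obtained from a key frame (the 256-bin code and the 8-neighbour comparison are simplifying assumptions, not the exact LVP formulation):

```python
import numpy as np

def local_pattern_histogram(gray, bins=256):
    """Simplified local pattern: encode each pixel by comparing its 8 neighbours
    against it, then return the normalised histogram of the codes.
    `gray` is an 8-bit grayscale key frame as a 2-D numpy array."""
    h, w = gray.shape
    center = gray[1:-1, 1:-1].astype(np.int16)
    codes = np.zeros((h - 2, w - 2), dtype=np.uint8)
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]
    for bit, (dy, dx) in enumerate(offsets):
        neighbour = gray[1 + dy:h - 1 + dy, 1 + dx:w - 1 + dx].astype(np.int16)
        codes |= (neighbour >= center).astype(np.uint8) << bit
    hist, _ = np.histogram(codes, bins=bins, range=(0, bins))
    return hist / hist.sum()   # normalised texture histogram
```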
3.2.3 Feature Concatenation
Feature concatenation is the step that stores the features extracted from the videos in an organized way. The features of every video consist of the OCR keywords and the LVP of every key frame. For example, every video has L key frames, and every key frame has an LVP feature vector and a set of keywords as feature elements. The feature vector for the input video Vi can be represented as follows:
The feature library f contains the feature fi of every video in the input database:
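One simple way to organize the concatenated features, assuming a plain Python structure per video and reusing the helper sketches given earlier (the paper does not fix a storage format):

```python
import cv2

def build_feature_library(video_paths):
    """Build f = {f_1, ..., f_N}: for every video, the keywords and texture
    histogram of each of its key frames."""
    library = []
    for video_id, path in enumerate(video_paths):
        entry = {"video": video_id, "frames": []}
        for kf in extract_key_frames(path):                      # sketch from Section 3.1
            gray = cv2.cvtColor(kf, cv2.COLOR_BGR2GRAY)
            entry["frames"].append({
                "keywords": extract_keywords(kf),                # sketch from Section 3.2.1
                "lvp_hist": local_pattern_histogram(gray),       # sketch from Section 3.2.2
            })
        library.append(entry)
    return library
```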
3.3 PENN Classifier for Video Retrieval
This section presents the proposed PENN classifier for video retrieval, which matches the query video or text query with the feature library. The PENN classifier is newly proposed here by extending the ENN classifier of Ref. [13]. The ENN method considers the neighbors of the retrieved neighbors when taking the classification decision; however, the decisions based on the neighbors and on the neighbors of neighbors are given equal importance when classifying the data objects. To give different degrees of membership to the neighbors and to the neighbors of neighbors, we propose a new mathematical model for better classification. The proposed PENN classifier considers different weightages for the first-level and second-level neighbors, and the membership degree is computed using the probability of assignment. Table 1 shows the algorithmic description of the PENN classifier.
Table 1: Algorithmic description of the PENN classifier.

Algorithm: PENN classifier
Input: Feature library f, query Q, K
Output: Retrieved videos
Start
  For i = 1 to N do
    Compute the probability
    Compute the cumulative probability
  End for
  Compute the set R containing the K videos with the smallest cumulative probability
  Return the K videos
End
Let us assume that the input query Q is passed through the PENN classifier for the retrieval of K neighbors from the feature library. At first, the query is matched to the feature library that contains the features of all the videos. Then, the top K neighbors are selected from the matching process using the following probability formula:
where Sim(Q,fi) is the similarity measure, which is computed by matching the features of the query video with those of the ith video in the feature library. The similarity measure is computed using the following equation:
Based on the above equation, the similarity measurement is performed over all key frames, and the minimum value among the frames is taken as the final similarity value of the query video with the ith video. The similarity measure between two frames is the summation of the distance between the LVP vectors and the distance between the keywords. If a text query is given as the input, only the keywords are used for the similarity measurement. The formula for computing the similarity between two frames is given as follows:
Once the probability measure of the query video with the videos in the database is found, the videos having the minimum probability are taken as the K relevant videos of the input query. Then, these K videos act as queries, and their corresponding relevant videos are found using the following equation:
The similarity of these two videos is found using the above equation, and the cumulative probability used to decide the relevant videos is computed using the following equation:
where α and β are weighting constants. Here, the first term refers to the probability of membership based on the first level of neighbors, and the second term refers to the probability of membership based on the second level of neighbors. The probability of membership of the query video with all the videos is thus obtained, and the top K videos with the minimum probability are taken as the relevant videos for the input query. Figure 5 shows the visualization of query videos and the retrieved results.
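A compact sketch of the retrieval logic described in this section for a video query, assuming the feature-library structure sketched earlier, a simple frame-to-frame distance (histogram difference plus keyword mismatch), and illustrative values for the weights α and β; the paper’s exact similarity and probability formulas are not reproduced here:

```python
import numpy as np

def frame_distance(query_frame, db_frame):
    """Distance between two frames: LVP-histogram distance plus keyword mismatch."""
    hist_dist = np.sum(np.abs(query_frame["lvp_hist"] - db_frame["lvp_hist"]))
    q, d = set(query_frame["keywords"]), set(db_frame["keywords"])
    word_dist = 1.0 - len(q & d) / max(len(q | d), 1)
    return hist_dist + word_dist

def video_distance(query_frames, db_video):
    """Minimum frame-level distance between the query frames and a database video."""
    return min(frame_distance(q, f) for q in query_frames for f in db_video["frames"])

def penn_retrieve(query_frames, library, k, alpha=0.6, beta=0.4):
    """Two-level neighbour search: first-level neighbours of the query, then
    neighbours of those neighbours, combined with the weights alpha and beta."""
    first_level = sorted(library, key=lambda v: video_distance(query_frames, v))[:k]
    scores = []
    for video in library:
        p1 = video_distance(query_frames, video)                           # first level
        p2 = min(video_distance(n["frames"], video) for n in first_level)  # second level
        scores.append((alpha * p1 + beta * p2, video["video"]))
    scores.sort()
    return [vid for _, vid in scores[:k]]   # k videos with the smallest combined score
```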
4 Results and Discussion
This section presents the experimental results and the comparative discussion with the existing methods using three different metrics.
4.1 Experimental Setup
The proposed multimodal features and PENN classification for content-based lecture video retrieval are implemented using MATLAB, and the performance of the proposed system and the existing systems is validated using the metrics precision, recall, and F-measure.
Dataset description: The videos utilized for video retrieval are collected from publicly available resources. In total, 40 videos are taken from four different categories: data mining, image processing, soft computing, and wireless communication. Every category contains 10 lecture presentation videos.
Evaluation metrics: The performance of the proposed video retrieval is evaluated using precision, recall, and F-measure. The definitions of these metrics are given as follows:
where Nrel is the number of relevant videos and Nret is the number of retrieved videos. Here, the relevant videos are the manually classified videos, and the retrieved videos are the outputs obtained by the methods.
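Assuming the standard definitions of these metrics (the paper’s exact equations are not reproduced here), they can be computed from the retrieved and relevant video sets as follows:

```python
def precision_recall_f(retrieved, relevant):
    """Standard precision, recall, and F-measure over sets of video ids."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    f_measure = 2 * precision * recall / (precision + recall) if hits else 0.0
    return precision, recall, f_measure
```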
Parameters fixed: The parameters considered in the proposed method are k and the radius R. The k-value is a user-specified parameter, as it is the number of videos the user wants to retrieve from the database. The value of R in the LVP is analyzed, and the best value is suggested. The experimentation is performed with four video queries and four text queries. The four video queries are taken from the input database, one from each category. The text queries utilized in the experimentation are {“data”, “image”, “network”, “computing”}. The comparison is performed with the ENN classifier described in Ref. [13] and the KNN classifier given in Ref. [14].
4.2 Analysis of k-Value from the PENN Classifier
This section presents an extensive analysis of the proposed video retrieval scheme for various k-values with both video and text input queries. Figure 6A shows the precision graph of the four different video queries, each taken from one of the four categories of videos. After inputting the video query, the k-value, i.e. the number of retrieved videos, is varied from 2 to 6 and the results are analyzed. From the results, we observe that VQ3 and VQ4 obtained the maximum precision of 83.3%, which is higher than that of the other video queries, while VQ1 and VQ2 obtained a precision of 80%. Similarly, Figure 6B shows the precision graph for the text queries. Here, the maximum precision of 90% is obtained when TQ4 is given as input and the k-value is 2. From both precision graphs, we clearly see that the precision value decreases as the k-value increases.
Figure 7 shows the recall graph for the four different queries with various numbers of neighbors. From the results, we observe that the recall value decreases when the k-value increases. The maximum recall of 76% is obtained when the input is VQ4 and the number of neighbors is 2. The maximum recall for VQ1, VQ2, and VQ3 is 74%, 74%, and 76%, respectively. Similarly, the recall values for the text queries are analyzed in Figure 7B. This graph shows that the recall value for the input query TQ4 is 80%, which is higher than that of the other text queries. Also, the recall value decreases with increasing k-value.
Figure 8 shows the F-measure graphs for the video and text queries. For the video queries, the maximum F-measure of 77.5% is obtained for VQ4. The F-measure for VQ1, VQ2, VQ3, and VQ4 is 75.71%, 75.71%, 77.5%, and 77.5%, respectively, when the number of retrieved videos is equal to 2. Similarly, Figure 8B shows the F-measure graph for the text queries. Here, the F-measure decreases as the number of retrieved videos increases. The maximum F-measure for the text queries TQ1, TQ2, TQ3, and TQ4 is 77.5%, 78%, 73.08%, and 83.3%, respectively. The minimum F-measure for all text queries, 70%, is obtained when the number of retrieved videos is equal to 6.
4.3 Analysis of Radius from LVP
This section presents an extensive analysis of the proposed video retrieval scheme for text and video queries when the radius parameter R of the LVP is set to 1, 2, and 3. Figure 9A shows the precision graph for the video queries for the various radius values. From the figure, we see that the maximum precision is achieved when the radius is equal to 2 for video query VQ4, while the minimum precision is obtained when the radius is equal to 1 for VQ2 and VQ3. Similarly, the precision graph for the text queries is plotted for the various radius values in Figure 9B. From the figure, we see that the maximum precision of 90% is obtained when the radius is equal to 1 for text query TQ4.
Figure 10 shows the recall graphs for the video and text queries for the various radius values. For a radius of 1, the maximum recall is 74%, obtained for video query VQ1; a maximum recall of 76% is obtained when the radius is equal to 2, and a maximum recall of 72% is achieved when the radius is equal to 3. Figure 10B shows the recall values of text queries TQ1, TQ2, TQ3, and TQ4. Here, the maximum recall for radius values of 1, 2, and 3 is 80%, 72%, and 70%, respectively.
Figure 11 shows the F-measure values of the video and text queries for the various radius values. From Figure 11A, the best F-measure of 75.3% for a radius of 1 is obtained for VQ1 and VQ4. Also, the best F-measure of 78% for a radius of 2 is obtained for VQ4, and the maximum F-measure for the largest radius is 72.67%, which is constant for all the video queries. Similarly, Figure 11B shows the F-measure graph for the text queries. Here, the maximum F-measure of 72.67% for a radius of 3 is obtained for TQ1, and the overall maximum F-measure of 83.3% is obtained when the radius is equal to 1 for TQ4.
4.4 Comparative Analysis
The comparative analysis of the proposed technique with the existing techniques is discussed in this section. Here, two proposed variants – PENN+VQ and PENN+TQ – are evaluated. For the existing works, KNN+VQ, KNN+TQ, ENN+VQ, and ENN+TQ are taken for the comparative analysis, where KNN [14] and ENN [13] are the two existing classification methods and TQ and VQ denote the text and video query modes used with each classifier. Four different video and text queries are taken for computing precision, recall, and F-measure, and the average performance over these four queries is used to plot the graphs for various values of k. Figure 12A shows the precision comparison of the methods. From Figure 12A, we see that the maximum precision of KNN+VQ, ENN+VQ, and PENN+VQ is 77.5%, 77.5%, and 78.3%, respectively, showing that the proposed PENN obtained a higher precision value. Similarly, the maximum precision of KNN+TQ, ENN+TQ, and PENN+TQ is 77.5%, 76.67%, and 78%, respectively.
Figure 12B shows the recall plot of the proposed methods against the existing methods. From the figure, we see that the maximum recall of 74% for a k-value of 2 is reached by PENN+TQ, and the maximum recall of 73.5% for a k-value of 3 is also reached by PENN+TQ; however, when the k-value is equal to 5, the maximum recall is reached by PENN+VQ. Overall, the maximum recall for the different k-values is reached by either PENN+VQ or PENN+TQ. Figure 13 shows the comparison of the proposed and existing methods using the F-measure. Here, the maximum F-measure of 75.33% is obtained by the proposed PENN+TQ, which is higher than that of the existing methods taken for comparison. The results clearly show that the proposed method achieved better performance than the existing methods. The reason for the improvement is that the proposed PENN classifier assigns different degrees of membership to the neighbors and to the neighbors of neighbors, whereas the existing methods assign equal weights.
5 Conclusion
We have presented a PENN classifier for video retrieval that accepts either video or text as input. Here, we combined multiple modalities, namely OCR keywords and texture-based video content features, for the retrieval of lecture videos. For the identification of characters in the videos, we utilized the well-known recognition engine Tesseract, and the video content-based texture was extracted using the LVP descriptor. Finally, the retrieval of the user-required number of videos was performed using the proposed PENN classifier, which considers the neighbors of the retrieved neighbors when taking the classification decision by computing the probability of assignment. The experimentation was performed with a lecture video database collected from publicly available resources, and the performance of the proposed video retrieval was evaluated using precision, recall, and F-measure. The results show that the average precision of the proposed PENN+VQ is 78.3%, which is higher than that of the existing methods. In the future, this method can be enhanced with intent-aware optimization, applied after the retrieved videos are obtained.
Bibliography
[1] X. Che, H. Yang and C. Meinel, Lecture video segmentation by automatically analyzing the synchronized slides, in: Proceedings of the 21st ACM International Conference on Multimedia, pp. 345–348, ACM, 2013. doi: 10.1145/2502081.2508115.
[2] H. Chen, M. Cooper, D. Joshi and B. Girod, Multi-modal language models for lecture video retrieval, in: Proceedings of the ACM International Conference on Multimedia, pp. 1081–1084, 2014. doi: 10.1145/2647868.2654964.
[3] M. Cooper, Presentation video retrieval using automatically recovered slide and spoken text, in: Proceedings of SPIE, Multimedia Content and Mobile Devices, 2013. doi: 10.1117/12.2008433.
[4] K. C. Fan and T. Y. Hung, A novel local pattern descriptor-local vector pattern in high-order derivative space for face recognition, IEEE Trans. Image Process. 23 (2014), 2877–2891. doi: 10.1109/TIP.2014.2321495.
[5] J. Han, X. Ji, X. Hu, J. Han and T. Liu, Clustering and retrieval of video shots based on natural stimulus fMRI, Neurocomputing 144 (2014), 128–137. doi: 10.1016/j.neucom.2013.11.052.
[6] C. Kofler, M. Larson and A. Hanjalic, Intent-aware video search result optimization, IEEE Trans. Multimedia 16 (2014), 1421–1433. doi: 10.1109/TMM.2014.2315777.
[7] Y. H. Lai and C. K. Yang, Video object retrieval by trajectory and appearance, IEEE Trans. Circuits Syst. Video Technol. 25 (2015), 1026–1037. doi: 10.1109/TCSVT.2014.2358022.
[8] T. C. Lin, M. C. Yang, C. Y. Tsai and Y. C. F. Wang, Query-adaptive multiple instance learning for video instance retrieval, IEEE Trans. Image Process. 24 (2015), 1330–1340. doi: 10.1109/TIP.2015.2403236.
[9] B. V. Patel and B. B. Meshram, Content based video retrieval systems, Int. J. UbiComp (IJU) 3 (2012), 13–30. doi: 10.5121/iju.2012.3202.
[10] R. Smith, An overview of the Tesseract OCR engine, in: Proceedings of the Ninth International Conference on Document Analysis and Recognition (ICDAR), vol. 2, pp. 629–633, 2007. doi: 10.1109/ICDAR.2007.4376991.
[11] R. Smith, Hybrid page layout analysis via tab-stop detection, in: Proceedings of the 10th International Conference on Document Analysis and Recognition, 2009. doi: 10.1109/ICDAR.2009.257.
[12] R. Smith, D. Antonova and D. Lee, Adapting the Tesseract open source OCR engine for multilingual OCR, in: Proceedings of the International Workshop on Multilingual OCR, 2009. doi: 10.1145/1577802.1577804.
[13] B. Tang and H. He, ENN: extended nearest neighbor method for pattern recognition, IEEE Comput. Intell. Mag. 10 (2015), 52–60. doi: 10.1109/MCI.2015.2437512.
[14] X. Wu, V. Kumar, J. R. Quinlan, J. Ghosh, Q. Yang, H. Motoda, G. J. McLachlan, A. Ng, B. Liu, P. S. Yu, Z.-H. Zhou, M. Steinbach, D. J. Hand and D. Steinberg, Top 10 algorithms in data mining, Knowl. Inform. Syst. 14 (2007), 1–37. doi: 10.1007/s10115-007-0114-2.
[15] P. Yadav, Case retrieval algorithm using similarity measure and adaptive fractional brain storm optimization for health informaticians, Arab. J. Sci. Eng. 41 (2016), 829–840. doi: 10.1007/s13369-015-1928-y.
[16] H. Yang and C. Meinel, Content based lecture video retrieval using speech and video text information, IEEE Trans. Learn. Technol. 7 (2014), 142–154. doi: 10.1109/TLT.2014.2307305.
[17] H. Yang, H. Sack and C. Meinel, Lecture video indexing and analysis using video OCR technology, in: Proceedings of the Seventh International Conference on Signal-Image Technology and Internet-Based Systems (SITIS), pp. 54–61, 2011. doi: 10.1109/SITIS.2011.20.
[18] H. Yang, M. Siebert, P. Lühne, H. Sack and C. Meinel, Automatic lecture video indexing using video OCR technology, in: Proceedings of the IEEE International Symposium on Multimedia (ISM), pp. 111–116, 2011. doi: 10.1109/ISM.2011.26.
[19] H. Yang, F. Grünewald, M. Bauer and C. Meinel, Lecture video browsing using multimodal information resources, in: Lecture Notes in Computer Science, Advances in Web-Based Learning, vol. 8167, pp. 204–213, 2013.