
1 Introduction

With the development of smartphones and mobile Internet technologies, the volume of online video data has grown explosively, and the number of users watching videos on major video sites has increased significantly. Faced with such a huge amount of video data, manual review of video content is no longer feasible. As a result, some violent and bloody videos are exposed directly to viewers, harming their visual experience and mental well-being, and in particular the healthy growth of young people. This paper therefore focuses on the classification of bloody videos to reduce these adverse effects.

Because of the discomfort caused by bloody content and the absence of a related public dataset, few scholars have paid attention to this problem. Violent video recognition technology has developed slowly and still faces many open and difficult problems; even the definition of the bloody class in complex scenes remains vague. The European MediaEval benchmark proposed the VSD (Violent Scene Detection) task, which defines violent video in its competition as [1]: 'videos containing physical violence that children under 8 years of age are not permitted to watch'. This paper adopts this concept to define bloody videos. The publicly released MediaEval 2015 database currently contains more than a dozen fight scenes among its violent videos, but only about a dozen bloody scene videos, which cause greater psychological harm, and the total duration of the bloody footage is only about one minute. In view of the lack of bloody video data and the limited intelligence of existing bloody video detection algorithms, this paper constructs a bloody video database and proposes an end-to-end bloody video recognition system based on audio-visual feature fusion, in order to purify the network environment and improve the user experience.

This article is organized into five parts: Sect. 2 introduces the related research, Sect. 3 presents the end-to-end bloody video recognition system based on audio-visual feature fusion, and Sect. 4 compares and analyzes the experimental results. The conclusion and future research directions are given in Sect. 5.

2 Related Work

Related research mainly focuses on feature extraction and multimodal fusion. For visual feature extraction, convolutional neural networks are often used to extract image features of static frames [2]. For example, violence frames are used as input in [3], and the model is fine-tuned from a model pre-trained on ImageNet. The experimental results show that, compared with traditional features, high-level semantic features help improve the recognition performance of violent video systems. The work in [4] draws on the two-stream CNN structure, uses static video frames and optical flow as the two CNN inputs to extract violent video features, and feeds the CNN outputs into an LSTM network to analyze long video sequences. A variety of hand-crafted features are also extracted, and several SVM classifiers are trained on both the hand-designed features and the features learned by deep learning to obtain the final decision. In [5], adjacent video frames are used as the input of the neural network, and a convolutional LSTM network alone is used to extract the inter-frame change information and scene semantic information of violent videos. In summary, the above methods use only visual information to recognize violent videos. In addition to bloody content on the screen, however, bloody audio such as screams and explosions can provide complementary information, so it is unreasonable to classify videos from visual or audio information alone.

Compared with single-modality classification, an audio-visual multimodal fusion model can capture complementary information that a single modality cannot obtain and make a more robust prediction. In [6], a CNN is used both as a deep audio feature extractor and as a violent video classifier: the audio is windowed with a 25 ms window and a 10 ms shift to obtain MFB features, which are sent to the CNN to extract high-level audio features, and finally the decision scores of the audio and visual modalities are merged. However, this method feeds the MFB features computed from the original waveform into the CNN instead of sending the raw waveform directly to the network, which inevitably loses part of the original audio information and affects the audio feature extraction of the CNN. In [7], deep learning features and hand-designed features are used to extract visual channel features, MFCC features are used for the audio channel, and an SVM-based late-fusion method is used to classify violent videos. All of the above methods adopt late fusion of the audio and video information. The disadvantage of late fusion is that it fails to exploit the feature-level correlation among modalities, because the feature information of each modality has already been discarded; since the fused information is limited, the improvement in recognition performance is also limited. In contrast, audio-visual feature fusion can simultaneously 'see' more information from each modality, better capture the relationships between modalities, and significantly improve video classification performance. Nevertheless, how to effectively extract and fuse audio-visual features to recognize bloody videos is still an open research issue, and this paper focuses on this problem.

The main contribution of this paper is an end-to-end bloody video recognition system based on audio-visual feature fusion with deep learning. On the self-built bloody video database, we first use CNN and LSTM methods to extract the temporal and spatial characteristics of the visual channel. A 1D convolutional neural network then extracts time-domain features directly from the raw audio waveform. Finally, a neural-network feature fusion layer is constructed to achieve audio-visual feature fusion, and the proposed end-to-end bloody video recognition method is implemented. The classification accuracy on the bloody video test set reaches 95%, which provides a valuable reference for violent video recognition.

3 End-to-End Bloody Video Recognition Algorithm

The block diagram of the end-to-end bloody video recognition system proposed in this paper is shown in Fig. 1. For visual feature extraction, the static frame features extracted by the ResNet network are fed into an LSTM network to obtain visual features with temporal and spatial information. For audio feature extraction, in order not to destroy the original information of the signal, the raw waveform is input directly into a CNN. After the audio and visual features are obtained in these two steps, a feature fusion layer is trained with a neural network to fully capture the correlations between the features while preserving their respective characteristics. Through this feature fusion layer, a shared feature subspace of bloody audio and video is built, and the two modal features transformed into the same space are concatenated and fed into the classifier to obtain a decision score for the video.

Fig. 1. Overall network structure of our method

3.1 Visual Feature Extraction Based on CNN and LSTM

At present, the ResNet network is used to extract static image features. However, for visual feature extraction from bloody videos, extracting and analyzing static features frame by frame is not enough, because there is a temporal relationship between frames: each frame is related in content to its adjacent frames. Spatial convolutional neural networks such as ResNet cannot model this temporal continuity of video frames. We therefore introduce the LSTM structure, whose gating mechanisms decide which information to remember, update, and attend to, allowing useful information to persist over time. In order to make full use of the context information between video frames, we use a bidirectional LSTM [8] to extract video sequence information.

We first use the ResNet-50 model to extract the static frame features of bloody videos. The ResNet-50 model trained on bloody static frames is used as a bloody static feature extractor after removing its final fully connected layer. The 2048-dimensional vector obtained after average pooling is taken as the extracted bloody feature and fed into a two-layer LSTM network for training, instead of sending the original video directly to the LSTM network. The model diagram is shown in Fig. 2.

Fig. 2. ResNet + bidirectional LSTM network model

The detailed implementation process is as follows. First, one frame is sampled from the video every half second, and each sampled frame is input to the 50-layer residual network to extract a 2048-D feature vector. The sequence of 2048-D feature vectors is then fed into the LSTM network, and the decision result is obtained from the LSTM output. Because videos have different lengths, we use the average number of frames over the training set, about 10 frames per video, to fix the sequence length fed into the LSTM: sequences with more than 10 frames are truncated, so that the feature dimensions of all video outputs are the same.
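As an illustration, the following is a minimal PyTorch sketch of this visual branch under the assumptions that the LSTM hidden size is 256 (so the bidirectional output is 512-D, matching the visual feature dimension used later) and that frames are resized to 224 * 224; neither the framework nor these values is stated in the paper.

```python
import torch
import torch.nn as nn
import torchvision.models as models

class VisualBranch(nn.Module):
    """Sketch: ResNet-50 static-frame features + 2-layer bidirectional LSTM."""
    def __init__(self, hidden_size=256, num_frames=10):
        super().__init__()
        resnet = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)
        resnet.fc = nn.Identity()            # drop the final fully connected layer
        self.backbone = resnet               # yields a 2048-D vector per frame
        self.lstm = nn.LSTM(input_size=2048, hidden_size=hidden_size,
                            num_layers=2, batch_first=True, bidirectional=True)
        self.classifier = nn.Linear(2 * hidden_size, 1)  # bloody / non-bloody

    def forward(self, frames):               # frames: (batch, 10, 3, 224, 224)
        b, t = frames.shape[:2]
        feats = self.backbone(frames.flatten(0, 1))      # (b*t, 2048)
        feats = feats.view(b, t, -1)                     # (b, t, 2048)
        out, _ = self.lstm(feats)                        # (b, t, 512)
        video_feat = out[:, -1]              # last time step as 512-D visual feature
        return torch.sigmoid(self.classifier(video_feat)), video_feat
```

Taking the last LSTM time step as the clip-level feature is one simple choice; mean-pooling over time steps would be an equally plausible reading of the paper.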

3.2 Audio Feature Extraction Based on Raw Waveform and 1D CNN Model

Traditional audio models involve two steps: designing audio features and building a suitable model based on those features. However, features designed with prior knowledge are not guaranteed to be suitable for a specific statistical classification model. Therefore, we send the original waveform as input to the CNN in order to keep as much of the original signal information as possible.

This article designs a shallow fully convolutional network without any fully connected layers or dropout. Following [9], the network includes a global average pooling layer that averages the activation values along the time dimension and converts each feature map into a single scalar value. The network structure designed in this paper is as follows (Fig. 3):

Fig. 3. The schematic of the raw waveform CNN network structure

The specific implementation process is as follows. First, the audio data is read at a sampling rate of 8000 Hz, and all audio clips are truncated or padded to equal length. According to the length of the audio in the database, we take 32,000 sampling points, so the input audio sequence has shape (32000, 1). The receptive field of the first convolutional layer is set to 80 with a stride of 4, so the output of the first convolutional layer is a (8000, 256) feature map. The feature maps produced by the 256 filters are then passed through a time-domain max pooling layer of length 4, yielding feature vectors of length 2000. Next, a convolutional layer with a (3, 1, 256) kernel performs one-dimensional convolution with a stride of 1, so the output feature map has shape (2000, 256). This is followed by a second max pooling layer of length 4, giving an output of shape (500, 256). Finally, a global average pooling layer averages the 256 feature maps to produce a (256,)-dimensional vector. The bloody/non-bloody labels of the audio data serve as the supervision signal for training the audio network, and the weights of this network are later used to initialize the corresponding part of the fusion network. In this way, the network learns by itself the bloody audio characteristics most suitable for the classification task studied in this paper.
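A minimal PyTorch sketch of this raw-waveform 1D CNN is given below. The padding values are chosen to approximately reproduce the feature-map sizes quoted above, and the ReLU activations are assumptions not stated in the paper.

```python
import torch
import torch.nn as nn

class AudioBranch(nn.Module):
    """Sketch: raw-waveform 1D CNN with global average pooling."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            # receptive field 80, stride 4: 32000 samples -> ~8000 frames, 256 filters
            nn.Conv1d(1, 256, kernel_size=80, stride=4, padding=38),
            nn.ReLU(),
            nn.MaxPool1d(4),                  # ~8000 -> ~2000
            nn.Conv1d(256, 256, kernel_size=3, stride=1, padding=1),
            nn.ReLU(),
            nn.MaxPool1d(4),                  # ~2000 -> ~500
            nn.AdaptiveAvgPool1d(1),          # global average pooling over time
        )
        self.classifier = nn.Linear(256, 1)   # bloody / non-bloody

    def forward(self, waveform):              # waveform: (batch, 1, 32000)
        audio_feat = self.net(waveform).squeeze(-1)   # (batch, 256)
        return torch.sigmoid(self.classifier(audio_feat)), audio_feat
```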

3.3 Bloody Video Detection Based on Audio-Visual Feature Fusion

Early fusion, also called feature fusion, refers to fusing the extracted features of each modality. The simplest fusion method is to directly concatenate the visual and audio features. However, the visual channel features and the audio channel features have different meanings and lie in different feature spaces; directly merging two types of features with different meanings can sometimes degrade recognition performance. Therefore, how to eliminate the multimodal 'semantic gap', model the inter-modal relationships, and establish a shared feature fusion space remains a technical problem to be solved.

In order to make full use of the correlation and complementarity between features, we introduce a shared feature fusion layer on top of the previous single-channel networks and transform each modality's features into the same feature expression space through this newly created fusion layer. The multimodal network structure with the audio-visual feature fusion layer is shown in Fig. 1. In a data-driven manner, we train the whole network with deep neural network methods and obtain a shared feature subspace for bloody audio and video. The merged features are then sent to the network classifier to obtain the recognition result.

The detailed implementation process is as follows. During the training phase, we first train the visual and audio networks introduced in Sects. 3.1 and 3.2 separately. Their classification layers are discarded and only the extracted features are kept: the audio network yields a 256-dimensional feature vector and the visual network yields a 512-dimensional feature vector. Each feature vector is then fed into a fully connected layer of the feature fusion layer, followed by a ReLU activation for a nonlinear transformation. The two modal features, transformed into the same space, are concatenated and fed into the sigmoid classification layer to obtain the decision score for the video. The multimodal network is initialized with the weights of the unimodal models and trained end-to-end. The hyper-parameters of network training are set as follows: in the feature fusion layer, the number of fully connected neurons for the visual channel is 256 and for the audio channel is 125; the batch size is 32; the maximum number of training epochs is 10; the optimizer is Adam; and the initial learning rate is set to 0.0001.
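Putting the pieces together, the following sketch shows one way such a fusion layer and end-to-end fine-tuning loop could look in PyTorch, reusing the VisualBranch and AudioBranch sketches above; the data loader, the loss function, and the projection-layer placement are assumptions rather than details given in the paper.

```python
import torch
import torch.nn as nn

class FusionNet(nn.Module):
    """Sketch: audio-visual feature fusion layer + sigmoid classifier."""
    def __init__(self, visual_branch, audio_branch):
        super().__init__()
        self.visual_branch = visual_branch    # initialized from unimodal training
        self.audio_branch = audio_branch
        self.visual_proj = nn.Sequential(nn.Linear(512, 256), nn.ReLU())
        self.audio_proj = nn.Sequential(nn.Linear(256, 125), nn.ReLU())
        self.classifier = nn.Linear(256 + 125, 1)

    def forward(self, frames, waveform):
        _, v = self.visual_branch(frames)     # 512-D visual feature
        _, a = self.audio_branch(waveform)    # 256-D audio feature
        fused = torch.cat([self.visual_proj(v), self.audio_proj(a)], dim=1)
        return torch.sigmoid(self.classifier(fused))

# End-to-end fine-tuning with the stated hyper-parameters (Adam, lr 1e-4, 10 epochs)
model = FusionNet(VisualBranch(), AudioBranch())
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
criterion = nn.BCELoss()
for epoch in range(10):
    for frames, waveform, label in train_loader:   # hypothetical DataLoader, batch size 32
        optimizer.zero_grad()
        loss = criterion(model(frames, waveform).squeeze(1), label.float())
        loss.backward()
        optimizer.step()
```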

4 Experimental Results and Analysis

4.1 Database Description

Public bloody video databases are very scarce. The internationally released MediaEval 2015 violence database contains a total of 10,900 video clips with different resolutions, mostly 640 * 360 and 1280 * 720; the average clip length is about 10 s, the total duration is about 30 h, and only 4.6% of the clips are violent videos, with bloody videos forming an even smaller fraction. We therefore collected about 50 bloody movies and more than 20 short videos from YouTube, which provide about 80% of the clips, and cut out the complete bloody shots. These constitute the positive samples of the bloody video database, while the negative samples mostly use non-bloody video clips from MediaEval 2015. In the dataset, most video resolutions are 1024 * 576, the clip lengths range from 2 s to 4 s, the ratio of positive to negative samples is 1:1, and the total duration of the collected data is 67 min. The dataset distribution is shown in Table 1.

Table 1. Bloody video database composition

Database size and data quality play a key role in deep learning algorithms. On the one hand, a sufficiently large amount of data allows the deep network to fit complex functions more fully; on the other hand, it enables the network to accurately extract high-level semantic features from the data samples. Therefore, in order to overcome the problem of insufficient data, we apply data augmentation methods such as rotation, flipping, shearing, scaling, translation, color jitter, and noise perturbation to expand the static frame data of the bloody videos and then train the ResNet-50 network, so that the model obtains semantic features that better describe the static frame information of bloody videos.
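As an illustration, a torchvision transform pipeline along these lines could implement the listed augmentations; the specific parameter values and the 224 * 224 input size are assumptions, not values taken from the paper.

```python
import torch
from torchvision import transforms

# Hypothetical augmentation pipeline for bloody static frames (parameter values assumed)
train_transform = transforms.Compose([
    transforms.RandomRotation(degrees=15),                        # rotation
    transforms.RandomHorizontalFlip(p=0.5),                       # flip
    transforms.RandomAffine(degrees=0, translate=(0.1, 0.1),      # translation
                            scale=(0.8, 1.2), shear=10),          # scaling and shear
    transforms.ColorJitter(brightness=0.2, contrast=0.2,
                           saturation=0.2),                       # color jitter
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Lambda(lambda x: x + 0.01 * torch.randn_like(x)),  # noise perturbation
])
```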

4.2 Evaluation Metric

Recall, precision, and accuracy are commonly used indicators for measuring the predictive performance of a binary classification model. Among them, accuracy is a single-number evaluation metric, so this paper chooses accuracy as the evaluation index of the algorithm. Accuracy is the proportion of correctly classified samples among all samples, as shown in formula (1).

$$ Accuracy = \frac{TP + TN}{TP + TN + FP + FN} $$
(1)

TP (True Positive) is the number of bloody videos correctly classified as the bloody class, TN (True Negative) is the number of normal samples correctly classified as the normal class, FP (False Positive) is the number of normal samples mistakenly predicted as the bloody class, and FN (False Negative) is the number of bloody samples incorrectly predicted as normal.
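For concreteness, a small Python sketch of this computation from binary labels is shown below; the function and variable names are illustrative only.

```python
def accuracy(y_true, y_pred):
    """Accuracy = (TP + TN) / (TP + TN + FP + FN) for binary labels in {0, 1}."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return (tp + tn) / (tp + tn + fp + fn)
```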

4.3 Experimental Results Using End-to-End Audio and Video Feature Fusion Method

All of the tests in this article were conducted on our own bloody video dataset. Table 2 shows the bloody video classification results based on the visual channel alone, the bloody video classification results based on the audio channel alone, and the bloody video classification results based on the audio and video feature fusion.

Table 2. Comparison of the test results of each model in the bloody video test library

For bloody video recognition using the visual channel alone, the ResNet+LSTM-based method achieves a clear improvement over the plain residual network. The main reason is that the bidirectional LSTM network models the temporal characteristics of the video frames and considers temporal information in both directions. The prediction results confirm the effectiveness of the bidirectional LSTM network in the bloody video detection task.

For bloody video recognition using the audio channel alone, the result obtained by feeding the raw waveform into a 1D convolutional network is better than that obtained by feeding a spectrogram into a 2D convolutional network. This is because using the raw waveform directly as the network input reduces the loss of audio channel information, so the features obtained from the 1D convolutional network better describe the semantic content of the audio. Each piece of test audio is processed in the same way as in the training phase: it is padded or truncated to 32,000 sampling points and sent to the trained network to obtain a prediction score. However, as can be seen from Table 2, the results of detecting bloody videos with the audio channel alone are significantly lower than those with the visual channel alone. We believe the main reason is that, in the task of recognizing bloody videos, audio information is only complementary and auxiliary to the visual channel: not all audio tracks contain obvious bloody cues. For example, some videos display bloody content without screams or other characteristic sounds, and in such cases the audio cue alone cannot judge whether the video is bloody.

For bloody video recognition based on audio-visual feature fusion, we implemented two methods: one directly concatenates the audio and visual features, and the other builds a feature fusion layer. Comparing the experimental results, we can clearly see that the recognition accuracy of the multimodal fusion model is higher than that of either the visual or the audio single-channel detector, which verifies the effectiveness of audio-visual feature fusion. However, because bloody video data is difficult to collect, our dataset is not large, so the difference between the two fusion methods is not obvious. We believe that the method based on the feature fusion layer will exhibit better robustness as the amount of data increases.

5 Conclusion

5.1 Conclusion

This paper focuses on using multimodal fusion technology to achieve bloody video recognition. First, the related research on feature extraction and multi-channel fusion is summarized. Then, the frame features extracted by the residual network are sent to a two-layer LSTM to represent the spatio-temporal visual cues of bloody videos, and the raw audio waveform is fed into a 1D convolutional neural network to obtain time-domain audio features. Finally, bloody video recognition is achieved with early (feature-level) multimodal fusion, trained end-to-end using the video labels as the supervision signal. We aim to make full use of the correlation and complementarity of the visual and audio channels to make joint decisions on bloody videos. Using an effective fusion layer, the visual and audio features are projected into the same feature expression space, and the final accuracy reaches 95%.

Because public bloody video data is scarce, we constructed a bloody image and bloody video database using web crawlers and data augmentation methods. The advantage of the proposed multi-channel fusion model over the single-channel models is verified, and it achieves better discrimination on the videos in our self-built bloody video database. This research is also beneficial to the development of intelligent monitoring technology for bloody Internet content.

5.2 The Future Work

In this paper, CNN+LSTM is used to let the network automatically extract visual feature descriptors. However, bloody videos often contain fighting scenes. Therefore, motion cues such as optical flow will be considered in the future to obtain a higher-performance bloody and violent video recognition system.