1 Introduction

Sport video summarization, or highlights generation, is the process of creating a synopsis of a video of a given sport event that gives the viewer a general overview of the whole match. This process incorporates two different tasks: (1) detecting the most important moments of the event, and (2) organizing the extracted content into a limited display time. While the second point is a widely-known problem in the multimedia and broadcasting community, the definition of what a highlight is has different interpretations in the community. According to [9], highlights are “those video segments that are expected to excite the users the most”. In [22], the focus relaxes from excitement to general attention, so that salient moments are the ones that attract audience attention the most. These two definitions would imply explicitly designing specific models for extracting excitement from the crowd in one case and attention in the other. In this paper we overcome this problem by automatically learning, with deep architectures, visual features that discriminate between highlights and ordinary actions.

Fig. 1. Example video sequences of a goal event (left) and standard play time (right).

Traditionally, extracting sport highlights has been a labor-intensive activity, primarily because it requires good judgment to select and define salient moments throughout the whole game. Highlights are then manually edited by experts to generate a video summary that is significant, coherent, and understandable by humans. State-of-the-art artificial intelligence is still far from solving the whole problem.

In recent years, there has been an increasing demand for automatic and semi-automatic tools for highlights generation, mainly due to the huge amount of data (i.e. sport event videos) generated every day and made available through the Internet. Specialized broadcasters and websites are able to deliver sport highlights minutes after the end of the event, handling thousands of events every day. As a consequence, there has been extensive research in this area, with the development of several techniques based on image and video processing [1, 2, 8, 9, 13, 19, 22]. More recently, many works have started using additional sources of information to improve performance, including audio recordings [15, 21], textual narratives [17], social networks [7, 10, 18], and audience behavior [4,5,6, 14]. Although some solutions are already present on the market, performance is in general still fairly poor and we believe there is room for new research on this topic.

While previous work attempted to detect, in sport videos, actions that stimulate the excitement [8] or attract the attention [22] of the audience, in this paper we reverse the problem by analyzing the audience behavior to identify changes in emotions that can only be triggered by highlights on the game field.

Specifically, we present a novel approach for sport highlight generation based on the observation of the audience behavior. The approach analyzes a set of space-time cuboids using a 3D-CNN architecture. Each cuboid is classified independently; the per-cuboid results at a given time step are then processed by an accumulator that produces a highlight likelihood for the whole audience, which is used to perform the final ranking.

The rest of the paper is organized as follows: in Sect. 2 we briefly present the state of the art in automatic highlight detection. In Sect. 3 we detail the proposed methodology, while in Sect. 4 we show some qualitative and quantitative results on a public dataset of hockey matches. Lastly, in Sect. 5 we draw some conclusions and perspectives for future work.

2 Related Work

Money and Angius [12] provide an extensive literature survey on video summarization. According to the taxonomy proposed in that paper, related work can be classified into three categories: (1) internal summarization techniques; (2) external summarization techniques; and (3) hybrid summarization techniques. By definition, internal summarization techniques rely only on information provided by the video (and audio) streams of the event. These techniques extract low-level image, audio, and text features to facilitate summarization, and for several years they have been the most common summarization techniques. External summarization techniques require additional sources of information not contained in the video streams, usually user-based information (i.e. information provided directly by users) and contextual information (such as the time and location at which the video was recorded). Hybrid summarization techniques analyze both internal and external information, allowing the semantic gap between low-level features and semantic concepts to be reduced.

Social networks. According to Hsieh et al. [10], the quantity of comments and re-tweets can indicate the most exciting moments of a sport event. A highlight can be determined by analyzing the keywords in the comments and checking whether the number of comments and re-tweets passes a certain threshold. Fião et al. [7] use emotions shared by spectators on social networks during the match to build a system capable of generating automatic highlight videos of TV broadcasts of sports matches. Auxiliary sources of information are the TV broadcast video, its audio, motion analysis, and manual annotations (when available). The system also allows the user to query the video to extract specific clips (e.g. attacking plays of a specific team).

Text. In [17], Suksai and Ratanaworabhan propose an approach that combines on-line information retrieval with text extraction using OCR techniques. This way, they are able to limit the number of false positives.

Audio. Rui et al. [15] present a method that uses audio signals to build video highlights for baseball games. It analyzes the speech of the match announcer, both audio amplitude and voice tone, to estimate whether the announcer is excited. In addition, the ambient sound from the surrounding environment and the audience is also taken into consideration. Building on this work, Xiong et al. [21] handpicked highlight events and analyzed the environment and audience sounds at each of them. They discovered a strong correlation between loud, buzzing noise and some major highlight events. This correlation exists in all three sports analyzed: baseball, golf, and soccer.

Audience. Peng et al. [14] propose the Interest Meter (IM), a system able to measure a user’s interest and use it to conduct video summarization. The IM takes into account attention states (e.g. eye movement, blink, and head motion) and emotion states (e.g. facial expression). These features are then fused by a fuzzy fusion scheme that outputs a quantitative interest score, which is used to determine interesting parts of videos and concatenate them into video summaries. In [4], Conigliaro et al. use motion cues (i.e. optical flow intensity and direction entropy) to estimate the excitement level of the audience of a team sport event and to identify groups of supporters of the different teams. In [5], these features are used to identify highlights in team sport events using mean shift clustering.

3 Method

The proposed highlight detection methodology uses a 3D Convolutional Neural Network (3D-CNN) to extract visual features from video recordings of the audience of the event, and classifies them into positive samples (i.e. when a highlight occurs) and negative samples (i.e. standard play or timeouts).

From empirical observations, the audience reaction to a highlight (e.g. a goal) lasts for at least the 10 s that follow the event itself. For this reason, temporal resolution is not a critical parameter, and downsampling the video from 30 to 3 fps allowed us to reduce the computational burden without losing the informative part of the video. The 3D-CNN cuboids are extracted from a manually selected rectangular area that roughly contains the bulk of the audience, using a uniform grid with a fixed spatial dimension of 100 \(\times\) 100 pixels, while the temporal depth has been set to 30 frames. These parameters follow the a priori intuition that each block should represent a portion of spectators that is not too large, to limit the computational burden, but at the same time not too small, since this would make the features too location dependent. For our model we used a sliding window with a stride of 50 pixels, resulting in a maximum overlap of 50% between two crops.
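To make the extraction step concrete, the following sketch illustrates one way such cuboids could be produced; it is a minimal Python example under stated assumptions (the `frames` array layout, the helper name `extract_cuboids`, and the uniform temporal slicing are ours, not taken from the original implementation).

```python
# Hedged sketch of the cuboid extraction described above: frames are downsampled
# from 30 to 3 fps, and 100x100 crops with a stride of 50 pixels are stacked over
# 30 frames (10 s at 3 fps). `frames` is assumed to be a (T, H, W, C) numpy array
# covering the manually selected audience region.
import numpy as np

def extract_cuboids(frames, crop=100, stride=50, depth=30, subsample=10):
    """Return a (S, depth, crop, crop, C) array of spatio-temporal cuboids."""
    frames = frames[::subsample]                          # 30 fps -> 3 fps
    cuboids = []
    for t in range(0, len(frames) - depth + 1, depth):    # non-overlapping time slices
        clip = frames[t:t + depth]
        h, w = clip.shape[1:3]
        for y in range(0, h - crop + 1, stride):          # uniform spatial grid,
            for x in range(0, w - crop + 1, stride):      # 50% maximum overlap
                cuboids.append(clip[:, y:y + crop, x:x + crop])
    return np.stack(cuboids)
```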

In order to detect and rank the most important moments in the video sequence, we follow the idea of Conigliaro et al. [3], where information accumulators along time have been proposed to segment supporters of the two different playing teams. Our goal is however different: unlike them, we are interested in a global analysis of the excitement of the audience, regardless of the supporting preference at a certain time. For this reason we use an accumulator strategy over the whole audience area in the scene. Each spatio-temporal cuboid \(C_i\), \(i=1,...,N\) is a sample that is fed into the 3D-CNN and analyzed independently; then, for each time instant, the related probability score \(p_i\), \(i=1,...,N\) of belonging to the positive class is accumulated over all the samples in the spatial dimension, generating a scalar value, the Highlight Likelihood (HL), which represents how likely a particular instant is to be a highlight. A sketch of the overall system is shown in Fig. 2.
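A minimal sketch of the accumulator step is given below; it assumes the per-cuboid positive-class probabilities have already been computed by the network and arranged in a time-by-cuboid array (the layout and function names are illustrative, not from the original code).

```python
# Minimal sketch of the Highlight Likelihood accumulator: for each time instant,
# the highlight probabilities of all N cuboids covering the audience are summed
# into a single scalar HL score, and instants are ranked by that score.
import numpy as np

def highlight_likelihood(probs):
    """probs: (T, N) array with p_i of the positive class for cuboid i at time step t."""
    return probs.sum(axis=1)                  # one HL score per time step

def rank_highlights(probs, top_k=5):
    hl = highlight_likelihood(probs)
    return np.argsort(hl)[::-1][:top_k], hl   # indices of the top-k candidate instants
```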

Fig. 2. Sketch of the overall method.

3.1 Network Architecture

Inspired by earlier works on action recognition [11, 20], we use a 3D Convolutional Neural Network composed of 4 convolutional and 3 fully connected layers.

The network takes as input video cuboids of 100 \(\times\) 100 \(\times\) 30, where the first two numbers refer to the spatial dimensions while the third is the temporal depth (number of frames). The first two convolutional layers are composed of 12 filters of size 3 \(\times\) 3 \(\times\) 3, to capture spatio-temporal features from the raw data. These are followed by a 2 \(\times\) 2 \(\times\) 2 max pooling layer to detect features at different scales. In the latter two convolutional layers, 8 convolutional filters of size 3 \(\times\) 3 \(\times\) 3 have been used. In all convolutional layers the ReLU activation has been used. The network is then unfolded with a flatten layer followed by three fully connected layers of decreasing dimensionality (32, 8, and 2 neurons respectively). The final classification task is achieved by a softmax layer that outputs the probability of a test sample belonging to each of the two classes: “highlight” and “standard play”.
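A minimal Keras sketch of this architecture follows; it is not the authors' original code, and the padding scheme, the number of input channels, and the exact placement of the dropout layers described in Sect. 4.1 are assumptions where the text leaves them unspecified.

```python
# Hedged Keras sketch of the 3D-CNN described above (assumptions: RGB input,
# "same" padding, dropout placed before the first two fully connected layers).
from tensorflow.keras import layers, models

def build_highlight_3dcnn(input_shape=(30, 100, 100, 3)):
    return models.Sequential([
        layers.Input(shape=input_shape),
        # first two convolutional layers: 12 filters of 3x3x3, ReLU
        layers.Conv3D(12, (3, 3, 3), padding="same", activation="relu"),
        layers.Conv3D(12, (3, 3, 3), padding="same", activation="relu"),
        # 2x2x2 max pooling
        layers.MaxPooling3D(pool_size=(2, 2, 2)),
        # latter two convolutional layers: 8 filters of 3x3x3, ReLU
        layers.Conv3D(8, (3, 3, 3), padding="same", activation="relu"),
        layers.Conv3D(8, (3, 3, 3), padding="same", activation="relu"),
        layers.Flatten(),
        # fully connected layers of decreasing size, with 50% dropout before the first two
        layers.Dropout(0.5),
        layers.Dense(32, activation="relu"),
        layers.Dropout(0.5),
        layers.Dense(8, activation="relu"),
        # softmax over the two classes: "highlight" and "standard play"
        layers.Dense(2, activation="softmax"),
    ])
```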

4 Experiments

In this section we provide both qualitative and quantitative results to validate the proposed methodology. For the evaluation we adopted the S-Hock dataset [16], a publicly available dataset composed of 6 ice-hockey games recorded during the Winter Universiade held in Trentino (Italy) in 2013. Besides a set of short videos heavily annotated with low-level features (e.g. people bounding boxes, head pose, and action labels), the dataset also provides a set of synchronized multi-view recordings of full matches with high-level event annotations. In these games, the labeling consists of the time positions of meaningful events such as goals, fouls, shots, saves, fights, and timeouts.

In this work we considered only two matches: the final match (Canada-Kazakhstan) which is used for training the neural network, and the semi-final match (USA-Kazakhstan), used for testing.

4.1 3D-CNN Training Procedure

As mentioned briefly earlier, the positive class is named “highlights” and it represents all the spatio-temporal cuboids starting when a team scores a goal, while the negative class (i.e. “standard play”) includes the other neutral situations happening during the game. In this work we excluded all the other significant annotated events (fouls, fights, etc.) to reduce the number of classes. In the training phase the samples belonging to the two classes have been balanced to avoid dataset bias.

The S-Hock dataset provides a set of synchronized videos of the games, including several views of the audience at different resolutions/zoom levels, as well as the complete game footage. The videos are acquired from different points of view (frontal and slightly tilted to the side); in this work we used all these views to train a more robust model, able to learn features that are as scale and position invariant as possible. Positive and negative samples are then split into training and validation sets with a ratio of 70%-30%. A data augmentation procedure (horizontal flips in the spatial dimension) has been performed, not only to increase the amount of training data but also to increase the invariance of the network.
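The split and augmentation step could look roughly as follows; the helper name, the random shuffling, and the array layout are assumptions consistent with the description above, not the original code.

```python
# Sketch of the 70%-30% train/validation split and the horizontal-flip augmentation
# described above. `cuboids` is assumed to be a (S, T, H, W, C) array of samples.
import numpy as np

def split_and_augment(cuboids, labels, train_ratio=0.7, seed=0):
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(cuboids))
    n_train = int(train_ratio * len(cuboids))
    tr, va = idx[:n_train], idx[n_train:]
    x_tr, y_tr = cuboids[tr], labels[tr]
    # augment the training set with horizontal flips along the spatial width axis
    x_tr = np.concatenate([x_tr, x_tr[:, :, :, ::-1]], axis=0)
    y_tr = np.concatenate([y_tr, y_tr], axis=0)
    return x_tr, y_tr, cuboids[va], labels[va]
```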

The final optimization is cast as a classification problem, minimizing the categorical cross-entropy between the two classes. For this procedure we used the RMSprop algorithm, a generalization of the resilient backpropagation (Rprop) algorithm that uses only the sign of the gradient and adapts the learning rate separately for each weight, making it better suited to minibatch training. In our experiments we use minibatches of 64 samples each. A dropout layer with 50% drop probability is applied before each of the first two fully connected layers to reduce overfitting. The whole resulting dataset is composed of a total of 32,000 training samples. The procedure iterates over the whole dataset until convergence, usually reached after about 10 epochs. The whole training procedure takes about 2 h on a machine equipped with an NVIDIA Tesla K-80 GPU, using the Keras/TensorFlow framework.
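Putting the pieces together, a hedged training sketch with the settings reported above (RMSprop, categorical cross-entropy, minibatches of 64, roughly 10 epochs) might look like this; it reuses the hypothetical helpers from the previous snippets, and the learning rate and early-stopping criterion are assumptions.

```python
# Hedged training sketch using the settings reported above; it assumes the model and
# data helpers from the previous snippets and one-hot encodes the binary labels.
from tensorflow.keras import optimizers, callbacks, utils

x_train, y_train, x_val, y_val = split_and_augment(cuboids, labels)
model = build_highlight_3dcnn(input_shape=x_train.shape[1:])
model.compile(optimizer=optimizers.RMSprop(),
              loss="categorical_crossentropy",
              metrics=["accuracy"])
model.fit(x_train, utils.to_categorical(y_train, 2),
          validation_data=(x_val, utils.to_categorical(y_val, 2)),
          batch_size=64,      # minibatches of 64 samples
          epochs=10,          # convergence reported after about 10 epochs
          callbacks=[callbacks.EarlyStopping(patience=2, restore_best_weights=True)])
```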

Fig. 3. ROC curve.

Fig. 4. Summed probabilities of highlights over all the crops in the scene. As visible, peaks in the curve nicely correspond to highlights.

Fig. 5. Probability scores given by a subset of the crops (chosen to be non-overlapping for visualization purposes); each dot represents a crop describing part of the scene. Green dots represent crops classified as people reacting to a highlight (e.g. cheering), while red dots represent crops classified as people with a “standard” behavior. (Color figure online)

4.2 Quantitative Results

Here we report a quantitative performance evaluation of the 3D-CNN in detecting positive and negative highlight samples. From the second period of the testing game, we randomly selected 3000 positive samples and the same number of negative samples, and we fed them into the trained network. Figure 3 reports the ROC curve; the Area Under the Curve (AUC) is 0.87. Binary classification is performed by assigning each sample to the class with the higher score; under these conditions the network reaches 78% accuracy, 69% precision, and 84% recall. These results are quite good considering the difficulty of the task; our goal, however, is different, since we use them within a more sophisticated framework to infer and rank interesting events over the whole game. Consequently, we expect a certain amount of noise in the predictions, since in many cases a sample may be partially filled with empty seats (see Fig. 5), producing a wrong prediction, or at least one biased toward the negative class. However, this problem is mitigated by the accumulator approach: empty-seat locations carry very little information over the whole sequence, while the crowded locations, where most of the spectators are situated, convey most of the information used for the final decision.
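For reference, figures of this kind can be computed from the network outputs with standard scikit-learn utilities; the snippet below is a generic sketch, not the authors' evaluation code, and `x_test`/`y_true` stand for the balanced test samples and labels described above.

```python
# Generic evaluation sketch for the reported metrics (AUC, accuracy, precision, recall).
from sklearn.metrics import roc_auc_score, accuracy_score, precision_score, recall_score

scores = model.predict(x_test)[:, 1]    # softmax probability of the "highlight" class
y_pred = (scores >= 0.5).astype(int)    # i.e. assign the sample to the higher-scoring class
print("AUC:      ", roc_auc_score(y_true, scores))
print("Accuracy: ", accuracy_score(y_true, y_pred))
print("Precision:", precision_score(y_true, y_pred))
print("Recall:   ", recall_score(y_true, y_pred))
```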

4.3 Qualitative Examples

We also provide qualitative results to validate our approach. Figure 4 shows the HL score, summed over all the cuboids, for every non-overlapping 10-second slice of an entire match (3 periods of 20 min plus timeouts). Goals are clearly identified in the first two periods, while in the third one other events also trigger the audience behavior; in particular, there are two prominent events that do not correspond to goals, at 18:45 (caused by a player almost scoring) and at 28:15 (caused by a foul in front of the goaltender and the resulting penalty). We can easily see that there is a correlation between the HL score and important events in the game, and that goals usually cause the biggest reaction among the spectators.

5 Conclusions

In this paper we propose a method to temporally locate highlights in a sport event by analyzing solely the audience behavior. We propose to use a deep 3D convolutional neural network on cuboid video samples to discriminate between different levels of spectator excitement. A spatial accumulator is used to produce a score that is proportional to the probability of having an interesting highlight at that precise time. This enables the model to identify goals and other salient actions.

Despite being very simple, the model we present provides good preliminary results on a public dataset of hockey games, encouraging further research based on this approach. In our opinion, the main limitation of this model is the way we take temporal information into account: we extend a standard CNN to work with 3D data, where the third dimension is time. A more sophisticated temporal model, such as a recurrent neural network (RNN) or long short-term memory (LSTM), could benefit the final inferential results. As future work we intend to replace the accumulator with such a temporal model, expanding the classification into a multiclass problem in order to detect different events. To do so, the dataset has to be enlarged, possibly with recordings from different venues, to make sure the network learns more general discriminative features.