SPNet: A deep network for broadcast sports video highlight generation

https://doi.org/10.1016/j.compeleceng.2022.107779

Abstract

Professionally broadcast sports videos usually have long durations but contain only a few exciting events. In general, professional bodies and amateur content creators spend thousands of man-hours manually cropping the exciting video segments from these long-duration videos to produce handcrafted highlights. Sports enthusiasts keep themselves updated on the latest happenings through such highlights. There is a need for a method that accurately and automatically recognizes the exciting activities in a sports game. To address this issue, we present a deep learning-based network, SPNet, that recognizes exciting sports activities by exploiting high-level visual feature sequences and automatically generates highlights. The proposed SPNet utilizes the strength of 3D convolutional networks and Inception blocks for accurate activity recognition. We divide sports video excitement into views, actions, and situations. Moreover, we provide 156 new annotations for about twenty-three thousand videos of the SP-2 dataset. Extensive experiments are conducted on two datasets, SP-2 and C-sports, and the results demonstrate the superiority of the proposed SPNet. Our method achieves the highest performance for view, action, and situation activities, with an average accuracy of 76% on the SP-2 dataset and 82% on the C-sports dataset.

Introduction

The last decade has witnessed a dramatic increase in the number of videos uploaded to the internet, especially on video-sharing platforms where these videos can persist for a long time. Besides user-generated videos, such videos include TV programs, dramas, sports, talk shows, etc. A large portion of these videos belongs to the category of sports. Usually, a video is accompanied by user-defined tags or keywords, but the video data itself remains unstructured, and such tags cannot explain what exactly is going on in the video. Nevertheless, efforts have been made to understand video content [1], [2].

Professionally broadcast sports videos usually have long durations but contain only a few exciting moments [3]. Different sports have different rules, and a game may last from one hour to a couple of days. Sports highlights can be considered a video-based summary that contains only the exciting or important events. Highlights deliver the whole excitement of the game in a much shorter period, and they are the main source for sports enthusiasts to keep up to date amid a busy lifestyle [4]. Traditionally, such highlights are cropped manually. Blog writers and other content creators spend thousands of man-hours producing them, and it is quite challenging to compile a unique set of highlights for different sports videos. There exists a need for automatic sports video summarization methods.

The recent explosion of Artificial Intelligence (AI), especially deep learning, has created opportunities for advanced visual information processing [5], [6], [7]. Deep learning and AI-based techniques have already been incorporated successfully into various real-life applications [8], [9], and they can be used to build automatic tools for generating broadcast sports highlights. However, many challenges stand in the way of realizing such tools. First, different people may hold different opinions about what counts as an exciting event: in a soccer game, one spectator may consider only a goal exciting, whereas another may find a missed goal just as exciting. This difference in opinion can be resolved by detecting all the exciting events and presenting users with a highlights summary based on their preferences. The second challenge is the diverse nature of sports, as every kind of sport differs from the others in its rules and playfield scenarios [10]. The third challenge is the availability of training data, especially for deep learning-based methods. The fourth and most challenging aspect is the nature of broadcast sports videos themselves. Unlike user-generated videos, broadcast sports videos are recorded through multiple cameras with different views, and the cameras are switched rapidly as per the instructions of the sports director. These properties of broadcast sports videos have not been adequately acknowledged by previous research.
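To make the preference-based resolution of the first challenge concrete, the following is a small hypothetical post-processing sketch in Python. It assumes a recognizer (such as the SPNet described later) emits one (view, action, situation) label triple per video segment; the function and label names are illustrative and not taken from the paper.

from typing import List, Set, Tuple

def select_highlights(predictions: List[Tuple[int, str, str, str]],
                      preferred_actions: Set[str]) -> List[int]:
    """Keep the segments whose predicted action matches the user's preferences.

    predictions holds one (segment_index, view, action, situation) per segment.
    """
    return [idx for idx, _view, action, _situation in predictions
            if action in preferred_actions]

# Example: one spectator only wants goals, another also wants missed goals.
segments = [(0, "long-view", "goal", "celebration"),
            (1, "close-up", "pass", "normal"),
            (2, "long-view", "goal-miss", "replay")]
print(select_highlights(segments, {"goal"}))               # -> [0]
print(select_highlights(segments, {"goal", "goal-miss"}))  # -> [0, 2]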

Automatic sports video summarization is a challenging problem. Previous studies rely on tracking the players’ activities [11], [12], [13], monitoring the crowd noise [4], [14], clustering similar frames of the videos, analyzing player actions, and extracting useful information from caption regions or user-generated comments [3]. Clustering-based approaches use low-level features to reduce visual redundancy, while other methods focus on extracting semantic features. The unstructured nature of sports video makes summarization a challenging task.

Previous approaches have a high chance of missing exciting events. For example, the caption-based approaches [15], [16] depend on the text-based information provided by the broadcaster. Such an approach can precisely track an important event, e.g., a goal in a football match, but it cannot detect goal misses, corner shots, outs, etc., which might also interest a spectator. Tracking players [11], [12], [13] has its benefits, but due to the versatile nature of sports, it is impractical and challenging to track players across different sports categories. Moreover, in sports such as cricket and baseball, the players are idle most of the time, so this approach also has a high chance of missing exciting events. Audio cues (the cheering of the crowd) are an important feature for detecting an exciting moment [4], but they do not indicate the nature of the event; besides, the crowd has been observed to cheer without any reason on several occasions, which may lead to false detections. Finally, as mentioned above, rapid camera-view changes and camera movement pose a challenge to clustering-based methods.

In this paper, we propose a novel approach to recognize sports activities and robustly describe “what is going on in a video segment”. Such descriptions facilitate highlight generation based on spectators’ preferences. The proposed approach has practical significance and can help in extracting highlights from various sports categories. First, unlike previous studies, we separate broadcast sports video scenes based on views, actions, and situations (details are provided in Section 4.1). Second, we propose a deep learning-based network (SPNet) that collectively recognizes exciting events based on spatiotemporal high-level visual features. The proposed network utilizes 3D-ResNet, which extracts spatiotemporal information directly using 3D kernels. Moreover, we utilize the Inception V3 block for collectively recognizing views and situations. The Inception block stacks 11 Inception modules, where each module consists of convolution filters, pooling layers, rectified linear units, and filter concatenation. Feature sequences are constructed from every frame under consideration and then used to train the network; a minimal sketch of this two-branch design is given after the contribution list below. Finally, exciting events are evaluated using the proposed prediction algorithm. The contributions of this paper can be summarized as follows:

  • We propose a deep learning network (SPNet) that exploits high-level visual feature sequences to accurately describe “what is happening” in a broadcast sports video scene and utilizes this information for generating highlights based on spectators’/users’ preferences.

  • We add fine-grained annotations to the SP-2 dataset and separate the annotations according to view, action, and situation.

  • We perform extensive experiments to validate the behavior of our proposed solution, and the results of these experiments indicate the superiority of the proposed approach. Relevant data and code are publicly available.
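As noted above, here is a minimal sketch, assuming torchvision >= 0.13, of how SPNet's two-branch design could be wired up. It mirrors the description only at a high level: the 3D-ResNet branch extracts spatiotemporal features used for action recognition, while the Inception-V3 branch yields appearance features used for view and situation recognition. The class counts, middle-frame sampling, input resolution, and head wiring are our assumptions, not the authors' released architecture.

import torch
import torch.nn as nn
from torchvision.models import inception_v3
from torchvision.models.video import r3d_18

class SPNetSketch(nn.Module):
    def __init__(self, n_views=4, n_actions=12, n_situations=6):  # hypothetical counts
        super().__init__()
        # 3D-ResNet backbone: 3D kernels capture motion across frames.
        r3d = r3d_18(weights=None)
        self.motion = nn.Sequential(*list(r3d.children())[:-1])  # drop the fc layer
        # Inception-V3 backbone for per-frame appearance features.
        incep = inception_v3(weights=None, aux_logits=False)
        incep.fc = nn.Identity()  # expose the 2048-d pooled features
        self.appearance = incep
        # Separate classification heads, one per annotation group.
        self.action_head = nn.Linear(512, n_actions)          # from the 3D branch
        self.view_head = nn.Linear(2048, n_views)             # from the Inception branch
        self.situation_head = nn.Linear(2048, n_situations)   # from the Inception branch

    def forward(self, clip):
        # clip: (batch, 3, frames, 112, 112); 112 x 112 satisfies both backbones.
        m = self.motion(clip).flatten(1)             # (batch, 512) motion features
        mid_frame = clip[:, :, clip.shape[2] // 2]   # one representative frame
        a = self.appearance(mid_frame)               # (batch, 2048) appearance features
        return self.view_head(a), self.action_head(m), self.situation_head(a)

# Example forward pass over a 16-frame segment.
model = SPNetSketch().eval()
with torch.no_grad():
    view_logits, action_logits, situation_logits = model(torch.randn(1, 3, 16, 112, 112))

Routing views and situations through the 2D branch and actions through the 3D branch follows the stated division of labor; the actual released code may fuse the branches differently.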

The rest of the paper is organized as follows: Section 2 gives an overview of the related work. Section 3 presents the proposed method in detail. Section 4 presents the results and discussion, followed by the conclusion in Section 5.

Section snippets

Related work

In this section, we shed light on related studies and contributions. Many researchers have devoted their time and resources to the field of video summarization or video abstraction. Sports video highlight generation can be considered a subclass of video summarization. Some studies focused on generating highlights from sports videos, such as [4]. Various studies focused on analyzing only a specific category of sports, e.g., basketball [17], tennis [18], soccer [19]

Our method

This section first presents the related background knowledge, followed by further details about SPNet.

Experiments and results

We performed comprehensive experiments on the SP-2 dataset to find the best-performing method. In this section, we present details about the dataset, the results of the experiments, and a discussion of the results.

Conclusion

Broadcast sports videos usually have long durations and contain only a few exciting moments. It is not feasible for sports enthusiasts to watch the whole game. For this reason, many professional bodies and amateur content creators manually crop video segments from long-duration broadcast videos. There exists a need for an automatic method capable of extracting the exciting moments while keeping in view the differing opinions and preferences of sports enthusiasts (user preferences). In

CRediT authorship contribution statement

Abdullah Aman Khan: Data curation, Methodology, Writing – original draft. Jie Shao: Conceptualization, Supervision, Writing – review & editing.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

This work is supported by the National Natural Science Foundation of China (No. 61832001).

References (38)

  • Lu H. et al. DRRS-BC: decentralized routing registration system based on blockchain. IEEE CAA J Autom Sin (2021)
  • Khan A.A. et al. RICAPS: residual inception and cascaded capsule network for broadcast sports video classification
  • Host K., Ivasic-Kos M., Pobar M. Tracking handball players with the deepsort algorithm. In Proceedings of the 9th...
  • Tanikawa S., Tagawa N. Player tracking using multi-viewpoint images in basketball analysis. In Proceedings of the 15th...
  • Lin C. et al. Sports video summarization with limited labeling datasets based on 3D neural networks
  • Miao G., Zhu G., Jiang S., Huang Q., Xu C., Gao W. The demo: A real-time score detection and recognition approach in...
  • Khan A.A., Lin H., Tumrani S., Wang Z., Shao J. Detection and localization of scorebox in long duration broadcast sports...
  • Yoon Y. et al. Analyzing basketball movements and pass relationships using realtime object tracking techniques based on deep learning. IEEE Access (2019)
  • Ghosh A. et al. Smarttennistv: Automatic indexing of tennis videos