SPNet: A deep network for broadcast sports video highlight generation☆
Introduction
The last decade has witnessed a dramatic increase in the number of videos uploaded to the internet, especially on video-sharing platforms, where such videos can persist for a long time. Besides user-generated videos, they include TV programs, dramas, sports, talk shows, etc. A large portion of these videos belongs to the sports category. A video is usually accompanied by user-defined tags or keywords, but the video data itself remains unstructured, and such tags cannot explain what exactly is going on in the video. Nevertheless, efforts have been made to understand video content [1], [2].
Professionally broadcast sports videos usually have long durations but contain only a few exciting moments [3]. Different sports have different rules, and a game may last from one hour to a couple of days. Sports highlights can be considered a video-based summary that contains only the exciting or important events. Highlights deliver the full excitement of the game in a much shorter period, and they are the main source for sports enthusiasts to keep themselves up to date in a busy lifestyle [4]. Traditionally, such highlights are cropped manually. Blog writers and other content creators spend thousands of man-hours producing them, and compiling a unique set of highlights for different sports videos is quite challenging. There is therefore a need for automatic sports video summarization methods.
The recent explosion of Artificial Intelligence (AI), especially deep learning, has created opportunities for advanced visual information processing [5], [6], [7]. Deep learning and AI-based techniques have already been successfully incorporated into various real-life applications [8], [9], and they can be used to build automatic tools for generating broadcast sports highlights. However, realizing such tools involves several challenges. First, different people may have different opinions about what constitutes an exciting event: in a soccer game, some spectators may consider only goals exciting, whereas others may also find a missed goal exciting. This difference in opinion can be resolved by detecting all exciting events and presenting each user with a highlights summary based on their preferences. The second challenge is the diverse nature of sports, as every kind of sport differs from the others in its rules and playfield scenarios [10]. The third challenge is the availability of training data, especially for deep learning-based methods. The fourth and most challenging aspect is the nature of broadcast sports videos. Unlike user-generated videos, broadcast sports videos are recorded through multiple cameras with different views, and the cameras are switched rapidly according to the instructions of the sports director. These properties of broadcast sports videos have not been adequately acknowledged by previous research.
Automatic sports video summarization is a challenging problem. Previous studies rely on tracking the players’ activities [11], [12], [13], monitoring crowd noise [4], [14], clustering similar frames of the video, analyzing player actions, and extracting useful information from caption regions or user-generated comments [3]. Clustering-based approaches use low-level features to reduce visual redundancy, while other methods focus on extracting semantic features. The unstructured nature of sports video makes summarization a challenging task.
These previous approaches have a high chance of missing exciting events. For example, caption-based approaches [15], [16] depend on the text-based information provided by the broadcaster. Such an approach can precisely track an important event, e.g., a goal in a football match, but it cannot be employed to detect goal misses, corner shots, outs, etc., which might also interest a spectator. Tracking players [11], [12], [13] in a game has its benefits, but due to the versatile nature of sports, it is impractical and challenging to track players across different sports categories. Moreover, the players are idle for most of the time in sports such as cricket and baseball, so this approach also has a high chance of missing exciting events. Audio cues (the cheering of the crowd) are an important feature for detecting an exciting moment [4], but they do not indicate the nature of the event. Besides, the crowd sometimes cheers without an apparent on-field reason, which may lead to false detections of exciting events. Finally, as mentioned above, rapid camera view changes and camera movement pose a challenge to clustering-based methods.
In this paper, we propose a novel approach that recognizes the sports activity and robustly describes “what is going on in a video segment”. Such descriptions enable highlight generation based on spectators’ preferences. The proposed approach has practical significance and can help extract highlights from various sports categories. First, unlike previous studies, we separate broadcast sports video scenes based on views, actions, and situations (details are provided in Section 4.1). Second, we propose a deep learning-based network (SPNet) that collectively recognizes exciting events based on spatiotemporal high-level visual features. The proposed network utilizes 3D-ResNet, which directly extracts spatiotemporal information using 3D kernels. Moreover, we utilize the Inception V3 block for collectively recognizing views and situations. The Inception block stacks 11 inception modules, where each module consists of convolution filters, pooling layers, rectified linear units, and filter concatenation. Feature sequences are constructed from every frame under consideration and further trained using neural networks. Finally, exciting events are evaluated with the proposed prediction algorithm. The contributions of this paper can be summarized as follows:
- We propose a deep learning network (SPNet) that exploits high-level visual feature sequences to accurately describe “what is happening” in a broadcast sports video scene and utilizes this information for generating highlights based on spectators’/users’ preferences.
- We add fine-grained annotations to the SP-2 dataset and separate the annotations according to view, action, and situation.
- We perform extensive experiments to validate the behavior of our proposed solution, and the results of these experiments indicate the superiority of the proposed approach. Relevant data and code are publicly available.
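The paper does not spell out the prediction algorithm at this point; as an illustrative sketch, assume SPNet emits per-segment event probabilities (from the 3D-ResNet/Inception features described above), after which segments matching a spectator's preferred event classes are thresholded and merged into highlight clips. The event names, threshold value, and merging rule below are hypothetical assumptions, not the paper's exact procedure.

```python
import numpy as np

# Assumed event vocabulary for a soccer-style example (hypothetical labels).
EVENTS = ["goal", "goal_miss", "corner", "out", "idle"]

def select_highlights(probs, preferred, threshold=0.5):
    """probs: (num_segments, num_events) per-segment class scores from the network.
    preferred: event names the spectator cares about.
    Returns merged (start, end) segment-index ranges forming the highlight reel."""
    idx = [EVENTS.index(e) for e in preferred]
    # A segment is kept if any preferred event scores above the threshold.
    keep = probs[:, idx].max(axis=1) >= threshold
    ranges, start = [], None
    for i, flag in enumerate(keep):
        if flag and start is None:
            start = i                      # open a new highlight clip
        elif not flag and start is not None:
            ranges.append((start, i - 1))  # close the current clip
            start = None
    if start is not None:
        ranges.append((start, len(keep) - 1))
    return ranges

# Toy scores for 6 segments: segments 1-2 resemble a goal, segment 4 a corner.
scores = np.array([
    [0.1, 0.1, 0.1, 0.1, 0.6],
    [0.7, 0.1, 0.1, 0.0, 0.1],
    [0.6, 0.2, 0.1, 0.0, 0.1],
    [0.1, 0.1, 0.2, 0.1, 0.5],
    [0.1, 0.1, 0.8, 0.0, 0.0],
    [0.0, 0.1, 0.1, 0.2, 0.6],
])
print(select_highlights(scores, ["goal", "corner"]))  # → [(1, 2), (4, 4)]
```

Merging adjacent selected segments into ranges, rather than returning them individually, keeps a multi-segment event (e.g., a goal build-up) as one continuous clip in the final highlights.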
The rest of the paper is organized as follows: Section 2 gives an overview of the related work. Section 3 presents the proposed method in detail. In Section 4, we demonstrate the results and their discussion followed by a conclusion in Section 5.
Related work
In this section, we shed light on related studies and contributions. Many researchers have devoted time and resources to the field of video summarization (also called video abstraction), and sports video highlight generation can be considered a subclass of it. Some studies focus on generating highlights from sports videos, such as [4]. Various studies analyze only a specific category of sports, e.g., basketball [17], tennis [18], soccer [19].
Our method
This section first presents the related background knowledge, followed by further details about SPNet.
Experiments and results
We performed comprehensive experiments on the SP-2 dataset to find the best-performing method. In this section, we present details about the dataset, the experimental results, and a discussion of those results.
Conclusion
Broadcast sports videos usually have long durations and contain only a few exciting moments, and it is not feasible for sports enthusiasts to watch the whole game. For this reason, many professional bodies and amateur content creators manually crop video segments from long-duration broadcast videos. There exists a need for an automatic method capable of extracting the exciting moments while keeping in view the different opinions and preferences of sports enthusiasts (user preferences).
CRediT authorship contribution statement
Abdullah Aman Khan: Data curation, Methodology, Writing – original draft. Jie Shao: Conceptualization, Supervision, Writing – review & editing.
Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Acknowledgments
This work is supported by the National Natural Science Foundation of China (No. 61832001).
References
- et al., Content to cash: Understanding and improving crowdsourced live video broadcasting services with monetary donations, Comput Netw, 2020.
- et al., Visual information processing for deep-sea visual monitoring system, Cogn Robot, 2021.
- et al., Multi-camera multi-player tracking with deep player identification in sports video, Pattern Recognit, 2020.
- et al., Collective sports: A multi-task dataset for collective activity recognition, Image Vis Comput, 2020.
- et al., Neural multimodal cooperative learning toward micro-video understanding, IEEE Trans Image Process, 2020.
- A survey of content-aware video analysis for sports, IEEE Trans Circuits Syst Video Technol, 2018.
- et al., Content-aware summarization of broadcast sports videos: An audio-visual feature extraction approach, Neural Process Lett, 2020.
- et al., Deep fuzzy hashing network for efficient image retrieval, IEEE Trans Fuzzy Syst, 2021.
- et al., WideSegNeXt: Semantic image segmentation using wide residual network and next dilated unit, IEEE Sens J, 2021.
- et al., User-oriented virtual mobile network resource management for vehicle communications, IEEE Trans Intell Transp Syst, 2021.
- DRRS-BC: Decentralized routing registration system based on blockchain, IEEE CAA J Autom Sin.
- RICAPS: Residual inception and cascaded capsule network for broadcast sports video classification.
- Sports video summarization with limited labeling datasets based on 3D neural networks.
- Analyzing basketball movements and pass relationships using real-time object tracking techniques based on deep learning, IEEE Access.
- SmartTennisTV: Automatic indexing of tennis videos.
☆ This paper is for regular issues of CAEE. Reviews were processed and recommended for publication by Co-Editor in Chief Prof. Huimin Lu.