1 Introduction

Over the last decade, as the volume of available sports multimedia has grown, so has the technology for content-based analysis of sports video. Given its high commercial potential and wide viewership, developing a potent representation of sports video content has become paramount (Zhu et al. [1]). Several investigations have used traditional approaches that learn frame-based spatio-temporal features (Zhao and Elgammal [2]) for action recognition and motion tracking, focusing on close-up views of the motion of human body parts (Pingali et al. [3], Shah and Jain [4]). Such approaches are constrained by the requirement of high-resolution, multiple near-view cameras. For far-view frames, tracking human body motion for action classification becomes less effective and is often inapplicable. The advent of deep learning techniques in motion tracking enabled the extraction of robust spatio-temporal features with fewer constraints on the input. The success of deep convolutional neural networks (CNNs) in visual tasks and of recurrent neural networks (RNNs) in sequential data interpretation encouraged their use in video action recognition.

In this context, Donahue et al. [5] employed a temporal feature based deep learning scheme that incorporates a recurrent neural network and a conventional 2D-CNN, trained end to end, for activity recognition and captioning of videos. This contrasts with learning schemes based on a frame-by-frame spatio-temporal representation obtained either by pooling over motion features or by learning time-varying weights. Single stream networks also exploited 2D-CNNs to extract features from frames and to fuse the information at different levels of the network (Karpathy et al. [6]). Tran et al. [7] assessed the merits of 3D-CNNs over 2D-CNNs for spatio-temporal feature learning. They designed a 3D-CNN followed by a linear classifier, which outperformed other LSTM and pre-trained 2D-CNN based methods. Other studies include the integration of 3D-CNN and RNN networks (Yao et al. [8]) and a convolutional two stream fusion network (Feichtenhofer et al. [9]). These networks derive spatio-temporal representations well in video analysis, but at a high computational cost, which hinders their real-time application and deployment in broadcast events, especially in games with rapid motion such as badminton.

In this work we aim to address the aforementioned issue. It should also be noted that one of the drawbacks of the CNNs and LSTMs used so far in action classification frameworks is vanishing and exploding gradients, which arise from arbitrary scaling after each consecutive layer. These contribute to the computational complexity of the said models. The Neural Accumulator, or NAC (Trask et al. [10]), on the other hand, maintains a consistent scale of the input because the entries of its weight matrix are driven towards 1, 0 and −1. NAC works as an affine transformation layer in which the scale of the output remains consistent as a result of its weight matrix. This enables multiple NAC units to be stacked together without the drawbacks faced by CNNs or LSTMs. We aim to exploit this characteristic of NAC in our video action classification method.

The main contributions of this paper are two novel, end-to-end trained NAC based frameworks for action classification in video analysis. The proposed models need not be trained on GPUs, and they achieve high classification accuracy with strikingly low training and testing times. For comparison, three deep learning based methods for stroke classification have been implemented, viz. a denoising FCN autoencoder, a temporal convolutional network and a CNN-LSTM (LRCN). These models had to be trained on GPUs, since they take more than 2 h to train on a CPU. The proposed models perform better than the comparing methods in terms of classification accuracy, except for the LRCN. In terms of computation time, our models always exhibit lower training and testing times, even though they are run on a CPU while the comparing methods are run on GPUs. In other words, if all models were run on the same hardware (all on GPUs or all on CPUs), the difference in time, i.e. the superiority of the proposed models, would be even more prominent.

Experiments have been performed using 5-fold cross validation for several test-train splits varying from \(10\%\) to \(50\%\) to verify the effectiveness of the proposed models. The rest of the paper is arranged as follows. A brief discussion of related work in sports video action classification is presented in Sect. 2, followed by the proposed NAC based architectures and the comparing methods in Sects. 3 and 4, respectively. In Sects. 5 and 6 we discuss the experimental protocols (viz. the dataset used, its pre-processing and preparation) and the results obtained. Lastly, the effectiveness of the proposed method and the future scope are summarized in Sect. 7. The code and dataset for this work are available on GitHub (see Footnote 1).

Fig. 1. Frame instances for three different YouTube badminton matches in the UIUC2 dataset

2 Related Works

Badminton is a fast-paced sport, and analyzing a player's strokes and performance can be of benefit to the player. Chu and Situmeang [11] studied the classification of player strategy based on the pose and stroke information of the player. Ghosh et al. [12] discussed an end-to-end framework for the analysis of badminton videos that performs object detection, action recognition and segmentation. Furthermore, Yoshikawa et al. [13] proposed an automatic serve scene detection method that uses no prior knowledge and no segmentation, extracting motion and posture features of players using shift-invariance information. By then applying linear regression to the extracted features, they detected specific scenes and achieved high precision and recall values. Chu and Situmeang [11] clustered the player strategies into two categories, offensive and defensive, and performed stroke classification on badminton video after detecting the players and the court.

Studying players' strategies retrieved through successful badminton video analysis tools could add another dimension to game play and preparation for the players. In this study, three 2006 Badminton World Cup match videos from the UIUC2 dataset have been used; they are far-view and low resolution, as they were captured using a static camera. Of the three matches, one is a singles match and the other two are doubles matches, as shown in Fig. 1. We extract the player instances from the video frames using segmentation masks. These instances are then annotated into six different action sequences, namely forehand, backhand, lob, serve, smash and react. The same stroke sequences of the top and bottom players are classified together as one action. Serve sequences were found to be the least frequent and react sequences the most frequent. Multivariate recognition is performed on the annotated actions and accuracies are reported. The similarity in the poses of the players across different actions and the low spatial resolution made the classification task more challenging.

3 Proposed NAC Framework for Stroke Classification

Two novel NAC based frameworks are proposed in this section. The first explores the advantages of using a NAC unit for feature extraction (spatial features per frame) as opposed to a computationally expensive autoencoder or convolutional neural network model. The extracted features are then fed as input to an LSTM model for classification. The second approach uses NAC to learn the pixel-wise temporal dependencies from the input frames, replacing the LSTM. In this scheme a dense layer carries out the multivariate classification.

3.1 Neural Accumulator (NAC)

The neural accumulator (NAC) is a neural network unit whose weight parameters (W) take values close to 1, 0 or −1. Trask et al. [10] proposed a differentiable and continuous parameterization of these weights that is easily trainable with gradient descent.

$$\begin{aligned} Weights = \tanh (\hat{W})\odot \sigma {(\hat{M})} \end{aligned}$$
(1)

where \(\hat{W}\) and \(\hat{M}\) in (1) are randomly initialized and \(\odot \) denotes the element-wise product. \(\tanh \) is the hyperbolic tangent, whose values lie between −1 and 1, whereas the sigmoid function \(\sigma \) lies between 0 and 1; hence their element-wise product lies in the range [−1, 1], with a bias towards −1, 0 and 1. There are two NAC implementations, simple and complex: the simple one supports linear operations such as addition and subtraction, and the complex one supports numeric operations such as multiplication and division. These two implementations, combined with a gated logic, form the basis of the Neural Arithmetic Logic Unit (NALU). Within the scope of this study we explore the utility of NAC for spatial and temporal feature extraction towards badminton stroke classification.
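The parameterization in (1) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names `nac_weights` and `nac_forward`, the array shapes, and the row-vector convention `x @ W` are assumptions made only for illustration.

```python
import numpy as np


def sigmoid(z):
    """Logistic function, values in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))


def nac_weights(w_hat, m_hat):
    """Effective NAC weights: tanh(W_hat) * sigmoid(M_hat), element-wise."""
    return np.tanh(w_hat) * sigmoid(m_hat)


def nac_forward(x, w_hat, m_hat):
    """Apply the NAC as a bias-free linear map (row-vector convention, assumed)."""
    return x @ nac_weights(w_hat, m_hat)


rng = np.random.default_rng(0)
w_hat = rng.normal(size=(4, 3))   # hypothetical shape: 4 inputs, 3 NAC units
m_hat = rng.normal(size=(4, 3))
W = nac_weights(w_hat, m_hat)     # every entry lies strictly inside (-1, 1)

# When the underlying parameters saturate, the weights approach 1, -1 and 0.
W_sat = nac_weights(np.array([[10.0, -10.0, 10.0]]),
                    np.array([[10.0, 10.0, -10.0]]))
```

Because every effective weight is bounded by 1 in magnitude and there is no bias term, the layer only adds or subtracts (scaled) inputs, which is the property the text relies on when stacking NAC units.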

3.2 NAC for Spatial Feature Extraction

In this framework the preprocessed data sequences, each 44 frames long with frame dimension 32 \(\times \) 32, are given as input to a NAC layer and transformed spatially, frame by frame. The number of NAC units stacked together is the same as the number of pixels in a frame, 1024 (32 \(\times \) 32) in this case, so that the number of input and output units is equal. The intuition is to transform the (number of frames \(\times \) number of pixels) input into a sparse representation in which only the relevant pixels remain non-zero. This transformed representation is then given as input to an LSTM layer, with the number of LSTM cells equal to the number of frames, to learn the relations of these pixels across the temporal dimension. The NAC layer thus has two weight matrices, \(\hat{W}\) and \(\hat{M}\), each of shape (number of pixels \(\times \) number of NAC units). Since the number of units equals the number of pixels here, the number of parameters in this layer is \(2 \times (\textit{number of pixels})^{2}\). This framework is trained end-to-end.
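The parameter count stated above can be checked with a few lines of arithmetic. The variable names are illustrative only; the sketch assumes the NAC layer has no bias and exactly two weight matrices, as described in the text.

```python
n_frames = 44                     # frames per stroke sequence
height, width = 32, 32            # frame size after resizing
n_pixels = height * width         # 1024 inputs per frame
n_units = n_pixels                # one NAC unit per pixel

# Two matrices (W_hat and M_hat), each of shape (n_pixels, n_units), no bias.
nac_params = 2 * n_pixels * n_units
print(nac_params)  # 2097152
```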

The major contribution of the proposed architecture is that it eliminates the need for a CNN or autoencoder based model for feature extraction. NAC readily distinguishes between relevant and non-relevant pixels, such as background pixels. The learned weights allow maximum signal to pass from the player body and racquet pixel positions that are most deterministic for a stroke, such as hand and leg stances. Further, the number of computations needed to extract features with NAC is significantly lower than in convolutional frameworks, owing to the reduced number of matrix operations performed. A single layer of NAC units is sufficient, compared with multiple convolutional layers. Learning a NAC function over the sequences followed by an LSTM proves sufficient to classify the strokes with high accuracy in significantly less training time. Let \(x_{nac}\) be the input set of images (each frame of a sequence) to the NAC, and \(A_{lstm}\) be the output of the NAC to be fed to an LSTM layer (as a sequence of 44 frames); then

$$\begin{aligned} A_{lstm} = \sigma ( Weights\cdot x_{nac}) \end{aligned}$$
(2)

where Weights is initialized randomly as given by (1) and \(\sigma \) is the activation function. With \(A_{lstm}\) from (2) as input, the output gate of the LSTM unit can be defined as

$$\begin{aligned} O_{lstm} = \sigma ( W_{o}A_{lstm} + h_{t-1}U_{o}) \end{aligned}$$
(3)

where \(h_{t-1}\) in (3) is the previous hidden state, and \(W_{o}\) and \(U_{o}\) are randomly initialized weights of the output gate. The input gate, forget gate and hidden state outputs can be estimated similarly, using \(A_{lstm}\) as input. To train the proposed network, the Adam optimizer with a learning rate of 1e−4 is used to minimize the mean square error. The weights are updated from the gradients computed by backpropagation through time after every 32 training samples. Table 1 shows the number of parameters and output shape details of the designed framework.
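As a rough illustration of (2) and (3), the following NumPy sketch computes the NAC output for one sequence and the LSTM output gate for the first time step. It is a sketch under stated assumptions rather than the trained model: the hidden size `hidden`, the random stand-in data, the initialization scale, and the row-vector convention are all hypothetical.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


rng = np.random.default_rng(0)
n_frames, n_pixels = 44, 32 * 32
hidden = 16  # hypothetical LSTM state size; not specified in the text

# Stand-in for one flattened stroke sequence (frames x pixels).
x_nac = rng.random((n_frames, n_pixels))

# Eq. (1): effective NAC weights from the two underlying matrices.
w_hat = rng.normal(scale=0.1, size=(n_pixels, n_pixels))
m_hat = rng.normal(scale=0.1, size=(n_pixels, n_pixels))
weights = np.tanh(w_hat) * sigmoid(m_hat)

# Eq. (2): frame-wise spatial transform, one output per pixel position.
a_lstm = sigmoid(x_nac @ weights)            # shape (44, 1024)

# Eq. (3): output gate at the first time step, with h_{t-1} = 0.
w_o = rng.normal(scale=0.1, size=(n_pixels, hidden))
u_o = rng.normal(scale=0.1, size=(hidden, hidden))
h_prev = np.zeros(hidden)
o_lstm = sigmoid(a_lstm[0] @ w_o + h_prev @ u_o)   # shape (hidden,)
```

The input, forget and candidate gates would follow the same pattern with their own weight matrices.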

Table 1. Detailed layer configurations NAC-LSTM model

3.3 NAC for Temporal Feature Extraction

In this framework the NAC unit is trained to learn meaningful temporal transformations of the features from the input sequence of frames. The intuition is to determine the interdependency of a pixel at a fixed position across all the frames of a sequence. The input to the NAC layer is of dimension (number of pixels \(\times \) number of frames), and the layer outputs a compressed representation of size (number of pixels \(\times \) k), where \(k \le \textit{number of frames}\). When the number of NAC units (k) is chosen to be less than the number of frames, the model performs dimensionality reduction. It can be thought of as the video being compressed to a smaller number of frames, in which each non-zero pixel contains the information most deterministic for a stroke. The outputs of the NAC layer are then given as input to a dense layer with sigmoid activation for stroke classification. This framework has about the same number of parameters as the previous one, yet its training and testing times are halved. This is because the weight matrices of the NAC layer are of size (number of frames \(\times \) k). Thus, even for a small image of size 32 \(\times \) 32, this reduction in the size of the weight matrices significantly improves the computation time. Let \(x_{nac}\) be the input set of images (each frame of a sequence) to the NAC, identical to that of the NAC-LSTM framework, and \(A_{dense}\) be the output of the NAC; then

$$\begin{aligned} A_{dense} = \sigma ( Weights\cdot (x_{nac}.T)) \end{aligned}$$
(4)

where Weights are initialized randomly as given by (1), and \(\sigma \) is the activation function. The output \(A_{dense}\) (4) is flattened and given to a dense layer with sigmoid activation. It classifies the input into different stroke actions as shown in (5).

$$\begin{aligned} O_{dense} = \sigma ( W_{o}A_{dense}) \end{aligned}$$
(5)

where \(O_{dense}\) is the output of the dense layer and \(W_{o}\) are its weights, initialized with Glorot uniform. The batch size is 32 and the model is trained with the Adam optimizer at a learning rate of 1e−4. Table 2 gives the detailed network configuration.
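The forward pass described by (4) and (5) can be sketched in NumPy as follows. The data are random stand-ins, the choice `k = 44` and the helper `glorot_uniform` are illustrative assumptions, and the row-vector convention is chosen for convenience; this is not the authors' Keras code.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def glorot_uniform(rng, fan_in, fan_out):
    """Glorot/Xavier uniform initialization for a (fan_in, fan_out) matrix."""
    limit = np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-limit, limit, size=(fan_in, fan_out))


rng = np.random.default_rng(0)
n_frames, n_pixels, n_classes = 44, 32 * 32, 6
k = 44  # number of NAC units; any k <= n_frames compresses along time

x_nac = rng.random((n_frames, n_pixels))     # stand-in sequence, frames x pixels

# Eq. (1): the NAC weights now act on the time axis, shape (n_frames, k).
w_hat = rng.normal(scale=0.1, size=(n_frames, k))
m_hat = rng.normal(scale=0.1, size=(n_frames, k))
weights = np.tanh(w_hat) * sigmoid(m_hat)

# Eq. (4): each pixel's trajectory over 44 frames is mapped to k values.
a_dense = sigmoid(x_nac.T @ weights)         # shape (1024, k)

# Eq. (5): flatten and classify into the six strokes with a sigmoid dense layer.
w_o = glorot_uniform(rng, n_pixels * k, n_classes)
o_dense = sigmoid(a_dense.reshape(-1) @ w_o) # shape (6,)
predicted = int(np.argmax(o_dense))
```

Here the two NAC matrices hold only 2 × 44 × k entries, compared with 2 × 1024 × 1024 in the spatial variant.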

Table 2. Detailed layer configurations NAC-Dense model

4 Comparing Methods

4.1 Denoising Autoencoder Fed LSTM for Stroke Classification

In this study a denoising autoencoder fed LSTM model was implemented. The autoencoder was first trained using stochastic gradient descent with backpropagation to minimize the mean square error. Weights were updated after every batch of noisy (corrupted) training samples (noise factor of 0.2), with a step of 1e−3 times the partial derivative of the error with respect to the weight, i.e. a learning rate of 1e−3. The LSTM model was then trained on the latent space features extracted by the denoising autoencoder; the multivariate classification required one-hot coding of all classes. Further, 5-fold cross validation was carried out for 300 epochs with the Adam optimizer (learning rate = 1e−4) to obtain the accuracy for each of the defined classes.
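The text does not say how the training samples were corrupted. One common way to apply a noise factor of 0.2, shown here purely as an assumption, is to add scaled Gaussian noise to frames in [0, 1] and clip the result. The helper name `corrupt` and the stand-in data are hypothetical.

```python
import numpy as np


def corrupt(frames, noise_factor=0.2, rng=None):
    """Add scaled Gaussian noise and clip to [0, 1].

    Assumes pixel intensities are already scaled to [0, 1]; the noise model
    itself is an assumption, not taken from the paper.
    """
    rng = np.random.default_rng() if rng is None else rng
    noisy = frames + noise_factor * rng.normal(size=frames.shape)
    return np.clip(noisy, 0.0, 1.0)


rng = np.random.default_rng(0)
clean = rng.random((8, 80, 80))      # stand-in batch of 80 x 80 frames
noisy = corrupt(clean, rng=rng)
# The autoencoder would be trained to map `noisy` back to `clean`.
```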

The encoder unit of the designed autoencoder consists of four weight layers, each convolutional, with three 7 \(\times \) 7 filters and one 5 \(\times \) 5 filter. Between convolution layers, a simple max pooling operation with kernel dimension 2 \(\times \) 2 is employed. The decoder model has eight weight layers, each convolutional, with kernel dimensions identical to those of the encoder, in an attempt to reconstruct the input. In place of the max pooling layers of the encoder, the decoder has upsampling layers with filter dimension 2 \(\times \) 2. To add non-linearity, ReLU activation is used in the encoder and LeakyReLU in the decoder, which helps prevent the back-propagated gradients from vanishing or exploding, a problem often faced with sigmoid activation. In the LSTM model, we experimented with both a NAC and a sigmoid unit as the top layer for class prediction.

4.2 TCN for Stroke Classification

The TCN variant described by Lea et al. [14], namely the Dilated TCN, was used for experimentation. The Dilated TCN model is similar to the WaveNet architecture, in which a series of blocks is defined, each with several convolutional layers. To ease the combination of activations from different layers, the number of filters is kept the same for all these layers. Each layer applies a set of dilated convolutions with a given dilation rate, and the dilation rate increases over consecutive layers in a block. A residual connection combines the layer's input with the convolution signal.

The performance of the Dilated TCN model is evaluated for multiple dilation rate settings varying from 1 to 64 (i.e. 1, 2, 4, 8, 16, 32, 64) and for numbers of filters ranging from 64 to 256. Training is done using the Adadelta optimizer for 110 epochs. In this work, instead of using an additional spatio-temporal feature extractor and training the TCN on those features, the TCN is used to perform both tasks. This is feasible because the image size is small (32 \(\times \) 32), and the idea is to let the network learn the pixel-wise variation across the frames of a sequence rather than the variation of extracted features.
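To give a feel for what the dilation schedule implies, the short sketch below computes the temporal receptive field of one stack of dilated convolutions, with one layer per dilation rate. The kernel size is not stated in the text; `kernel_size = 2` is a hypothetical value used only for illustration.

```python
def receptive_field(dilations, kernel_size):
    """Receptive field (in frames) of stacked dilated 1-D convolutions,
    one layer per dilation rate, stride 1."""
    return 1 + (kernel_size - 1) * sum(dilations)


dilations = [2 ** i for i in range(7)]   # 1, 2, 4, 8, 16, 32, 64
kernel_size = 2                          # hypothetical; not given in the text
rf = receptive_field(dilations, kernel_size)
print(dilations, rf)  # [1, 2, 4, 8, 16, 32, 64] 128
```

Under this assumption a single block already spans more than the 44 frames of a stroke sequence, so smaller maximum dilation rates may suffice to cover a whole sequence.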

4.3 Long Term Recurrent Convolutional Network (LRCN) for Stroke Classification

LRCN (CNN-LSTM) combines a deep convolutional model for spatial feature extraction with a separate model for temporal feature synthesis (Donahue et al. [5]). In this framework a time distributed convolutional network is followed by an RNN unit for sequence prediction. The implemented model has 4 time distributed convolutional layers. The filter dimension is kept the same for all the layers, set to 3 \(\times \) 3, and the numbers of filters, chosen through ablative experimentation, are 32 and 64. A time distributed max pooling unit is employed after every two convolutional layers to keep the number of trainable parameters in check. Regularizers such as batch normalization and dropout are used to reduce overfitting to the training samples. Following this, a bidirectional LSTM layer and a fully connected layer classify the input sequence of frames into 6 classes. The network is trained using the Adadelta optimizer under the five-fold cross validation scheme for 80 epochs. The same test-train protocol is followed to verify the efficacy of the results.

5 Experiments

This section first provides an overview about the experiments performed followed by the dataset used and the pre-processing techniques employed.

5.1 Overview

Experiments have been performed on the badminton matches in the UIUC2 dataset, which were taken from YouTube videos of the 2006 Badminton World Cup. Every player is cropped out of the frames and the corresponding stroke played is determined. Six different stroke sequences, each consisting of 44 frames, have been annotated. Following data preparation, the total number of sequences was 427. 5-fold cross-validation has been used for training all the models. The split between train and test data was set at different percentages from 10% to 50% for thorough evaluation. Table 3 shows the data split statistics. For comparison, Dilated Temporal Convolutional Networks (TCN), Autoencoder-LSTM and Long-term Recurrent Convolutional Networks (LRCN) have been implemented.
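A minimal sketch of how 427 sequences could be partitioned into five folds is given below. It only illustrates the fold construction: the shuffling seed and the use of `np.array_split` are assumptions, and the way the folds are combined with the 10% to 50% test splits is not specified here.

```python
import numpy as np

n_sequences = 427
n_folds = 5

rng = np.random.default_rng(0)               # seed chosen arbitrarily
indices = rng.permutation(n_sequences)
folds = np.array_split(indices, n_folds)     # five near-equal index groups

splits = []
for i, val_idx in enumerate(folds):
    # Train on the other four folds, validate on the held-out one.
    train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
    splits.append((train_idx, val_idx))

fold_sizes = [len(f) for f in folds]
print(fold_sizes)  # [86, 86, 85, 85, 85]
```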

In addition, the models were trained on an i7 7700 processor with an Nvidia 1050Ti GPU and on a Tesla K-80 GPU. The models were implemented with the Keras library in Python, using Google Colab for the K-80 GPU. The training times of the different proposed models ranged from half an hour to two hours.

Table 3. Test data statistics

5.2 Dataset

In the dataset there are video frames from three different matches consisting of one singles and two doubles-matches. The total number of frames for the one singles and two doubles-matches were 3071, 1647 and 3936 respectively. Since the frames are derived from a Youtube video, the resolution of the images is quite low. Low quality of these images restricted us from using well researched approaches for action classification such as pose or posture estimation. It also prevented us from exploring the advantage of analysing footwork of the players for the purpose of strategy prediction. Utilizing the segmentation masks and bounding boxes data present in the dataset, we extracted every player’s segmented image for the given frames from three matches. Total number of unique players from UIUC2 dataset can be estimated to be ten given one singles match and two doubles matches.

For this study we required player instances for the different strokes played, in order to classify them into six annotated actions. Preparing the required dataset posed several challenges, such as occlusion and an irregular number of frame instances per stroke. Occlusion made it difficult to extract the bounding boxes of the players separately, so we had to avoid all instances where severe occlusion occurred. In the singles match dataset most of the bounding boxes for the top and bottom players were easily separable; however, in a few cases the stroke played could not be discerned. Occlusion posed a bigger problem in the doubles matches dataset, as it occurred not only between the top and bottom players but also between the two top players and between the two bottom players themselves. Examples of occlusion are given in Fig. 2.

Fig. 2. (a), (c) and (b), (d) display occlusion instances in bottom and top players respectively for the doubles matches dataset; (e) and (f) show occlusion instances in the singles match.

5.3 Data Preparation and Pre-processing

After extracting the bounding boxes of the top and bottom players separately from the dataset discussed above, we manually annotated the strokes into six categories with reference to the original badminton video frames. The strokes recognised were react, forehand, backhand, lob, serve and smash (Fig. 3). An additional label, no-play, was chosen for the instances when the players were neither reacting to a stroke played by the opponent nor playing any of the defined strokes. These instances have no effect on game play and were therefore discarded for the purpose of stroke classification.

The second challenge faced during data preparation was that, owing to occlusion, an unequal number of frames could be extracted for each of the defined strokes. To avoid adding complexity to our stroke classification framework, we required the same number of frames for all sequences. This was achieved by augmenting the initial set of extracted sequences uniformly so that each constituted a fixed number of frames: for simplicity and uniformity, every stroke sequence was made to have exactly 44 frames. The frames were then converted to gray scale and resized to 32 \(\times \) 32 for the NAC-LSTM and CNN-LSTM models and to 80 \(\times \) 80 for the Autoencoder-LSTM model, to reduce the number of trainable parameters and avoid overfitting while still extracting useful features. The total number of stroke sequences per dataset obtained by following these protocols is presented in Table 4. Furthermore, the stroke sequences have been normalised sample-wise using Eq. (6).

Table 4. Stroke sequences per dataset

Fig. 3. Instances of the different strokes annotated: (a) backhand, (b) forehand, (c) lob, (d) react, (e) serve and (f) smash

$$\begin{aligned} X_{norm} = \frac{X - X_{mean}}{X_{std}} \end{aligned}$$
(6)

where \(X_{norm}\) in (6) is the normalised vector, \(X_{mean}\) is the mean of the sample and \(X_{std}\) is the standard deviation of the sample. Data normalization is a powerful pre-processing tool that subdues the overall impact of outliers on the generalization of the network.
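The normalisation in (6) can be sketched as follows. Two details are assumptions: a "sample" is taken to be one whole 44-frame sequence, and a small `eps` is added to the denominator to guard against constant inputs. The function name `normalise_sample` is illustrative.

```python
import numpy as np


def normalise_sample(x, eps=1e-8):
    """Eq. (6): subtract the sample mean and divide by the sample std.

    `eps` is an added safeguard against division by zero (an assumption).
    """
    return (x - x.mean()) / (x.std() + eps)


rng = np.random.default_rng(0)
sequence = rng.random((44, 32, 32)) * 255.0   # stand-in gray-scale sequence
x_norm = normalise_sample(sequence)           # approx. zero mean, unit std
```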

6 Results and Discussion

The following section discusses the performance evaluation and comparisons of the proposed NAC based models with other deep learning based approaches.

6.1 Average Classification Accuracy

The NAC based frameworks have been implemented on CPU, in an effort to develop an algorithm capable of real-time application, whereas the other models, being computationally intensive, were executed on GPUs. Table 5 shows the classification accuracies obtained from the LRCN, TCN, Autoencoder-LSTM and NAC (CPU) based models for different test data splits. The Autoencoder-LSTM models with NAC and sigmoid units performed poorly on the combined dataset and took substantially more time to train (an average of 2 h on GPU) than the other models. However, on the singles-match videos, comprising a total of 233 sequences, the models with NAC and sigmoid units achieved average classification accuracies (over all the test split scenarios) of 85.37 and 84.66 respectively. LRCN achieved the highest average accuracy over the combined dataset, although it took much longer to train than the other frameworks. NAC-Dense and NAC-LSTM performed equally in terms of average accuracy, although the average training time (Table 6) of NAC-LSTM was almost double that of NAC-Dense. This is due to the difference in the number of NAC units stacked together in the two frameworks, which in turn affects the size of the weight matrices. In NAC-LSTM a total of 1024 (32 \(\times \) 32) units, determined by the dimension of the input frames, were stacked together, compared with only 44 units (the number of frames in a sequence) in NAC-Dense.

Table 5. Accuracy Values for the discussed models

6.2 Computation Time

Table 6 shows the training time taken by the different models when run on GPUs, in comparison with the proposed NAC frameworks run on CPU. The proposed models are superior to the comparing models in terms of computation time. On CPU the comparing models take 2 to 8 h to train, whereas the proposed models exhibit lower training times even when compared with the GPU runtimes of the other models. To determine the deployability of the proposed networks in real-time applications, the testing time must also be evaluated. Table 7 shows the testing time of the different models for the same test data split protocol. NAC-Dense gives the best test time, 0.0453 s averaged over all the data splits. NAC-LSTM, with an average test time of 0.1314 s, performs better than the LRCN model (average test time of 0.8359 s).

Table 6. Training time (in seconds) for the discussed models
Table 7. Test time (in seconds) for the discussed models

6.3 Performance Analysis

The performance evaluation graph in Fig. 4 plots the average training time against the average classification accuracy over all test-train split scenarios. It can be observed that Dilated TCN and NAC-LSTM have comparable training times, although the accuracy of the NAC based model is higher. Additionally, it is evident that the trade-off between time and average accuracy is best for the NAC-Dense framework. To further verify the classification ability of the proposed models, single stroke classification with NAC-Dense has been performed, as shown in Table 8. Six classifiers, each attuned to a single stroke, have been trained following the same \(10\%\) to \(50\%\) test-train split protocol. The performance of the model, computed as the average over all classifiers and all splits, comes out to be 88.10.
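One plausible reading of the single stroke experiment is a one-vs-rest setup, in which each of the six classifiers receives a binary target for its own stroke. The sketch below shows only how such targets could be built; the helper `one_vs_rest_targets` and the toy label list are hypothetical, and the text does not confirm that this is how the targets were constructed.

```python
import numpy as np

STROKES = ["forehand", "backhand", "lob", "serve", "smash", "react"]


def one_vs_rest_targets(labels, stroke):
    """Binary targets: 1.0 where the label equals `stroke`, else 0.0."""
    return (np.asarray(labels) == stroke).astype(np.float32)


# Toy annotations for seven sequences (illustrative only).
labels = ["serve", "react", "lob", "react", "smash", "forehand", "backhand"]
targets = {s: one_vs_rest_targets(labels, s) for s in STROKES}
# targets["react"] marks the second and fourth sequences as positives.
```

Each binary target vector would then be paired with the same input sequences to train one NAC-Dense classifier per stroke.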

Fig. 4. Performance evaluation of the proposed architectures

7 Conclusion and Future Work

In this work an attempt has been made towards developing an architecture that extracts rich spatio-temporal features from videos in minimal time and with low computational requirements. The ability of the Neural Accumulator (NAC) to maintain a consistent scale enables it to preserve gradients over multiple stacked NAC units. This results in faster convergence of NAC based frameworks compared with other CNN and LSTM based models. Two novel NAC based frameworks, viz. NAC for learning spatial (NAC-LSTM) and temporal (NAC-Dense) transformations, have been developed and applied, as an example, to the task of action classification in video analysis. The dataset used in this study is from UIUC2, which contains low resolution, far-view, static-camera videos of one singles and two doubles matches taken from YouTube. Six stroke sequences, namely forehand, backhand, lob, smash, serve and react, have been annotated for all three videos and pre-processed for the task of classification. The performance of the proposed models is compared with some state-of-the-art networks, e.g., FCN Autoencoder-LSTM, CNN-LSTM and Dilated TCN. A 5-fold cross validation protocol for training, with test data splits from \(10\%\) to \(50\%\), has been used, and average accuracy values and training and testing times have been computed. The proposed models are superior to all the comparing models in terms of computation time. In terms of classification accuracy, they are also better than the FCN autoencoder and Dilated TCN models. One may note that the superiority in computation time of the proposed models is achieved even though they are run on CPUs whereas the comparing models are run on GPUs. This contrast would become more prominent if all models were run on the same hardware, whether CPUs or GPUs. In addition, NAC-Dense achieves the best trade-off between average classification accuracy and training time. Thus, in future work it can be employed for real-time analysis of broadcast sport events with minimal computational requirements. Player and stroke-played dependencies could also be learned via deep learning models, and consequently player-wise stroke prediction in real time could be performed.

Table 8. NAC-Dense stroke wise classification accuracy