1 Introduction

Human action recognition from video data is a well-established area in computer vision [1]. To date, most efforts have been directed at building dynamic models of human movement that are sufficiently robust for a set of a priori given actions. These models are typically validated by predicting movement and/or recognizing actions according to some performance indicator [2]. However, they are mostly oriented toward detecting discrepancies in player performance rather than toward tracking the progress or development of the actions themselves.

For modeling purposes, playing dynamics are usually described by Hidden Markov Models or Gaussian Processes built from high-dimensional MoCap markers [3, 4]. Although these models can provide adequate performance, the involved actions are assumed to be smooth, slow, and cyclical movements. For movements with quick actions or responses, developing and tuning either approach may prove difficult, leading to low classification rates. Furthermore, these methods assign high weights to the body segments that carry most of the information about posture and gait, such as the lower limbs, trunk, and pelvis, while segments involved in faster movements, such as the upper limbs, are dismissed. To overcome this issue, kernel methods are an option that enables nonlinear dynamics modeling in a Hilbert space and can therefore enhance classification and prediction performance.

On the other hand, the efficiency rate may drop if the movement analysis is restricted to movements with higher speed or dynamic response involving the upper limbs [2]. Therefore, the performance evaluation of tennis players should also incorporate the active participation of the upper limbs and high-speed responses during the serve and forehand strokes. Another aspect to consider is the technique mastered by each player, whose biomechanics plays a key role in stroke production [5]. Each stroke has its own fundamental mechanical structure, making it necessary to analyze the practice and training procedures of each player individually. However, biomechanical analysis is quite extensive and strongly depends on two types of variables: kinetic and kinematic [6]. In this regard, the use of motion capture (MoCap) makes kinematic analysis lower in cost than kinetic analysis [1].

In practice, kinematic analysis is of interest for the early assessment of player skills in sports training. Although the kinetic study may look like a more suitable approach because it yields forces and torques for specific biomechanical gestures, it involves estimating the mass of every single body segment, resulting in a very time-consuming and expensive procedure when only MoCap data are employed. As a result, mostly professional athletes have access to these tools because of their price. Kinematic analysis, on the other hand, is mainly performed using linear and angular velocities, together with alignment joints and angles, as the initial input [7]. Yet most approaches do not take into account the parameter drift that describes the progressive development of player performance as training improves. Therefore, there is a growing demand for kinematic analysis with an affordable implementation cost but more flexibility to capture the continuous progress of players.

In this paper, we introduce a data-driven approach to estimating the kinematic performance of tennis players during training, based on kernels that extract a dynamic model of each player from motion capture (MoCap) data. This methodology allows comparing each player against the others, tracking individual progress, and assessing the group-level similarity of the executed motions. This level of interpretation is not achieved with the classical kinematic variables studied in tennis training.

2 Kernel Based Multi-channel Data Representation

Let \({\varvec{X}}\,{{\mathrm{\in }}}\,\mathbb {R}^{N {{\mathrm{\times }}}M}\) be a multi-channel input matrix that holds M channels and N samples, where \({{\varvec{x}}}_i\,{{\mathrm{\in }}}\,\mathbb {R}^{1{{\mathrm{\times }}}M}\) is a row vector containing the information of all the provided channels at a given time instant, with \(i\,{{\mathrm{\in }}}\,\{1,\dots ,N\}\). With the aim of properly modeling each player from MoCap data, we represent the dynamic behavior of their movements through kernel methods using generalized similarities. To this end, a kernel function is associated with a nonlinear mapping \(\varphi :\mathbb {R}^{1{{\mathrm{\times }}}M} \mapsto \mathcal {H}\), where \(\mathcal {H}\) is a Reproducing Kernel Hilbert Space (RKHS). Thus, the kernel-based representation allows dealing with nonlinear structures that cannot be directly estimated by traditional operators, such as the linear correlation function. In this regard, the inner product between two samples \(\left( {{\varvec{x}}}_i,{{\varvec{x}}}_j\right) \) is computed in the RKHS as \(\langle \varphi ({\varvec{x}}_i),\varphi ({\varvec{x}}_j)\rangle _{\mathcal {H}}=\kappa \left( {\varvec{x}}_i,{\varvec{x}}_j\right) \), with \(\kappa \left( \cdot ,\cdot \right) \) a Mercer kernel. Taking advantage of the so-called kernel trick, the kernel function can be computed directly from \({\varvec{X}}\). Here, we rely on the well-known Gaussian kernel defined as follows:

$$\begin{aligned} \kappa \left( {\varvec{x}}_i,{\varvec{x}}_j\right) = {\exp }\left( -{\left\| {{\varvec{x}}_i-{\varvec{x}}_j} \right\| _2^2}/{2\sigma ^2} \right) , \end{aligned}$$
(1)

where \(\sigma \,{{\mathrm{\in }}}\,\mathbb {R}^+\) is the kernel bandwidth and \(\Vert \cdot \Vert _2\) denotes the Euclidean (\(L_2\)) norm.

Therefore, we obtain the similarity matrix \({\varvec{S}}\,{{\mathrm{\in }}}\,\mathbb {R}^{N {{\mathrm{\times }}}N}\) that holds elements \({s}_{i,j}\,{{\mathrm{=}}}\,\kappa \left( {\varvec{x}}_i,{\varvec{x}}_j\right) \) with \({s}_{i,j}\,{{\mathrm{\in }}}\,\mathbb {R}^+\). In this sense, matrix \({\varvec{S}}\) encodes the temporal dynamics of the multi-channel input data.
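As a reference, the following Python sketch computes the similarity matrix \({\varvec{S}}\) of Eq. 1 from a multi-channel matrix. The array X, its size, and the bandwidth value in the toy usage are placeholders and not part of the described method, where \(\sigma \) is tuned as explained in Sect. 4.3.

```python
import numpy as np

def gaussian_similarity(X, sigma):
    """Pairwise Gaussian-kernel similarity matrix S (Eq. 1) for an N x M
    multi-channel matrix X, whose row x_i gathers all channels at instant i."""
    sq_norms = np.sum(X ** 2, axis=1)
    # Squared Euclidean distances between every pair of rows of X.
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * X @ X.T
    sq_dists = np.maximum(sq_dists, 0.0)           # guard against round-off
    return np.exp(-sq_dists / (2.0 * sigma ** 2))  # s_ij = kappa(x_i, x_j)

# Toy usage: N = 200 frames, M = 57 channels, placeholder bandwidth.
X = np.random.randn(200, 57)
S = gaussian_similarity(X, sigma=1.0)              # S is 200 x 200
```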

Due to the Representer Theorem, the nonlinearities of a wide range of problems can be described as a kernel expansion in terms of the data [8]. That is,

$$\begin{aligned} f({\varvec{\xi }})=\sum \nolimits _{i\,{{\mathrm{\in }}}\, N} \alpha _i \kappa ({\varvec{x}}_i,{\varvec{\xi }}),\, {\varvec{\xi }}\in {\varvec{X}} \end{aligned}$$
(2)

where N is the number of samples and \(\alpha _i\,{{\mathrm{\in }}}\,\mathbb {R}^{+}\) is a coefficient estimated by kernel adaptive filtering algorithms, either those solving a least-squares problem (kernel recursive least squares, KRLS) or those based on stochastic gradient descent, such as the kernel least mean squares (KLMS) family [9]. In particular, we employ the Quantized KLMS method (QKLMS), which constructs a dictionary set or codebook, denoted C, through a quantization process in which data points are mapped onto the closest dictionary point. Using QKLMS, the mapping learned before iteration k is as follows:

$$\begin{aligned} f_{k-1}({\varvec{\xi }})=\sum \limits _{i=1}^{size(C_{k-1})} \alpha _{i,k-1} \kappa (C_{i,k-1},{\varvec{\xi }}) \end{aligned}$$
(3)

where \(C_{i,k-1}\) denotes the i-th element (code-vector) of the codebook \(C_{k-1}\) and \(\alpha _{i,k-1}\) is the coefficient of the i-th center.
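A minimal Python sketch of this quantized update is given below, assuming a Gaussian kernel and scalar targets. The class name and interface are illustrative rather than the authors' implementation; the default step size and quantization size mirror the values reported later in Sect. 4.3.

```python
import numpy as np

def gauss(a, b, sigma):
    """Gaussian kernel of Eq. 1 between two vectors."""
    return np.exp(-np.sum((a - b) ** 2) / (2.0 * sigma ** 2))

class QKLMS:
    """Sketch of Quantized KLMS: an online kernel expansion (Eq. 3) whose
    codebook C only grows when the input is farther than eps_u from every
    stored code-vector; otherwise the nearest coefficient is updated."""

    def __init__(self, sigma=1.0, eta=0.81, eps_u=0.95):
        self.sigma, self.eta, self.eps_u = sigma, eta, eps_u
        self.C = []       # codebook (code-vectors)
        self.alpha = []   # expansion coefficients

    def predict(self, u):
        # f_{k-1}(u) = sum_i alpha_{i,k-1} * kappa(C_{i,k-1}, u)
        return sum(a * gauss(c, u, self.sigma) for a, c in zip(self.alpha, self.C))

    def update(self, u, d):
        e = d - self.predict(u)                   # a-priori prediction error
        if not self.C:                            # first sample starts the codebook
            self.C.append(u)
            self.alpha.append(self.eta * e)
            return e
        dists = [np.linalg.norm(u - c) for c in self.C]
        j = int(np.argmin(dists))
        if dists[j] <= self.eps_u:                # quantize onto the nearest center
            self.alpha[j] += self.eta * e
        else:                                     # grow the codebook
            self.C.append(u)
            self.alpha.append(self.eta * e)
        return e
```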

3 Performance Comparison Expanded to RKHS Representation

The main rationale behind the validation is to define a real-valued distance \(\text {d}()\,{{\mathrm{\in }}}\,\mathbb {R}^{+}\) that performs the pairwise comparison of the models estimated for the tested players. Taking into account Eq. 2, we compute the distance between a pair of models, \(f_n\) and \(f_m\), as follows:

$$\begin{aligned} \text {d}_f (f_n,f_m)= \varphi \left( f_n({\varvec{X}}^n), f_m({\varvec{X}}^m)\right) , \end{aligned}$$
(4)

where \({\varvec{X}}^n\) and \({\varvec{X}}^m\) are the multi-channel input matrices for subjects n and m, respectively. Here, we use the following RKHS-related norm:

$$\begin{aligned} \text {d}_f (f_n,f_m)= \left\| f_n - f_m \right\| _{2}^{2}, \end{aligned}$$
(5)

where \(f_n \,{{\mathrm{\in }}}\,\mathcal {H}\) is described by \(\left\{ \varphi _n(\cdot ); \left\{ \alpha ^n_i: \forall i\,{{\mathrm{\in }}}\, size(C^n)\right\} \right\} \). Consequently, the squared Euclidean distance proposed in Eq. 5 evaluates as below:

$$\begin{aligned} \text {d}_f (f_n,f_m) = \langle f_n,f_n\rangle _2 - 2 \langle f_n,f_m\rangle _2+\langle f_m,f_m\rangle _2. \end{aligned}$$
(6)

Because of the codebook in Eq. 3, the expansions run over \(i\,{{\mathrm{\in }}}\,size(C^n)\) and \(j\,{{\mathrm{\in }}}\,size(C^m)\), so that the distance in Eq. 5 can be rewritten as follows:

$$\begin{aligned} \text {d}_f (f_n,f_m) =&\left\langle \sum _{i} \alpha ^n_i \kappa ({\varvec{x}}^n_i,\cdot ), \sum _{i} \alpha ^n_i \kappa ({\varvec{x}}^n_i,\cdot ) \right\rangle _2 -2 \left\langle \sum _{i} \alpha ^n_i \kappa ({\varvec{x}}^n_i,\cdot ), \sum _{j} \alpha ^m_j \kappa ({\varvec{x}}^m_j,\cdot ) \right\rangle _2 \\&+ \left\langle \sum _{j} \alpha ^m_j \kappa ({\varvec{x}}^m_j,\cdot ), \sum _{j} \alpha ^m_j \kappa ({\varvec{x}}^m_j,\cdot )\right\rangle _2 ,\ \text {with } \ {\varvec{x}}^n_i\,{{\mathrm{\in }}}\,{\varvec{X}}^n,\, {\varvec{x}}^m_j\,{{\mathrm{\in }}}\,{\varvec{X}}^m . \end{aligned}$$

Lastly, we obtain the following closed expression for the Euclidean-distance based indicator:

$$\begin{aligned} \text {d}_f (f_n,f_m) =\sum _{i,i'} \alpha ^n_i\alpha ^n_{i'} \kappa ({\varvec{x}}^n_i,{\varvec{x}}^n_{i'})-2 \sum _{i,j} \alpha ^n_i\alpha ^m_j\kappa ({\varvec{x}}^n_i,{\varvec{x}}^m_j)+ \sum _{j,j'} \alpha ^m_j\alpha ^m_{j'} \kappa ({\varvec{x}}^m_j,{\varvec{x}}^m_{j'}). \end{aligned}$$
(7)

Thus, the obtained indicator encodes the pairwise relationships between all considered models as a functional measure in an RKHS representation.
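For illustration, a direct implementation of Eq. 7 is sketched below, assuming each model is summarized by its QKLMS coefficients and codebook (stored as rows of an array); the argument names and the toy data are placeholders.

```python
import numpy as np

def model_distance(alpha_n, C_n, alpha_m, C_m, sigma):
    """Closed-form squared RKHS distance between two kernel expansions (Eq. 7)."""
    def gram(A, B):
        # kappa(a_i, b_j) for every pair of code-vectors (Gaussian kernel, Eq. 1)
        sq = np.sum(A ** 2, 1)[:, None] + np.sum(B ** 2, 1)[None, :] - 2.0 * A @ B.T
        return np.exp(-np.maximum(sq, 0.0) / (2.0 * sigma ** 2))

    a_n, a_m = np.asarray(alpha_n), np.asarray(alpha_m)
    return (a_n @ gram(C_n, C_n) @ a_n
            - 2.0 * a_n @ gram(C_n, C_m) @ a_m
            + a_m @ gram(C_m, C_m) @ a_m)

# Toy usage: two hypothetical codebooks of 5 and 7 code-vectors (M = 57 channels).
d_nm = model_distance(np.ones(5), np.random.randn(5, 57),
                      np.ones(7), np.random.randn(7, 57), sigma=1.0)
```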

4 Experimental Setup and Results

4.1 Database

The data were collected from 16 players: four labeled as high performance (HP) and 12 as regular (RP). All participants belonged to the Caldas-Colombia tennis league and explicitly volunteered to participate in the study, which was approved by the Ethics Committee of the Universidad de Caldas. The groups had the following anthropometric parameters (HP vs. RP): age \(21\pm 2.7\) vs. \(16.7\pm 3.9\) years, mass \(64\pm 14.9\) vs. \(61.6\pm 9\) kg, and height \(168.8\pm 8.4\) vs. \(170.8 \pm 7.7\) cm. The motion capture protocol employed was Biovision Hierarchy (BVH), which stipulates the placement of 34 markers for collecting body-joint information. Infrared videography was acquired at 100 Hz with six Optitrack Flex V100 cameras covering the sagittal, frontal, and lateral planes. All subjects were encouraged to hit the ball with the same velocity and action as they would in a match. They were instructed to hit one series of five serve strokes followed by 12 forehand strokes.

Table 1. Tennis serve and forehand stroke kinematics (mean ± SD) of high performance and regular players. \(^\dag \) High performance group tends to be different from regular: hip alignment (\(p<0.01\)) and elbow velocity (\(p<0.03\))

4.2 Baseline Kinematic Analysis

To derive representative and accurate kinematics from the video-recorded hits, a total of four serves and six forehand strokes per subject were manually selected for kinematic analysis. The kinematic variables of interest for the serve and forehand were: (a) maximum angular displacement of shoulder alignment, hip alignment, and wrist extension; (b) maximum velocity of the hip, shoulder, elbow, and wrist; and (c) maximum angular velocity of elbow extension, trunk rotation, and pelvis rotation [7].

As seen in Table 1, a one-way analysis of variance (high performance vs. regular) estimated for each swing detects some statistical differences in the selected kinematic variables. At a significance level of \(p<0.05\), no group differences are found for the kinematic variables of the serve. In the forehand stroke, by contrast, hip alignment (\(p<0.01\)) and elbow velocity (\(p<0.03\)) differ significantly between high-performance and regular players; these findings are consistent with related works in this area [5]. Nevertheless, the selected variables do not clearly distinguish the groups. Indeed, no cluster corresponding to the considered groups of players can be identified when these ten variables are embedded into a 2-D representation using Kernel PCA, as shown in Fig. 1(a) and (b). The comparison between players using these characteristics does not reveal a compact group of high-performance players toward which regular players could converge in performance. The silhouette index is used to measure the cohesion and separation between groups [10]: a group index near 1 indicates that the group is well clustered, an index near \(-1\) indicates that it is not, and an index close to 0 indicates that a point lies on the border of two natural clusters.

Fig. 1. 2-D embedding of the 16 players using kinematic characteristics. HP means high performance (blue) and RP means regular performance (red). Gaussian densities of each group are shown. (Color figure online)
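For reference, a minimal scikit-learn sketch of this embedding-and-cohesion analysis is shown below; the feature matrix F and the group labels are random placeholders, not the study data, and the RBF kernel parameters are left at their defaults.

```python
import numpy as np
from sklearn.decomposition import KernelPCA
from sklearn.metrics import silhouette_score

# Placeholder data: 16 players x 10 kinematic variables, 4 HP and 12 RP labels.
F = np.random.randn(16, 10)
labels = np.array([1] * 4 + [0] * 12)   # 1 = HP, 0 = RP

# 2-D Kernel PCA embedding and group cohesion via the silhouette index.
embedding = KernelPCA(n_components=2, kernel="rbf").fit_transform(F)
print("silhouette index:", silhouette_score(embedding, labels))
```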

4.3 Kernel-Based Performance Analysis

We compute each model \(f_n(\cdot )\) from pairwise kernel-based similarities between samples \({\varvec{x}}_i\) and \({\varvec{x}}_j\). The Gaussian kernel defined in Eq. 1 estimates each pairwise sample relationship, mapping from the original feature space to an RKHS representation. The kernel bandwidth \(\sigma \) is selected by maximizing the information potential variability, which yields better results on this kind of data than Silverman's rule. The QKLMS parameters are set as follows: quantization size \(\epsilon _U\,{{\mathrm{=}}}\, 0.95\), step size \(\eta \,{{\mathrm{=}}}\,0.81\), and window size \(\omega _s\,{{\mathrm{=}}}\,30\); the initial codebooks are built directly from the input time series \({\varvec{X}} \,{{\mathrm{\in }}}\, \mathbb {R}^{N {{\mathrm{\times }}}M}\), with \(M\,{{\mathrm{=}}}\,57\) corresponding to the x, y, and z positions of the 19 joints estimated from the .bvh files. Each model is validated on a simple task: predicting the \(z(t+1)\) coordinate from \(x(t)\) and \(y(t)\). The mean prediction error is \(3.29\pm 4.84\) for the serve models and \(2.56\pm 3.90\) for the forehand models. Although the variability is high, the mean error remains low and consistent, which is remarkable given the high variability of both the MoCap data and the kinematic variables. Moreover, our approach works with the full-length videos of the two series (five serve strokes and 12 forehand strokes per player); segmentation and selection of actions are not required. A toy illustration of this validation task is sketched below.
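The sketch reuses the QKLMS class from Sect. 2 on a synthetic single-joint trajectory; in the actual setting the inputs come from the 57 BVH channels, and the printed error is not one of the reported figures.

```python
import numpy as np

# Synthetic (x, y, z) trajectory of one joint over N frames, standing in
# for the real .bvh channels.
N = 500
traj = np.cumsum(0.01 * np.random.randn(N, 3), axis=0)

model = QKLMS(sigma=1.0, eta=0.81, eps_u=0.95)   # sketch from Sect. 2
errors = []
for t in range(N - 1):
    u = traj[t, :2]                   # x(t), y(t)
    d = traj[t + 1, 2]                # target z(t+1)
    errors.append(model.update(u, d))

print("mean absolute prediction error:", float(np.mean(np.abs(errors))))
```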

Our proposed functional distance \(\text {d}()\) allows the models to be evaluated against each other and between groups: the two groups become well established, and the differences in performance between players are highlighted by the proposed metric. Embedding the QKLMS coefficients of each model \(f_n\) in 2-D with KPCA yields a better clustering of the high-performance players. In fact, a clear cluster of high-performance players emerges for the serve action, according to its silhouette index (0.92), and the other clusters also exhibit some improvement.

Fig. 2. 2-D embedding of the 16 players using the QKLMS coefficients of the model \(f_n\) of each player. HP means high performance (blue) and RP means regular performance (red). Gaussian densities of each group are shown. (Color figure online)

5 Discussion and Concluding Remarks

In the kinematic performance analysis of tennis players, the following steps are usually accomplished: estimation of the time and position of ball contact, delimitation of the segments ranging from the starting position to the recovery position, and extraction of kinematic parameters for each segment. It is worth noting that segmentation strongly influences the accuracy of the resulting classifier, and manual segmentation is often carried out even though MoCap data are widely used. Therefore, efforts to avoid the segmentation stage will promote the automation of supporting tools for improving player skills.

Since there is strong inter-subject variability that must be taken into consideration, we focus on developing an accurate dynamic model for each individual, which allows discriminating each player from the others much better. Thus, Fig. 2(a) shows an improved grouping of the serve action for the players of the high-performance group, and Fig. 2(b) shows a similar improvement for the forehand action. We also consider the progressive performance index to assess the evolution of performance over time. As a result, we find that the modeling performance for fast actions increases when the dynamic information of the upper limbs is involved. Although upper-limb actions can be measured more precisely with capture techniques such as inertial wireless wearable sensors, the proposed methodology extracts from MoCap data a suitable training model that handles actions involving upper-limb movement well. In particular, the developed dynamic model takes advantage of the trajectories of the upper limbs, learning the nonlinear dynamics of the MoCap time series accurately enough that no segmentation of actions or selection of specific joint markers is required.

In this regard, the kernel-based training is a data-driven approach that effectively encodes the stochastic behavior of the MoCap trajectories. Here, we use the QKLMS model, which captures the temporal relationships among channels even for actions with large changes between adjacent frames. Consequently, the dynamic model obtained by the kernel-based methodology, together with the incorporated similarity indicator, allows a more accurate evaluation of the kinematic performance reached by a subject on the tennis court, with lower implementation complexity. Moreover, the kernel-based analysis represents the performance reached by each player, an aspect that is crucial for individualizing the training of skills.

In the future, the authors plan to incorporate more complex dynamics, that is, to introduce more elaborate similarity measures. The inclusion of trainers' information (such as labeled segments) is also to be considered.