Introduction

Over the years, sport has evolved from a mere recreational activity into a serious profession that fuels a 400-billion-dollar global industry. Today, a majority of the sports industry has ventured into sports analytics based on numerical data, which helps in better understanding the game. In team games such as association football, playing positions are influenced by the individual’s physical and fitness composition [1, 2], which is measured by anthropometry and motor fitness. Football, being a physically demanding sport, also improves muscle strength and muscle tone and engages the bones and joints, the muscles of the lower leg, knee extensions and hip flexions [3]. It also relies on responsive aspects such as balance, reflexes, speed and endurance. The most important requirement in football, however, is active communication and teamwork [4]. A football team has 11 players: one Goal Keeper (GK) and ten outfield players, whose positions can be broadly classified into defenders (D), midfielders (MF) and forwards (F) [5]. Each of these positions requires a certain set of unique skills. However, these skills overlap, and it is very difficult to ascertain the ideal playing position for a given player from any single assessment [6, 7]. Such diversity in skill sets necessitates novel approaches to determining a suitable playing position. One such approach is to apply machine learning models to critical player parameters so as to arrive at the playing position that would elicit the best performance from a given player during the game [8, 9]. The features for such approaches are often those which reflect the dominant skills and abilities. In the present context, anthropometric and motor fitness parameters have been deemed suitable for a predictive analysis [10].
Football player substitutions are crucial when a team is trailing or trying to hold onto a lead, since they can improve the team’s performance. However, replacing players based only on their past performance does not necessarily help the team make wise choices. A review of numerous research studies found that a player’s mentality should be competitive and stable, both of which are important during the game. Thus, the framework proposed in [11] consists of two models: Natural Language Processing (Sentiment Analysis) and Survival Analysis (Kaplan-Meier Fitter).

Association football is a dynamic, intricate sport in which many things happen at once throughout a game. It’s difficult to analyze football videos because you have to find subtle, varied spatiotemporal patterns. The performance of existing algorithms in detecting these patterns is reduced when learning from limited annotated data, despite recent breakthroughs in computer vision [12].

The rise of sports analytics and the availability of rich and complex match data have made sports prediction a more attractive area of research. Football is the most popular sport in the world, with around 3.5 billion fans, and is played in over 180 nations. The majority of fans are able to forecast who will win a match. Based on variables such as home stadium, team form, squad strength and win percentage, anyone can make predictions. Prediction is a useful tool for club personnel when making decisions about player management and training. It also helps teams prepare for upcoming games by analyzing the performance of rival teams [13].

Pitch analysis, moment recognition (pass, shot, assist, and free kick), multiple cameras positioned throughout the stadium, dynamic background, ball localization throughout the entire field, quick movements, and mutual overlapping between team members are some of the major challenges and difficulties that are examined in the football game [14].

There are several strategic and tactical difficulties in a highly contested and physical football game. As such, it is imperative that players are aware of the methods and tactics employed by their opponents. However, because of the intricacy of the match, the opponents’ objectives are frequently subject to change [15].

Football has always attracted a lot of attention and study because it is a very popular sport worldwide. The Internet of Things (IoT) and Deep Learning (DL), two cutting-edge technologies, are showing rising signs of tremendous application potential in sports as a result of technological advancements and applications [16].

Motivation

Sports analytics has grown in significance in recent years. Football players can now record their position and motion data during a game thanks to newly designed wearable tracking devices. Teams as well as individual players can improve their performance by utilizing these statistics. This article presents a machine learning based assessment of elite football players, using anthropometric and motor fitness parameters, with regard to their playing positions.

Materials and Methods

Subjects

The study was conducted on 165 male football players across teams participating in the All India South Zone Inter University Football Tournament for men, held in Kerala, India. The players had a mean age of 21.12 ± 1.52 years. As all participants were adults, the study was explained to them prior to data collection, and their consent was obtained on documents stating the same. The players were grouped into 4 categories according to their actual playing positions in this tournament, namely Forward (F), Defence (D), Midfielder (MF) and Goal Keeper (GK), with a count of 48 F, 58 D, 41 MF and 18 GK.

Test Battery

The test battery encompassed Generic Parameters (GP), Anthropometric Parameters (AP), Motor Fitness Parameters (MFP) and Physiological Parameters (PP), adding up to a total of 22 features [17]. The GP included the age, team code and the actual position of play. The AP acquired were Height (HT) (cm), Weight (WT) (kg), BMI (kg/m2), BMR (J/s), Fat (%), Thigh Circumference (TC) (cm), Calf Circumference (CC) (cm), Arm Length (AL) (cm), Leg Length (LL) (cm), Elbow Circumference (EC) (cm) and Knee Circumference (KC) (cm). The MFP included Explosive Power (EP) (cm), Relative Explosive Power (REP) (index), Sit and Reach (SR) (cm), 40 m sprint time (40 m) (s), 80 m sprint time (80 m) (s), 120 m sprint time (120 m) (s), Ruler Drop Test (RD) (cm) and T-Test (TT) (s). The PP measured were Blood Pressure (BP) (index) and pulse (PRA). Before measuring and testing began, the entire test battery was demonstrated by trained sportsmen so that the subjects understood it better.

Data Acquisition

After the demonstration, informed consent was obtained from the subjects and the GP were collected in the presence of their respective coaches for any clarifications. The equipment was set up to first measure the candidate’s AP, followed by the MFP and then the PP. The equipment used for the AP tests comprised fixed linear scales for HT and an Omron body composition analyser for WT, BMI and BMR [18]. The other AP were obtained using flexible measuring tapes. The MFP measuring devices included a fixed linear scale for EP and SR. Stopwatches were used on marked tracks for 120 m, 80 m, 40 m and TT, and a wooden ruler was used to acquire the RD [19, 20]. The REP, being the explosive power with respect to the height of the subject, was calculated from Eq. 1. The PP were measured using an Omron digital BP meter.

$$REP=\frac{EP}{HT} \tag{1}$$

Data Handling

The most important aspect of modelling is the preparation of the data for model development. This was accomplished by classifying the parameters into categorical and continuous data. Class imbalance was addressed by SMOTE oversampling, and differences in feature magnitude by data standardization. For parameter classification, the columns used only to identify each row, such as Serial Number, Player Name, University Name and Team Code, were ignored. The remaining parameters were then classified as either categorical or continuous. For instance, age and position of play were considered categorical variables, while the remaining AP and MFP were regarded as continuous. This data was then processed independently, with oversampling for one model and data standardization for another.
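
The parameter classification described above can be sketched as follows. This is a minimal illustration rather than the study's actual code: the column labels `Age`, `Position` and the feature abbreviations are assumed names, since the dataset headers are not given in the text.

```python
# Identifier columns named in the text; they are ignored during modelling.
ID_COLUMNS = {"Serial Number", "Player Name", "University Name", "Team Code"}
# Assumed labels for the two variables the text treats as categorical.
CATEGORICAL = {"Age", "Position"}


def split_columns(columns):
    """Drop identifier columns, then split the rest into categorical and continuous."""
    kept = [c for c in columns if c not in ID_COLUMNS]
    categorical = [c for c in kept if c in CATEGORICAL]
    continuous = [c for c in kept if c not in CATEGORICAL]
    return categorical, continuous


# Hypothetical header row with a small subset of the AP and MFP features.
columns = ["Serial Number", "Player Name", "University Name", "Team Code",
           "Age", "Position", "HT", "WT", "BMI", "EP", "TT"]
categorical, continuous = split_columns(columns)
print(categorical)
print(continuous)
```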

Oversampling is a technique used to reduce class imbalance in a dataset. In the present case, the class distribution is shown in Fig. 1. Most classification models are biased towards the majority class and tend to predict it better than the minority classes [21, 22]. Thus, the Synthetic Minority Oversampling Technique (SMOTE), which generates synthetic data points for the minority class using K-Nearest Neighbours, was used to perform multiclass oversampling on the training data and avoid class imbalance. The result is shown in Fig. 2.

Fig. 1 Initial class distribution of the dataset

Fig. 2 Class distribution of the training dataset after SMOTE
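
The interpolation idea behind SMOTE can be illustrated with a short standard-library sketch. It is not the implementation used in the study: the function name `smote`, the choice of `k`, and the toy goalkeeper samples are assumptions made for illustration, and the sketch expects at least two minority samples.

```python
import math
import random


def smote(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points by interpolating between a minority
    sample and one of its k nearest minority-class neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest neighbours of `base` within the minority class, excluding itself
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


# Made-up minority samples: (height in cm, 40 m sprint time in s).
gk = [[172.0, 5.6], [175.5, 5.4], [170.2, 5.8], [173.1, 5.5]]
new_points = smote(gk, n_new=6)
print(len(new_points))
```

Because each synthetic point lies on the segment between two real minority samples, the new points stay inside the region already occupied by that class.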

The majority of the features were continuous, and the AP showed a skewed distribution. This skew reduced the influence of low-frequency, low-magnitude data points that could be equally significant, so the data had to be transformed before running the models for prediction. Standardization was performed as part of this data transformation. It rescaled each feature to a mean of 0 and a variance of 1, reducing the bias caused by differences in magnitude between features while retaining outliers in the data.
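
As a minimal sketch of the standardization step, each value is replaced by its z-score, z = (x - mean) / standard deviation. The population standard deviation is assumed here, and the heights are illustrative values only.

```python
import statistics


def standardize(values):
    """Rescale values to zero mean and unit variance (z-scores)."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)  # population standard deviation (assumed)
    return [(v - mean) / std for v in values]


# Illustrative heights in cm.
heights_cm = [172.2, 170.2, 168.9, 167.7]
z = standardize(heights_cm)
print([round(v, 3) for v in z])
```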

Classification

The prediction of a suitable playing position often depends on multiple parameters [23]. Statistical modelling is of limited use where multidimensional analysis shows an overlap in the characteristic requirements of each target class. Machine learning models are often preferred in cases with multiple features, such as the present one, in part to limit overfitting [24, 25]. Here, SVM and XGBoost [26] were used for multi-class classification. SVM was chosen because it works well in high-dimensional spaces, remains effective when there are more dimensions than samples, and uses only a subset of the training points (the support vectors) in its decision function. XGBoost is a scalable and highly accurate gradient boosting implementation that pushes the computational limits of boosted tree algorithms, primarily to improve computational speed and model performance.

SVM depends on hyperparameters that must be defined before the learning process begins [19, 27]. Because these hyperparameters are not adjusted automatically during training, they need to be tuned to the requirements of the data beforehand. The best hyperparameter fit elicits the best model performance and is usually found by a trial-and-error approach with the aid of algorithms such as grid search and Bayesian optimization. XGBoost, in contrast, relies on internal model parameters that are learnt from the data, and was therefore used without a separate tuning step. The entire workflow of the present analysis is depicted in Fig. 3, and the proposed work is validated using various test strategies [28, 29] and methodologies [30, 31].

Fig. 3 Predictive analysis workflow
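
Grid search, as mentioned above for SVM tuning, evaluates every combination of candidate hyperparameter values and keeps the best-scoring one. The sketch below shows only that search loop. The function `toy_score` is a hypothetical stand-in for a cross-validated score of an SVM, and the candidate values for `C` and `gamma` are assumed for illustration.

```python
import math
from itertools import product


def grid_search(param_grid, evaluate):
    """Return (best_params, best_score) over the Cartesian product of the grid."""
    names = list(param_grid)
    best_params, best_score = None, float("-inf")
    for combo in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, combo))
        score = evaluate(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


def toy_score(params):
    # Hypothetical placeholder for a cross-validated model score.
    # It is constructed to peak at C=0.001, gamma=0.01 purely for illustration.
    return -((math.log10(params["C"]) + 3) ** 2
             + (math.log10(params["gamma"]) + 2) ** 2)


# Assumed candidate values; the grid used in the study is not stated.
param_grid = {"C": [0.001, 0.01, 0.1, 1.0], "gamma": [0.001, 0.01, 0.1]}
best_params, best_score = grid_search(param_grid, toy_score)
print(best_params)
```

In practice, `evaluate` would train the classifier with the given parameters and return a validation metric such as the weighted f1-score.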

Results and Discussions

Descriptive Analysis of AP and MFP

The statistical analysis of the AP and MFP of the football players is tabulated in Tables 1 and 2. The AP analysed consisted of 11 parameters, namely HT, WT, BMI, BMR, FAT, TC, CC, AL, LL, EC and KC. The MFP consisted of 7 parameters namely EP, REP, SR, TT, 40 m dash, 80 m dash, 120 m dash and RD. The 2 physiological parameters considered were BP and pulse, represented as SBP, DBP and PULSE in the dataset. Overall, 20 parameters were used to predict the target, of which certain AP and MFP features showed position indicative results.

Table 1 Distribution of AP
Table 2 Distribution of MFP

Positional Exploratory Data Analysis

The exploratory data analysis of the AP showed that HT, BMI, BMR, FAT, TC and CC elicited position indicative results, reflecting the characteristics of each position. Anthropometric patterns were indicative of the physical development of certain traits as a result of a playing style or a positional requirement.

HT and BMR were found to have a moderate correlation of +0.733 and showed similar patterns. Both were highest for GK (172.17 ± 4.73 cm; 1676.89 ± 94.00 J/s), followed by D (170.24 ± 4.99 cm; 1653.89 ± 100.67 J/s), F (168.92 ± 5.79 cm; 1634.45 ± 85.21 J/s) and finally MF (167.72 ± 4.50 cm; 1617.95 ± 88.62 J/s), as observed in Fig. 4. This may be attributed to the positional requirement for GKs to be taller and broader. BMR is affected by height, weight, body surface area and similar factors, which is reflected in these results.

Fig. 4 Position-based comparison of AP: HT and BMR

The mean BMI and FAT were also found to be position indicative, being lowest for GK and highest for F. The distribution of these two metrics was the opposite of that of HT and BMR, as can be observed by comparing Figs. 4 and 5. This may be attributed to fat and metabolic rate being inversely related. Furthermore, BMI is a product of several factors, of which FAT and HT are two. As seen from their correlations, BMI was impacted positively by FAT (0.226) and negatively by HT (-0.160).

Fig. 5 Position-based comparison of AP: BMI and FAT

Thigh Circumference and Calf Circumference were position indicative, with a moderate correlation of +0.551, as seen in Fig. 6. It was also evident that the defenders had greater strength in terms of thigh and calf muscles.

Fig. 6 Position-based comparison of AP: CC and TC

The MFP were successful in reflecting the athleticism of each position, which is explainable by the on-field requirements of that position. Of the 7 MFP considered, 4 were found to be statistically significant features, while 6 features provided positional results.

Figure 7 highlights the mean 120 m sprint timing as well as the mean explosive power of players based on their position. Explosive power was taken as the vertical height a player jumped to, indicated by the red line and the secondary axis. As expected of goalkeepers, it was found to be the highest for GK, followed by D, then MF and finally F. The 120 m sprint timing was found to be the lowest for GK. This may be attributed to their quick reactions or to their height, as GK had the highest mean HT values. It may also be a result of the class imbalance of GK. GK was followed closely by D, then F and finally MF.

Fig. 7 Position-based comparison of MFP: 120 m and explosive power

Figure 8 highlights the mean 80 m sprint timing as well as the mean 40 m sprint timing of players based on their position. F were observed to have the fastest 80 m and 40 m sprints, while D had the slowest 80 m and 40 m sprints. This could be because forwards are required to play at high speed and high intensity in a small area of the field. MF cover the most ground on the field in a match and are hence long-distance runners rather than sprinters. This could be observed in their consistent pace over 40 m and 80 m.

Fig. 8 Position-based comparison of MFP: 80 m and 40 m sprint timings

Figure 9 highlights the mean RD (cm) as well as the mean T-Test time (s) of players based on their position. F were found to have the fastest reaction time, followed by D, GK and MF. This was measured by the vertical Ruler Drop test with a wooden ruler. The T-Test indicates the agility of a player in shifting directions, its time being the lowest for F, followed by MF, then GK and lastly D.

Fig. 9 Position-based comparison of MFP: RD and T-Test

Statistical Hypothesis Testing

The dataset consisted of a total of 23 features influencing the target. The study started with the null hypothesis (H0) that each of these features affects the position of play of a football player. To test this, Spearman’s rank correlation test was performed to identify the correlation of each variable with the target, and the p-value was checked for statistical significance at the 5% level. In this study, 4 features, namely 40 m, 80 m, 120 m and TT, were found to be statistically significant, accepting H0, while the rest of the features rejected the null hypothesis, as shown in Fig. 10.

Fig. 10 Hypothesis testing and correlation between statistically significant features and target variable
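
Spearman’s rank correlation is the Pearson correlation computed on the ranks of the two variables, with tied values given the mean of their ranks. The sketch below computes only the coefficient and omits the p-value used in the significance test. The sprint times and the integer encoding of the playing position are made-up assumptions, since the encoding used in the study is not described.

```python
def average_ranks(values):
    """Rank values from 1 to n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman's rho: Pearson correlation of the rank-transformed data."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# Made-up 40 m sprint times (s) and an assumed integer encoding of position.
sprint_40m = [5.1, 5.3, 5.6, 5.9, 6.0, 5.4]
position_code = [0, 0, 1, 2, 2, 1]
rho = spearman_rho(sprint_40m, position_code)
print(round(rho, 3))
```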

Modelling

This study used two models, namely SVM and XGBoost. SVM was run on the data multiple times with different pre-processing techniques to improve model performance. A model’s performance may be measured not just by its accuracy but also by the consistency between its accuracy and weighted f1-score. As seen in Figs. 11 and 12, the SVM classifier with just the categorised data as input had an accuracy of 34% and an f1-score of 0.18. With a difference of about 50% between the two, this model was rejected. The training data was then oversampled, giving an accuracy of 40% and an f1-score of 0.29. Although the difference had reduced, this model did not suffice either. The model parameters C and gamma were then tuned by grid search to the values 0.001 and 0.01 respectively, with a polynomial kernel for best performance, upon which the accuracy and f1-score were 48% and 0.49 respectively. Finally, the data was standardized to reduce the impact of magnitude on the model, giving the best SVM result, with an accuracy and f1-score of 52% and 0.52 respectively. The best overall performance was obtained by XGBoost, which uses a regularized model formulation to prevent overfitting, with an accuracy and f1-score of 90% and 0.90 respectively.

Fig. 11 Prediction accuracy of the different models implemented

Fig. 12 f1-score of the different models implemented
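
The gap between accuracy and weighted f1-score discussed above can be reproduced on a toy example. In this sketch, a classifier that always predicts the majority class reaches a moderate accuracy but a much lower weighted f1-score. The labels and counts are made up and are not taken from the study's data.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def weighted_f1(y_true, y_pred):
    """Per-class f1-score averaged with weights equal to each class's support."""
    support = Counter(y_true)
    total = 0.0
    for label, count in support.items():
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = count - tp
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        total += f1 * count
    return total / len(y_true)


# Made-up imbalanced labels; the classifier always predicts the majority class D.
y_true = ["D"] * 6 + ["F"] * 3 + ["GK"] * 1
y_pred = ["D"] * 10
acc = accuracy(y_true, y_pred)
f1w = weighted_f1(y_true, y_pred)
print(acc, f1w)
```

Here the accuracy is 0.6 while the weighted f1-score is 0.45, because the minority classes contribute an f1-score of zero.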

Over the past few years, XGBoost has become increasingly popular after helping teams and individuals win almost every Kaggle structured data tournament. Companies and researchers submit data to these competitions, where statisticians and data miners compete to create the best models that predict and describe the data.

Conclusions and Future Scope

Positional Analysis

In the present work, for F, D, MF and GK, 6 AP and 4 MFP were found to be specific for each of the playing positions in football and could be used as metrics to predict the most suited playing position for a person.

Goalkeeper

From the results, the goalkeepers were found to have the highest basal metabolic rate and height but the lowest fat percentage and body mass index. They also had the least developed thigh and calf muscles compared to the other positions. Among the MFP, goalkeepers had explosive power and 120 m sprint times comparable with those of the defenders and also covered a very small distance in the field, as depicted by their 120 m sprint timings. Goalkeepers were also found to have the best overall reaction and agility, as depicted by the T-Test and RD distance.

Defender

From the above analysis, defenders were found to have good height and basal metabolic rate, though not higher than those of goalkeepers. Defenders had the most developed thigh and calf muscles. Among the MFP, defenders had explosive power and 120 m sprint times comparable with those of the goalkeepers and also covered a very small distance in the field, as depicted by their 120 m sprint timings. Defenders covered the area of the field near the goalpost, which was depicted by their 40 m and 80 m sprint timings. Defenders also had the highest agility in movement.

Forwards

Forwards were found to have a relatively lower basal metabolic rate and height compared to GK and D, but higher than MF. They also had the highest body fat percentage and BMI. With relatively low CC and TC, they stood just above GK in terms of these AP. Among the MFP, they were found to have relatively slower 120 m sprints, which could be attributed to their explosive power being the lowest. With the lowest timings for both the 80 m and 40 m sprints, they were found to be faster than the other positions over short sprints. The forwards were found to have the highest agility, depicted by their T-Test scores, and the shortest ruler drop distance, indicating fast reflexes.

Midfielders

Upon analysis of the measured AP, midfielders were found to have the lowest basal metabolic rate and height, while their average body fat percentage was second only to that of forwards and their average BMI was greater than that of GK. They were found to have relatively lean thighs and calves, possibly owing to the large areas they cover on the field. Among the MFP, they were found to have the slowest 120 m sprint timing and the highest explosive power. They were found to be relatively faster at the 80 m sprints than at the 40 m sprints, with respect to the other positions. With the largest distance in the ruler drop test but relatively better T-Test scores than the other positions, the midfielders were found to be quick overall with their reflexes.

Prediction of Ideal Playing Position

6 AP and 4 MFP were found to be specific for each of the playing positions in football and were used as metrics to predict the most suited playing position for a player.

Spearman’s rank correlation test was successful in validating 4 of the 23 hypotheses structured in this study, namely 120 m sprint timing, 80 m sprint timing, 40 m sprint timing and T-Test timing, at a confidence level of 95% (p < 0.05). The models predicted the position of play of a player based on their physiological parameters (BP and heart rate), AP and MFP. SVM was built in stages: applied directly to the data, it gave 34% accuracy and a 0.18 f1-score, which improved to 40% and 0.29 respectively upon oversampling the training data to remove multi-class imbalance. The performance improved further with hyperparameter tuning, giving 48% accuracy and a 0.49 f1-score. Finally, data standardization gave the best SVM output of 52% accuracy and a 0.52 f1-score. XGBoost gave the best output of 90% accuracy and a consistent f1-score of 0.90. This gradient boosting classifier did not require any pre-processing. Furthermore, it minimized overfitting by using model parameters that were learnt from the data with no pre-assigned values.

Although the results were satisfactory regarding prediction of the playing position of elite players using AP and MFP, the study can be further extended to include more data points for GK. Furthermore, the test battery can be extended to include psychoacoustic tests that measure players’ interpersonal interaction and reaction to sound. It would also be valuable to perform a similarly comprehensive positional analysis for players of other sports.