1 Introduction

Depression, according to the World Health Organization (WHO)Footnote 1, differs from common mood swings and short-lived emotional responses to everyday challenges. The WHO identifies it as the worldwide leading cause of illness and disability among adolescents, and it can become a serious health condition when it is long-lasting and of moderate or high intensity. As a disorder, depression afflicts about 300 million people, an increase of over 18% between 2005 and 2015. When untreated, it causes great suffering to the individual, interferes with professional performance, education, and relationships, and, in the worst cases, may lead to suicide. Every year about 800 thousand people die by suicide, the second leading cause of death among young people between 15 and 29 years of age [1]. Some studies [21] indicate that half of the people who suffer from mental disorders show their first symptoms before the age of 15. The American Psychiatric Association (APA)Footnote 2 states that one in every six people will experience depressive episodes during their lives, which suggests over a billion potential victims worldwide. Caring for children and adolescents with mental health issues is therefore important to prevent death and suffering throughout their lives.

A precise diagnosis is fundamental before administering psychological and pharmacological treatments for depression, which motivates research on the diagnosis of depression as well as on its treatment. Several studies have been conducted to aid depression diagnosis, and among them some promising results came from applying machine learning techniques. The literature reports satisfactory results from combining machine learning and pattern recognition techniques to characterise diseases, and these approaches have been particularly effective for mental health conditions such as depression [17].

The Support Vector Machines (SVM) classifier has shown superior performance and generalisation capabilities when compared to other classification techniques in various applications, including depression diagnosis [6]. In preliminary experiments, we trained different classification algorithms on the dataset used in this study, and the SVM outperformed all others, namely the C4.5 decision tree [18], CART [4], a Multilayer Perceptron neural network, and Random Forests [5] (results not shown). The SVM is considered a black box method, as it largely conceals its internal logic from the user and creates models that are difficult to interpret. Because of that limitation, and because interpreting the generated model was especially important in this application, we employed the SHAP framework [14] to help explain the output of the SVM classifier and characterise the diagnosed individuals using a feature importance measure. The classifier was trained on a dataset containing data from 377 patients between 10 and 16 years of age, obtained through a partnership with the Cognition and Behavioural Psychology Postgraduate Research Program of a University.

2 Background

2.1 Black Box Models

As machine learning algorithms become more complex and precise, they often become less comprehensible and generate models that are harder to interpret. A model is said to be a black box if its internal structure is unknown or hard to interpret, making its classifications difficult to explain. The behaviour of a black box model can be described as follows: given an input, the model computes an output based on an internal function, without providing any explanation of how it reached its result.

Although black box models usually have superior generalisation capabilities compared to other classifiers, the non-intuitive solutions they provide can become an obstacle to their practical use, especially when explaining how a classification was made is vital to the project.

Lately, black box classifiers such as the SVM and Artificial Neural Networks have been achieving good results in several applications, but the low interpretability of their models hinders their applicability for cases where the classification process needs to be understood, such as medical applications. In [8], the authors argue that even a limited explanation can positively influence the likelihood of these methods being applied in such cases.

Some rule induction techniques, such as Quinlan's C4.5 decision trees [18], build highly interpretable models, but tend to lose predictive performance in doing so, being outperformed by more complex classifiers such as the SVM. Therefore, efforts have been made to extract rules that help explain black box models, in order to retain their superior performance while gaining some interpretability. The ultimate goal is to associate the performance of black box models with a transparent and easily interpretable model that can, for example, take the form of a decision tree or a rule set [10].

A black box model can be explained from either a global or a local perspective. A global explanation considers the internal functioning of the whole model [11], whereas a local explanation aims to elucidate the reasoning behind a single prediction. The rule extraction algorithms used for this purpose can be classified as "pedagogical" or "decomposition" approaches. The pedagogical approach extracts rules directly from the inputs and outputs of a classifier: it uses the trained model as an oracle to produce a set of input-output examples and then applies pattern search strategies to construct its model (a decision tree, for instance), as sketched below. The decomposition approach is interwoven with the internal structure of the SVM and its hyperplanes, aiming to explain the individual computation of the internal components of the model [10].
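As a rough illustration of the pedagogical approach, the sketch below trains a black box model on synthetic data, queries it as an oracle, and fits an interpretable decision tree surrogate to its answers. All names and data are illustrative and not taken from our study.

```python
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier, export_text
from sklearn.datasets import make_classification

# Synthetic data standing in for a real training set.
X_train, y_train = make_classification(n_samples=200, n_features=5, random_state=0)

black_box = SVC(kernel="rbf").fit(X_train, y_train)   # opaque model
oracle_labels = black_box.predict(X_train)            # query the oracle

# Fit a shallow, interpretable surrogate to mimic the oracle's behaviour.
surrogate = DecisionTreeClassifier(max_depth=3).fit(X_train, oracle_labels)
print(export_text(surrogate))                          # human-readable rules
```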

2.2 Support Vector Machines

The SVM [7] classifier is based on Statistical Learning Theory. It constructs a hyperplane as a decision surface, maximising the separation between classes. Several hyperplanes can separate the instances, each of them defining a separation margin; the points located at the limits of the margin are the support vectors, and the hyperplane at the middle of the margin is the optimal hyperplane. Hyperplanes with broader margins are expected to classify unseen data better than those with narrower margins.
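In the standard soft-margin formulation (not spelled out in the original text), maximising the margin amounts to solving

\[ \min_{\mathbf{w},\,b,\,\boldsymbol{\xi}} \ \frac{1}{2}\lVert \mathbf{w}\rVert^{2} + C \sum_{i=1}^{n} \xi_{i} \quad \text{subject to} \quad y_{i}(\mathbf{w}\cdot\mathbf{x}_{i}+b) \ge 1 - \xi_{i},\ \ \xi_{i} \ge 0, \]

where the slack variables \(\xi_i\) allow some margin violations and the parameter C controls how heavily those violations are penalised.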

These classifiers frequently show good generalisation capabilities when compared to others. However, their models are non-intuitive and hard to interpret [9]. To overcome that limitation, several techniques have been developed to extract knowledge from SVMs and help interpret their classifications, such as SVM+Prototype [16], Barakat [3], Fung [9], SHAP [14], and others [10].

2.3 The SHAP Framework

SHAP (SHapley Additive exPlanations) is a unified approach to interpret predictions and explain the outputs of any machine learning model. SHAP connects game theory with local explanations to attribute an importance measure to each feature for a given prediction [14], with larger values indicating a greater contribution of a feature to that prediction. SHAP computes this importance measure using Shapley values, introduced in game theory in 1953 [20] but only recently applied in this context.

The SHAP framework unified six existing feature importance methods and, according to the authors, guarantees three desirable properties for methods of this class, with better computational performance and interpretability than other approaches [14]. These properties are: (1) Local Accuracy: the sum of the feature importance attributions equals the model's output; (2) Missingness: missing features are attributed no impact on the model's output; and (3) Consistency: changing a model so that a feature has a larger impact on it will never reduce the importance attributed to that feature.
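Concretely, SHAP belongs to the class of additive feature attribution methods [14], whose explanation model is a linear function of binary variables indicating whether each feature is present:

\[ g(z') = \phi_{0} + \sum_{i=1}^{M} \phi_{i} z'_{i}, \]

where M is the number of features, \(z'_i \in \{0,1\}\) indicates the presence of feature i, and \(\phi_i\) is the importance (SHAP value) attributed to it.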

Computing SHAP values is conceptually simple but computationally expensive. The idea is to re-train the model on all feature subsets \(S \subseteq F\), where F is the set of all features. The Shapley values attribute an importance measure to each feature, representing its impact on the model's predictions. To estimate this impact, the predictions of models trained with and without the feature are compared. Since the impact of a feature also depends on the other features included in the model, these comparisons are made for all possible feature subsets, and the Shapley value is a weighted average over all of them.
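This weighted average is the classic Shapley value of feature i, as used in [14]:

\[ \phi_{i} = \sum_{S \subseteq F \setminus \{i\}} \frac{|S|!\,(|F|-|S|-1)!}{|F|!} \left[ f_{S \cup \{i\}}\big(x_{S \cup \{i\}}\big) - f_{S}\big(x_{S}\big) \right], \]

where \(f_S\) denotes the model trained on the feature subset S and \(x_S\) the values of the input features in S.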

Figure 1 illustrates the SHAP approach to explaining the output of a machine learning model: the framework takes the model generated by a machine learning method and outputs its feature importance measures.

Fig. 1. Diagram of the SHAP framework.

3 Related Work

In [12], the authors analysed data from a city in India served by a phone-call-based screening system used to support the screening of tuberculosis patients. The dataset contained close to 17 thousand patients and 2.1 million registered phone calls. The authors report that the technique with the best predictive performance was a deep learning approach, considered a black box, and they employed the SHAP framework to generate visualisations that help explain the model and provide insight to the medical researchers. They concluded that, in a real-time application scenario, the model would be able to support health professionals in making precise interventions for high-risk patients.

Yan et al. [23] used data from the Acute Myocardial Infarction registry in China and applied the XGBoost machine learning method to build a risk prediction model for hospital mortality among patients who had suffered a myocardial infarction. They employed the SHAP framework to explain the impact of the features on the predictions and, from its results, were able to find new relations between clinical variables and hospital mortality. One example was the blood glucose level, which showed a nearly linear relationship with hospital mortality in these patients. The authors concluded that the new prediction model had good discrimination capability and offered individualised explanations of how the clinical variables influenced the results.

4 Materials and Methods

4.1 Dataset Description

The dataset utilised in this study was obtained through a partnership with the Cognition and Behavioural Psychology Postgraduate Research Program of the Federal University of Minas Gerais. It holds information from 377 children and adolescents between 10 and 16 years of age (158 male and 219 female) and has 75 featuresFootnote 3 representing different symptoms of a possible depression disorder.

The dataset stores broad demographic data, such as the patient's age and gender, and more specific data, such as schooling, who they live with, use of medication, Youth Self-Report (YSR) scores, and items of the Children's Depression Inventory (CDI) [13]. It also stores information about the patient's relationship with their parents, such as the number of hours per week they spend together, whether the patient or the parents have had psychological or psychiatric treatment, and the parents' schooling. Other features deemed important by the mental health research community were also included, such as anxiety factors, social problems, lack of attention, aggressiveness, and behavioural issues. Most features in the dataset have ordered categorical values.

4.2 Data Preprocessing

In order to obtain a more robust model, we preprocessed the dataset before training the classification models. The goals of the preprocessing were to remove features unrelated to the problem, merge features when necessary, encode the features, and handle missing data and outliers. All data preprocessing was done in Python, using the Jupyter Notebook environment. The first preprocessing tasks, executed sequentially, were the removal of irrelevant features and the treatment of inconsistencies in the data. For the second task, two instances had unexpected values for one feature, daily time spent with the parents, which exceeded the maximum possible in that context; for these cases we assumed the greatest possible value, 24 hours.
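A minimal sketch of this step, assuming the data is loaded into a pandas DataFrame and using hypothetical file and column names for illustration:

```python
import pandas as pd

# Hypothetical file and column names, used only for illustration.
df = pd.read_csv("depression_dataset.csv")

# Cap implausible values of daily time spent with the parents at the
# maximum possible value in this context (24 hours).
df["hours_per_day_with_parents"] = df["hours_per_day_with_parents"].clip(upper=24)
```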

Continuing the preprocessing, the following tasks were performed: encoding nominal values into numeric ones and identifying the class feature. We observed that, among the patients in the dataset, the "CDI score" feature had values between 0 and 46. The CDI score does not determine a diagnosis of depression, but it provides evidence that can support a precise diagnosis, and it is calculated from evaluations made by professionals. However, there is no unanimous threshold for indicating depression, as the appropriate value can vary across samples. Kovacs [13] recommends using an 85th-percentile threshold to indicate high symptomatology. Using this criterion, 63 patients in our dataset had CDI scores high enough to be labelled as HIGH symptomatology, and the others were labelled as LOW.
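A minimal sketch of the class labelling step, reusing the hypothetical DataFrame df above and assuming a "CDI_score" column:

```python
# 85th-percentile cut-off suggested by Kovacs [13]; patients at or above
# the threshold are labelled HIGH symptomatology, the others LOW.
threshold = df["CDI_score"].quantile(0.85)
df["symptomatology"] = (df["CDI_score"] >= threshold).map({True: "HIGH", False: "LOW"})
```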

The dataset has 314 individuals of the LOW class and 63 of the HIGH class. To prevent the classifier from being biased towards the majority class, we employed a random undersampling strategy until the number of patients in both classes was the same.
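A minimal sketch of the random undersampling step, under the same assumptions as the previous snippets:

```python
import pandas as pd

# Keep all HIGH instances and randomly sample an equal number of LOW instances.
high = df[df["symptomatology"] == "HIGH"]
low = df[df["symptomatology"] == "LOW"].sample(n=len(high), random_state=42)
balanced = pd.concat([high, low]).sample(frac=1, random_state=42)  # shuffle
```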

To validate the generated classification models, we divided the dataset into a training set and a test set. The model is built on the training instances and then evaluated on the unseen test instances. The division is shown in Table 1.

Table 1. Number of instances per class.

For testing, we randomly selected 10 instances of the HIGH class and 50 of the LOW class, maintaining the original class proportions. The model was trained with 53 instances of each class from the balanced dataset, using 10-fold cross-validation.

4.3 Methods

The SVM experiments were conducted using the libSVM implementation available in Python's scikit-learn open-source libraryFootnote 4. The algorithm was selected based on its frequent use in the literature and for meeting the requirements of our study. Three parameters were adjusted when training the classifier: C (\(C=12\)), a penalty parameter controlling the softness of the hyperplane margins; gamma (\(gamma=0.001\)), the width of the Gaussian kernel; and the kernel type (\(kernel=rbf\)). These parameters are highly relevant to the performance of the model, as they directly affect training time and prediction performance. The SVM parameters were tuned using Grid Search, which performs an exhaustive search over specified parameter values and finds the best combination according to a quality criterion.
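A sketch of the parameter tuning step with scikit-learn's GridSearchCV; the candidate value grids are illustrative, and X_train / y_train are assumed to come from the preprocessing described above:

```python
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

param_grid = {
    "C": [1, 10, 12, 100],          # illustrative candidate values
    "gamma": [0.0001, 0.001, 0.01],
    "kernel": ["rbf"],
}
search = GridSearchCV(SVC(), param_grid, cv=10)
search.fit(X_train, y_train)        # training data from the preprocessing steps
print(search.best_params_)          # e.g. {'C': 12, 'gamma': 0.001, 'kernel': 'rbf'}
```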

Given the SVM’s complexity and the need to understand the importance of each feature for the predictions, after training and validating the SVM model, we employed the SHAP framework [14] to help interpret the classifier’s output.
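A sketch of this step using SHAP's model-agnostic KernelExplainer; svm, X_train and X_test are assumed to come from the training step above, and decision_function is one possible choice of model output to explain:

```python
import shap

# Use a small background sample to keep the (expensive) computation tractable.
background = shap.sample(X_train, 100)
explainer = shap.KernelExplainer(svm.decision_function, background)
shap_values = explainer.shap_values(X_test)

# Summary plot of feature impacts, similar in spirit to Fig. 2.
shap.summary_plot(shap_values, X_test)
```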

5 Results and Discussion

In this section, we present the results obtained from the SVM model trained with the preprocessed dataset. Table 2 shows the average values of the evaluation metricsFootnote 5.

Table 2. Training set results (in percentage).

The training set results show greater precision for the HIGH class: 90% of the instances predicted as HIGH actually belonged to that class. The recall was noticeably higher for the LOW class: the model correctly identified 90.6% of the LOW instances.

Table 3 shows the test set results. For these experiments, the test set had 10 instances of the HIGH class and 50 of the LOW class, keeping the proportions of the original dataset, and all instances were previously unseen by the model. Analysing the F-measure, the harmonic mean of precision and recall, we see that the SVM's performance remained good, despite a reduction for the HIGH class.

Table 3. Test set results (in percentage).

In the next part of the study, we identified the most relevant features in the model based on the test set results. Figure 2 shows the SHAP values of the most relevant features for all test set examples, ranked by their impact on the classifier's output.

The horizontal axis represents the impact (SHAP value) of a feature: positive values mean that the feature's value increases the likelihood of the positive class (HIGH), whereas negative values mean the opposite. The features are ordered vertically by their average impact, with the highest impact at the top. Each point in a feature's distribution represents a single patient, with high densities shown by the stacking of points. The colour of a point represents a high (red) or low (blue) value of the feature. The Figure clearly shows how high or low feature values affect the SHAP values.

Fig. 2. Impact of the most relevant features for the classification of test set instances. For each feature, the vertical dispersion represents data points with the same SHAP value for that feature. Higher SHAP values mean a greater likelihood of a positive class prediction (HIGH). (Color figure online)

To better understand Fig. 2, consider the CDI20_T1 attribute, which has the highest impact in the model. Red points indicate higher values of the attribute, while blue points indicate lower values. CDI20_T1 characterises the feeling of loneliness and takes the following values: (0) I do not feel lonely, (1) I often feel lonely, and (2) I always feel lonely; hence, the higher the value, the greater the feeling of loneliness. High values of this attribute (red points) indicate a higher probability of a HIGH class prediction, while low values (blue points) indicate a higher probability of a LOW class prediction.

An interesting observation from the Figure is that, among the 20 most relevant features in the model, 16 come from the CDI, with CDI20_T1 (feelings of loneliness), CDI7_T1 and CDI14_T1 (both related to low self-esteem) being the three features with the greatest impact on the predictions.

These results corroborate previous findings [2, 22, 24] that self-esteem is an important factor in depression and suggest that continuous interventions to increase self-esteem during adolescence can greatly reduce the degree of depression. In [15], the authors investigated the association of social isolation and loneliness with depression in young adulthood and concluded that both were associated with depression.

6 Conclusions and Future Work

The objective of this studyFootnote 6 was to explore the diagnosis of depression disorder in children and adolescents using an SVM classifier. Although there are several studies applying machine learning classification algorithms to tasks related to depression diagnosis, few focus on younger individuals. Young patients are usually harder to evaluate in depth, which makes it more difficult to reach the detailed analyses that lead to a precise diagnosis; hence the need for accurate predictions based on the data that can be collected from them.

Other authors have considered classification metrics above 75% as satisfactory and meaningful [19]. Our model surpasses that threshold and has a good success rate in discriminating the classes. With the SHAP framework, we were able to analyse the predictions made by the SVM model and discuss feature importance to some extent.

The model uses the symptomatology feature as its class value, which is calculated from CDI data. As future work, we suggest applying classifiers to a dataset that uses other means of identifying young individuals with depression symptoms, and discussing other thresholds for classifying the disorder. We also recommend a deeper classification analysis, in order to possibly reach more robust models (using other algorithms or parameter combinations). Another possible direction is to use rule extraction techniques on SVM models to analyse the predictions made by the classifier and further understand the diagnosis of depression in children and adolescents.