Abstract
Depression is the main worldwide cause of illness and disability during adolescence. The disorder afflicts over 300 million people and can interfere with an individual's education and professional performance. It is therefore essential to conduct research that contributes to the correct diagnosis and treatment of depression, especially in children and adolescents. The Support Vector Machines (SVM) classifier has shown strong performance and generalisation capabilities when compared to other classifiers in the context of depression diagnosis. The objective of this study is to explore the depression disorder in children and adolescents using this classifier. Since the SVM is a black box method, to better understand the generated model we employed the SHAP approach, which helps explain the model's output based on feature importance. The final model obtained F-measure results above 87% during training and 82% in testing. We conclude that the predictive model had satisfactory results and, using the SHAP framework, we explored how the features influenced those results.
1 Introduction
The depression disorder, according to the World Health Organization (WHO)Footnote 1, is different from common mood swings and short-lived emotional responses to everyday challenges. WHO points to it as the main worldwide cause of illness and disability in adolescents, and it can evolve into a serious health issue when it is long-lasting and of moderate or high intensity. Depression, as a disorder, afflicts about 300 million people, an increase of over 18% between 2005 and 2015. When untreated, the disorder causes great suffering, interferes with the individual's professional performance, education, and relationships, and, in the worst cases, may lead to suicide. Every year, about 800 thousand people die by suicide, the second leading cause of death among young people between 15 and 29 years of age [1]. Some studies [21] claim that half of the people who suffer from mental disorders show their first symptoms before the age of 15. The American Psychiatric Association (APA)Footnote 2 claims that one in every six people will suffer depressive episodes during their lives, which indicates over a billion possible victims worldwide. Therefore, caring for children and adolescents with mental health issues is important to avoid death and suffering throughout their lives.
A precise diagnosis is fundamental before administering psychological and pharmacological treatments for depression. Thus, there is a need for research on the diagnosis of depression, as well as on its treatment. Several researchers have conducted studies to aid in depression diagnosis, and among them some promising results came from applying machine learning techniques. The literature has shown satisfactory results from combining machine learning and pattern recognition techniques to characterise diseases, proving particularly efficient for mental health issues such as depression [17].
The Support Vector Machines (SVM) classifier has shown superior performance and generalisation capabilities when compared to other classification techniques in various applications, including depression diagnosis [6]. We performed preliminary experiments training different classification algorithms with the dataset used in our study, and the SVM outperformed all others, namely: C4.5 decision tree [18], CART [4], Multilayer Perceptron Neural Network, and Random Forests [5] (results not shown). The SVM is considered a black box method, as it conceals its internal logic from the user, creating models that are difficult to interpret. Because of that limitation, and because interpreting the generated model was especially important in this application, we employed the SHAP framework [14] to help explain the output of the SVM classifier and characterise the diagnosed individuals using a feature importance metric. The classifier was trained on a dataset with data from 377 patients between 10 and 16 years of age. The data was obtained through a partnership with the Cognition and Behavioural Psychology Postgraduate Research Program of a University.
2 Background
2.1 Black Box Models
As machine learning algorithms become more complex and precise, they often become less comprehensible and generate models that are harder to interpret. A model is said to be a black box if its internal structure is unknown or hard to interpret, making its classifications hard to explain. The behaviour of a black box model can be described as follows: given an input, the model calculates the output based on an internal function, without providing an explanation of how it reached its result.
Although they usually have superior generalisation capabilities, when compared to other classifiers, the non-intuitive solutions provided by black box models can become an obstacle to their practical use, especially when it is vital to the project to explain how the classification was made.
Lately, black box classifiers such as the SVM and Artificial Neural Networks have been achieving good results in several applications, but the low interpretability of their models hinders their applicability for cases where the classification process needs to be understood, such as medical applications. In [8], the authors argue that even a limited explanation can positively influence the likelihood of these methods being applied in such cases.
Some rule induction techniques, such as Quinlan’s C4.5 decision trees [18], build highly interpretable models, but are likely to lose performance for doing so, being outperformed by more complex classifiers such as the SVM. Therefore, efforts for extracting rules that may help explain black box models have been made, to maintain their superior performance and gain some interpretability. The ultimate goal is to have the performance of black box models associated to a transparent and easily interpretable model that can, for example, be modelled as a decision tree or a rule set [10].
A black box model can be explained either from a global or a local perspective. A global explanation considers the internal functioning of the whole model [11]. The local explanations aim to elucidate the reasoning behind a single prediction. The rule extraction algorithms that are used for this purpose can be classified as “pedagogical” or “decomposition”. The pedagogical approach extracts rules directly related to the inputs and outputs of a classifier. The approach utilises the trained model as an oracle to produce a set of examples of inputs and outputs, and then applies pattern search strategies to construct its model (a decision tree, for instance). The decomposition approach is interwoven with the internal structure of the SVM and its hyperplanes, aiming to explain the individual computation of the internal components in the model [10].
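The pedagogical approach described above can be sketched in a few lines: use the trained black box as an oracle to label a set of probe inputs, then fit an interpretable surrogate to the oracle's answers. The data and model below are synthetic stand-ins, not from the paper.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic data as a stand-in; the oracle-querying scheme is generic.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
black_box = SVC().fit(X, y)

# Pedagogical approach: treat the trained model as an oracle, label a
# set of probe inputs with it, then fit an interpretable surrogate
# (here a shallow decision tree) to the oracle's answers.
rng = np.random.RandomState(1)
X_probe = rng.uniform(X.min(axis=0), X.max(axis=0), size=(500, 5))
oracle_labels = black_box.predict(X_probe)
surrogate = DecisionTreeClassifier(max_depth=3).fit(X_probe, oracle_labels)

# Fidelity: how often the surrogate agrees with the black box.
fidelity = (surrogate.predict(X_probe) == oracle_labels).mean()
```

The fidelity score quantifies how faithfully the transparent surrogate mimics the black box on the probed region of the input space.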
2.2 Support Vector Machines
The SVM [7] classifier is based on the Statistical Learning Theory. It constructs a hyperplane as a decision surface, maximising the separation between classes. Several hyperplanes can be constructed to separate the instances, each of them defining a separation margin; the points lying on the margin boundaries are the support vectors, and the middle of the margin is the optimal hyperplane. Hyperplanes with broader margins are expected to classify unseen data better than those with narrower margins.
These classifiers frequently show good generalisation capabilities when compared to others. However, their models are non-intuitive and hard to interpret [9]. To overcome that limitation, several techniques to extract knowledge from SVM were developed, to help interpret them and explain their classifications, such as: SVM+Prototype [16], Barakat [3], Fung [9], SHAP [14], and others [10].
2.3 The SHAP Framework
SHAP (SHapley Additive exPlanations) is a unified approach to interpret predictions and explain the outputs of any machine learning model. SHAP connects game theory with local explanations to attribute an importance measure to each feature for a given prediction [14], with larger values indicating a greater contribution of a feature to a prediction. SHAP calculates this feature importance measure using Shapley values, introduced in 1953 in the game theory field [20], but only recently applied in this context.
The SHAP framework has unified six existing feature importance measures and, according to the authors, guarantees three desirable properties, with better computational performance and interpretability than other approaches [14]. These properties are: (1) Local Accuracy: the sum of the feature importance attributions is equal to the model's output; (2) Missingness: missing features are not attributed any impact on the model's output; and (3) Consistency: altering a model so that a feature has a bigger impact on it will never reduce that feature's importance attribution.
The calculation of the SHAP values is conceptually simple, but computationally expensive. The idea is to re-train the model on all feature subsets \(S \subseteq F\), where F is the set of all features. The Shapley values attribute an importance measure to each feature, representing its impact on the model's predictions. To calculate this impact, the method compares the predictions of models trained with and without the feature. As the impact of a feature also depends on the other features included in the model, these comparisons are made for all possible feature subsets, and the Shapley values are a weighted average of all comparisons.
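Formally, the weighted average described above is the Shapley value of feature i:

\(\phi_i = \sum_{S \subseteq F \setminus \{i\}} \frac{|S|! \, (|F| - |S| - 1)!}{|F|!} \left[ f_{S \cup \{i\}}(x_{S \cup \{i\}}) - f_S(x_S) \right]\)

where \(f_S\) denotes the model trained on the feature subset S and \(x_S\) the input restricted to the features in S [14].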
Figure 1 illustrates the SHAP approach to explain the output of a machine learning model. The framework, thus, obtains the model generated by a machine learning method and outputs its feature importance measures.
3 Related Work
In [12] the authors analysed data from a city in India serviced by a phone-call based screening system for tuberculosis patients, used to help the patient screening process. The dataset had close to 17 thousand patients and 2.1 million registered phone calls. The authors report that the technique with best predictive performance was a deep learning approach, considered a black box. They employed the SHAP framework to generate visualisations and help explain the model, providing insight to the medical researchers. Their conclusions were that, in a real-time application scenario, the model would be able to support health professionals in making precise interventions on high risk patients.
Yang et al. [23] utilised data from the Acute Myocardial Infarction registry, from China, and applied the XGBoost machine learning method to generate a risk prediction model for in-hospital mortality among patients that had suffered a myocardial infarction. They employed the SHAP framework to explain the impact of the features on the predictions and, from its results, were able to find new relations between clinical variables and hospital mortality. One example of such relations was the blood glucose level, which showed a nearly linear relationship with hospital mortality in the patients. The authors concluded that the new prediction model had good discrimination capability and offered individualised explanations of how the clinical variables had influenced the results.
4 Materials and Methods
4.1 Dataset Description
The dataset utilised in this study was obtained from a partnership with the Cognition and Behavioural Psychology Postgraduate Research Program of the Federal University of Minas Gerais. The dataset holds information on 377 children and adolescents between 10 and 16 years of age (158 male and 219 female), and has 75 featuresFootnote 3 representing different symptoms of a possible depression disorder.
The dataset stores broader demographic data such as the patient's age and gender, and more specific data such as schooling, who they live with, use of medication, Youth Self-Report (YSR) scores, and questions of the Children's Depression Inventory (CDI) [13]. The dataset also stores information about the patient's relationship with their parents, such as the hours a week they spend with their parents, whether the patient or the parents have had psychological or psychiatric treatment, and the parents' schooling. Other features deemed important by the mental health research community were included, such as anxiety factors, social problems, lack of attention, aggressiveness, and behavioural issues. Most features in the dataset have ordered categorical values.
4.2 Data Preprocessing
In order to obtain a more robust model, before training the classification models, we preprocessed the dataset. The goals of the preprocessing were to remove features unrelated to the problem, merge features when necessary, encode the features, and handle missing data and outliers. All the data preprocessing was done in Python on the Jupyter Notebook framework. The preprocessing tasks, executed sequentially, were: removal of irrelevant features and treatment of inconsistencies in the data. For the second task, two instances had unexpected values for the daily time spent with the parents (values greater than the limit in this context). For these cases, we assumed the greatest possible value, 24 hours.
Continuing the preprocessing, the following tasks were performed: nominal-to-numeric value encoding and identification of the class feature. We observed that, among the patients in the dataset, the "CDI score" feature had values between 0 and 46. The CDI score does not determine a diagnosis of depression, but it does show evidence that can support a precise diagnosis, and is calculated from evaluations made by professionals. However, there is no unanimous threshold to determine a depression diagnosis, as this value can vary across samples. The recommendation by Kovacs [13] is to utilise an 85th percentile threshold to indicate high symptomatology. Thus, in our dataset, 63 patients had CDI scores high enough to be classified as HIGH symptomatology, and the others were classified as LOW.
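The percentile-based labelling can be sketched as follows. Note that "cdi_score" is a hypothetical column name and the scores below are synthetic, since the real dataset is not public.

```python
import numpy as np
import pandas as pd

# Illustrative only: synthetic CDI scores in the observed 0-46 range.
rng = np.random.RandomState(0)
df = pd.DataFrame({"cdi_score": rng.randint(0, 47, size=377)})

# Kovacs recommends the sample's 85th percentile as the
# high-symptomatology cut-off; the threshold is sample-dependent.
threshold = df["cdi_score"].quantile(0.85)
df["symptomatology"] = np.where(df["cdi_score"] >= threshold, "HIGH", "LOW")
```

Because the cut-off is computed from the sample itself, the same score may be labelled differently in different cohorts.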
The dataset has 314 individuals of the LOW class and 63 of the HIGH class. In order to prevent the classifier from developing a bias towards the majority class, we employed a class balancing strategy of random undersampling, until the number of patients in both classes was the same.
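Random undersampling can be sketched as below; the toy run mirrors the paper's 314 LOW / 63 HIGH split, and the helper function is our own illustration, not the authors' code.

```python
import numpy as np

def undersample(X, y, seed=0):
    """Randomly discard majority-class rows until the classes are balanced."""
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    n_min = counts.min()
    keep = []
    for c, n in zip(classes, counts):
        idx = np.flatnonzero(y == c)
        if n > n_min:
            idx = rng.choice(idx, size=n_min, replace=False)
        keep.append(idx)
    keep = np.concatenate(keep)
    return X[keep], y[keep]

# Toy run mirroring the paper's class distribution.
X = np.arange(377).reshape(-1, 1)
y = np.array(["LOW"] * 314 + ["HIGH"] * 63)
Xb, yb = undersample(X, y)
# Both classes now have 63 instances each.
```

Undersampling discards information from the majority class, which is acceptable here given the small minority class, but oversampling strategies are a common alternative.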
To validate the generated classification models, we divided the dataset into a training and a test set. The model is created using the training instances and, then, evaluated in the unseen test instances. The division is shown in Table 1.
From the balanced dataset, we randomly selected 10 instances of the HIGH class and 50 of the LOW class for testing, maintaining the original class proportions. The model was trained with the remaining 53 instances of each class, using 10-fold cross-validation.
4.3 Methods
The SVM experiments were conducted using the libSVM implementation in Python's scikit-learn open-source libraryFootnote 4. The algorithm was selected based on its frequent use in the literature and for meeting the requirements of our study. Three parameters were adjusted when training the classifier: the C value (\(C=12\)), a smoothing parameter for the hyperplane margins; the gamma (\(gamma=0.001\)), which is the width of the Gaussian; and the kernel type (\(kernel=rbf\)). These parameters are highly relevant to the performance of the model, as they are directly related to the training times and prediction performance. The SVM parameters were adjusted using the Grid Search algorithm. Grid Search performs an exhaustive search over specified parameter values for a classifier and finds the best combination of values, based on a quality criterion.
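A minimal sketch of this tuning step with scikit-learn is shown below. The data is a synthetic stand-in for the 53+53 balanced training set, and the exact search ranges are assumptions, not the ones used in the paper.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic stand-in for the balanced training set (106 instances).
X, y = make_classification(n_samples=106, n_features=20, random_state=0)

param_grid = {
    "C": [1, 10, 12, 100],          # margin-smoothing penalty
    "gamma": [1e-3, 1e-2, 1e-1],    # RBF kernel width
    "kernel": ["rbf"],
}
# Exhaustive search with 10-fold cross-validation, scored by F-measure.
search = GridSearchCV(SVC(), param_grid, cv=10, scoring="f1")
search.fit(X, y)
best = search.best_params_
```

The `best_params_` attribute then holds the winning combination, which in the paper's case was \(C=12\), \(gamma=0.001\), and the RBF kernel.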
Given the SVM’s complexity and the need to understand the importance of each feature for the predictions, after training and validating the SVM model, we employed the SHAP framework [14] to help interpret the classifier’s output.
5 Results and Discussion
In this section, we present the results obtained from the SVM model trained with the preprocessed dataset. Table 2 shows the average values of the evaluation metricsFootnote 5.
The training set results showed greater precision for the HIGH class: 90% of the instances the model classified as HIGH indeed belonged to that class. There was also a noticeably higher recall value for the LOW class, meaning the model correctly identified 90.6% of the instances of that class.
Table 3 shows the test set results. For these experiments, the test sets had 10 instances of the HIGH class and 50 of the LOW class, keeping the proportions of the original dataset, and all instances were previously unseen by the model. Analysing the F-measure metric (the harmonic mean of precision and recall), we see that the SVM performance was good, despite a reduction for the HIGH class.
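The metrics defined in Footnote 5 can be computed directly from confusion-matrix counts; the counts below are illustrative only, not the paper's actual confusion matrix.

```python
def precision(tp, fp):
    # Fraction of positive predictions that are correct.
    return tp / (tp + fp)

def recall(tp, fn):
    # Fraction of actual positives that are identified.
    return tp / (tp + fn)

def f_measure(p, r):
    # Harmonic mean of precision and recall.
    return 2 * p * r / (p + r)

# Illustrative counts: 45 true positives, 5 false positives, 10 false negatives.
p, r = precision(45, 5), recall(45, 10)
score = f_measure(p, r)
```

The F-measure penalises imbalance between precision and recall, which is why it is a stricter summary than either metric alone.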
In the following part of the study, we identified the most relevant features in the model based on the test set results. Figure 2 shows the SHAP values for the most relevant features over all test set examples, ordered by their impact on the classifier's output.
The horizontal axis represents the impact (SHAP value) of a feature, with positive values meaning that the feature's values increase the likelihood of the positive class (HIGH), and negative values meaning the opposite. The features are ordered vertically by their average impact, with the highest impact at the top. Each point in a feature's distribution represents a single patient, with high density represented by the stacking of points. The colours of the points represent high (red) and low (blue) values of the features. The Figure clearly shows how high or low values of the features impact their SHAP values.
To better understand Fig. 2, consider the CDI20_T1 feature, which has the highest impact in the model. Red points indicate higher values for the feature, while blue points represent lower values. CDI20_T1, which characterises the feeling of loneliness, has the following values: (0) I don't feel lonely, (1) I feel lonely often, and (2) I always feel lonely; therefore, the higher the value, the greater the feeling of loneliness. High values of this feature (red dots) indicate a higher probability of a HIGH class prediction, while low values (blue dots) indicate a higher probability of a LOW class prediction.
An interesting observation from the Figure is that, among the 20 most relevant features in the model, 16 are from the CDI, with CDI20_T1 (loneliness feelings), CDI7_T1, and CDI14_T1 (both low self-esteem) being the three features with the greatest impact on the prediction.
These results corroborate the findings [2, 22, 24] that self-esteem is an important factor in depression, and suggest that continuous interventions to increase self-esteem during adolescence can greatly reduce the degree of depression. In [15], the authors investigate, in young adulthood, the association of social isolation and loneliness with depression, and conclude that both were associated with depression.
6 Conclusions and Future Work
The objective of this studyFootnote 6 was to explore the diagnosis of depression disorder in children and adolescents, utilising an SVM classifier. Although there are several studies applying machine learning classification algorithms to tasks related to depression diagnosis, few focus on younger individuals. Young patients are usually harder to evaluate in depth, which makes it difficult to reach the complex analyses that lead to a precise diagnosis; hence the need for accurate predictions based on high-level data.
Other authors have considered classification metrics above 75% as satisfactory and meaningful [19]. Our model surpasses that threshold, and has a good success rate in discriminating the classes. With the SHAP framework, we were able to analyse the predictions made by the SVM model, and discuss feature importance to an extent.
The model uses the symptomatology feature as its class value. This feature is calculated based on data from the CDI. As future work, we suggest applying classifiers to a dataset that utilises other means of classifying young individuals with depression symptoms, and discussing other thresholds to classify the disorder. We also recommend deeper classification analysis, in order to possibly reach more robust models (using other algorithms or parameter combinations). Another possible future direction is to use techniques of rule extraction from SVM models to analyse the predictions made by the classifier, and further understand the diagnosis of depression in children and adolescents.
Notes
- 1.
Available at http://www.who.int/mediacentre/factsheets/fs369/en/.
- 2.
Available at https://www.psychiatry.org/.
- 3.
A complete description of the features of this dataset can be found at https://goo.gl/z2wUKg.
- 4.
- 5.
\(Precision = \frac{TP}{TP+FP}\); \(Recall = \frac{TP}{TP+FN}\); \(\text{F-measure} = \frac{2 \times Precision \times Recall}{Precision + Recall}\).
- 6.
This study was financed in part by the Coordenação de Aperfeiçoamento de Pessoal de Nível Superior - Brasil (CAPES) - Finance Code 001, by Foundation for Research Support of the State of Minas Gerais/Brazil (FAPEMIG) and the Brazilian National Council for Scientific and Technological Development (CNPq).
References
Anderson, R.N., Smith, B.L., et al.: Deaths: leading causes for 2002. Natl. Vital Stat. Rep. 53(17), 1–90 (2005)
Babore, A., Trumello, C., Candelori, C., Paciello, M., Cerniglia, L.: Depressive symptoms, self-esteem and perceived parent-child relationship in early adolescence. Front. Psychol. 7, 982 (2016)
Barakat, N., Diederich, J.: Eclectic rule-extraction from support vector machines. Int. J. Comput. Intell. 2(1), 59–62 (2005)
Breiman, L., Friedman, J., Olshen, R., Stone, C.: Classification and Regression Trees. Wadsworth International Group, Monterey (1984)
Breiman, L.: Random forests. Mach. Learn. 45(1), 5–32 (2001)
Byun, H., Lee, S.-W.: Applications of support vector machines for pattern recognition: a survey. In: Lee, S.-W., Verri, A. (eds.) SVM 2002. LNCS, vol. 2388, pp. 213–236. Springer, Heidelberg (2002). https://doi.org/10.1007/3-540-45665-1_17
Cortes, C., Vapnik, V.: Support-vector networks. Mach. Learn. 20(3), 273–297 (1995)
Davis, R., Buchanan, B., Shortliffe, E.: Production rules as a representation for a knowledge-based consultation program. Artif. Intell. 8(1), 15–45 (1977)
Fung, G., Sandilya, S., Rao, R.B.: Rule extraction from linear support vector machines. In: Proceedings of the Eleventh ACM SIGKDD International Conference, pp. 32–40. ACM, New York (2005)
Guidotti, R., Monreale, A., Ruggieri, S., Turini, F., Giannotti, F., Pedreschi, D.: A survey of methods for explaining black box models. ACM Comput. Surv. (CSUR) 51(5), 93 (2018)
Hall, P., Gill, N.: Introduction to Machine Learning Interpretability. O’Reilly Media Incorporated, Sebastopol (2018)
Killian, J.A., Wilder, B., Sharma, A., Choudhary, V., Dilkina, B., Tambe, M.: Learning to prescribe interventions for tuberculosis patients using digital adherence data. arXiv preprint arXiv:1902.01506 (2019)
Kovacs, M.: Children’s Depression Inventory (CDI): technical manual update. Multi-Health Systems (1992)
Lundberg, S.M., Lee, S.I.: A unified approach to interpreting model predictions. In: Advances in Neural Information Processing Systems 30. Curran Associates, Inc. (2017)
Matthews, T., et al.: Social isolation, loneliness and depression in young adulthood: a behavioural genetic analysis. Soc. Psychiatry Psychiatr. Epidemiol. 51, 339–348 (2016)
Núñez, H., Angulo, C., Català, A.: Rule-based learning systems for support vector machines. Neural Process. Lett. 24(1), 1–18 (2006)
Orrù, G., Pettersson-Yeo, W., Marquand, A.F., Sartori, G., Mechelli, A.: Using support vector machine to identify imaging biomarkers of neurological and psychiatric disease: a critical review. Neurosci. Biobehav. Rev. 36(4), 1140–1152 (2012)
Quinlan, J.R.: C4.5: Programs for Machine Learning. Morgan Kaufmann Publishers Inc., San Francisco (1993)
Sacchet, M.D., Prasad, G., Foland-Ross, L.C., Thompson, P.M., Gotlib, I.H.: Support vector machine classification of major depressive disorder using diffusion-weighted neuroimaging and graph theory. Front. Psychiatry 6, 21 (2015)
Shapley, L.S.: A value for n-person games. Contrib. Theory Games 2(28), 307–317 (1953)
Sunmoo, Y., Basirah, T., et al.: Using a data mining approach to discover behavior correlates of chronic disease: a case study of depression. Stud. Health Technol. Inform. 201, 71 (2014)
Ticusan, M.: Low self-esteem, premise of depression appearance at adolescents. Procedia Soc. Behav. Sci. 69, 1590–1593 (2012). International Conference on Education & Educational Psychology (ICEEPSY 2012)
Yang, J., Li, Y., Li, X., Chen, T., Xie, G., Yang, Y.: An explainable machine learning-based risk prediction model for in-hospital mortality for Chinese STEMI patients: findings from China myocardial infarction registry. J. Am. Coll. Cardiol. 73, 261 (2019)
Yoon, M., Cho, S., Yoon, D.: Child maltreatment and depressive symptomatology among adolescents in out-of-home care: the mediating role of self-esteem. Child. Youth Serv. Rev. 101, 255–260 (2019)
© 2019 Springer Nature Switzerland AG
Lima, T., Santana, R., Teodoro, M., Nobre, C. (2019). Knowledge Extraction from Vector Machine Support in the Context of Depression in Children and Adolescents. In: Nyström, I., Hernández Heredia, Y., Milián Núñez, V. (eds) Progress in Pattern Recognition, Image Analysis, Computer Vision, and Applications. CIARP 2019. Lecture Notes in Computer Science(), vol 11896. Springer, Cham. https://doi.org/10.1007/978-3-030-33904-3_51
DOI: https://doi.org/10.1007/978-3-030-33904-3_51
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-33903-6
Online ISBN: 978-3-030-33904-3