
1 Introduction

The advance of technology, the facility of satellite communications, and the ubiquity of the Internet [1] have multiplied the extent and impact of ideologically and politically motivated acts of violence, expanded their scope beyond specific locales and regions, and made them a growing threat to humanity across the world [2]. In such violence, groups or individuals commit acts of extreme brutality against a leader, citizens, an entire city, or a nation. The motivation is usually a radicalized interpretation of defending a greater good, a political cause, or an extreme ideology [3]. Such acts of violence disturb people's minds and everyday lives and destabilize societies and peace. They carry human and economic tolls and challenge sustainable development in developed and developing countries alike [4].

In the age of information, increasingly detailed datasets describe such violent acts across the world. This information includes the human casualties and fatalities, the level of coordination and expertise, the process targeted, and how severely that process was affected. Groups and individuals committing such violent acts are usually associated with a Violent Extremist Organization (VEO) [5]. The purpose of this study is to investigate the possibility of recognizing the responsible VEO from the information available about the violent act. We explore different machine learning models appropriate for our sample size. Section 2 describes our dataset and the features used as input to the models. Section 3 explains the machine learning models selected for this study. Section 4 reports and discusses the results, and Sect. 5 provides insight into our results along with future directions.

2 Data and Features

Information about violent acts carried out by VEOs across the world was provided to this study by the Radical and Violent Extremism (RAVE) Laboratory at the University of Nebraska Omaha. They developed this dataset by first relying on an open-source database on characteristics of extreme acts of violence, the Global Terrorism Database (GTD) [6]. Violent acts are included in the GTD if they have a political, social, religious, or economic motive, are intended to coerce, intimidate, or publicize the cause, and/or violate international humanitarian law. Other sources for the dataset include historical accounts in open-source data gathered from academic and government sources, scholarly case studies, public-records databases (e.g., Lexis-Nexis), and primary documents from the VEOs themselves, such as propaganda and websites. The information was gathered by graduate students from a cross-functional research center with expertise in criminology, industrial and organizational psychology, and information science and technology. Prior to data collection, coders received 20 h of training on the nature of VEOs, extremist recruitment, and related manifestations in the context of extremism, as well as on search tactics and information filtering.

Each violent act in our dataset is described by six features: number of casualties, number of fatalities, level of coordination, level of expertise, importance of the process targeted by the violent act, and scope of the impact on that process. All features are numerical. Casualties range from 0 to 1,500 with an average of 7, fatalities range from 0 to 1,180 with an average of 5, and the other four variables range from 1 to 5.

Pattern recognition models require a large number of training samples from each class, usually growing linearly or exponentially with the number of features. Thus, VEOs with fewer than 50 samples were removed, as were records containing unknown values. The final dataset contained 5,661 violent acts by 38 different VEOs from July 21, 1972 to December 31, 2016. The histogram in Fig. 1 shows the number of violent acts carried out by each VEO, and the histogram in Fig. 2 shows the number of violent acts per year.
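As a minimal sketch of this filtering step, assuming the data sits in a pandas DataFrame (the file and column names below are hypothetical; the actual RAVE export may differ):

```python
import pandas as pd

# Hypothetical file and column names; the actual RAVE/GTD export may differ.
df = pd.read_csv("rave_veo_incidents.csv")

features = ["casualties", "fatalities", "coordination",
            "expertise", "target_importance", "impact_scope"]

# Remove records containing unknown (missing) values.
df = df.dropna(subset=features + ["veo"])

# Keep only VEOs with at least 50 recorded violent acts.
counts = df["veo"].value_counts()
df = df[df["veo"].isin(counts[counts >= 50].index)]
```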

Fig. 1. Number of violent acts carried out by each VEO from July 21, 1972 to December 31, 2016.

Fig. 2. Number of violent acts per year from July 21, 1972 to December 31, 2016.

Table 1 shows the correlation coefficients between pairs of variables. The numbers of casualties and fatalities are only weakly correlated with each other and with the other variables. The remaining four variables, however, are partly correlated with one another. The largest correlation coefficient in Table 1 (0.71) means that when the targeted process is important, the impact on that process tends to be large as well. The remaining coefficients indicate, on the one hand, that the level of expertise and the level of coordination are partly correlated, which is not surprising, and on the other hand, that when the levels of coordination and expertise are high, the violent act usually targets an important process and has a large impact on it. To be able to investigate the importance of individual features in the prediction models, we do not apply feature generation methods for now, despite the slight correlation among the last four features. However, all features are normalized to zero mean and unit variance.

Table 1. Correlation coefficient between pairs of features.
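Continuing the sketch above, the correlation matrix (cf. Table 1) and the zero-mean, unit-variance normalization could be computed as follows; scikit-learn's StandardScaler is one of several equivalent options:

```python
from sklearn.preprocessing import StandardScaler

# Pairwise Pearson correlation coefficients (cf. Table 1).
print(df[features].corr().round(2))

# Normalize every feature to zero mean and unit variance.
X = StandardScaler().fit_transform(df[features])
y = df["veo"].values
```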

Figure 3 shows boxplots of the number of casualties, one of the six features, for each of the 38 classes; each class is represented by an individual box. The boxplot is a standardized way of displaying the distribution of data based on the five-number summary: minimum, first quartile, median, third quartile, and maximum. More overlap among boxes means that a feature is less diverse across classes and therefore less helpful in distinguishing among them. The boxplots in Fig. 3 show that the values of this feature are well diversified across classes (little overlap among boxes), so it can be effective in recognizing classes. Figures 4, 5, 6, 7 and 8 show the same type of plot for the other five features. The little overlap among boxes observed in these plots similarly indicates their effectiveness in distinguishing among classes.

Fig. 3. Boxplot of the number of casualties for different groups.

Fig. 4. Boxplot of the number of fatalities for different groups.

Fig. 5. Boxplot of the level of coordination for different groups.

Fig. 6. Boxplot of the level of expertise for different groups.

Fig. 7. Boxplot of the importance of the process targeted by different groups.

Fig. 8. Boxplot of the scope of the impact on processes for different groups.

3 Prediction Models

Decision tree, SVM, least squares, and Perceptron are the four classifiers we applied for prediction. Here we briefly explain each of them.

3.1 Least Squares

The output of the least squares (LS) predictor is $x^T w$, where $w$ is the weight vector extended to include the threshold or intercept ($w_0$) and $x$ is the feature vector extended to include a constant 1. The desired output of the i-th sample is denoted by $y_i$. The weight vector is computed so as to minimize the sum of squared errors between the desired and actual outputs [7], that is:

$$ J(w) = \sum\limits_{i = 1}^{N} {\left( {y_{i} - x_{i}^{T} w} \right)^{2} } $$
(1)

where N is the number of training samples. Minimizing the cost function in Eq. 1 with respect to w results in:

$$ w = \left( {X^{T} X} \right)^{ - 1} X^{T} y $$
(2)

where X is an N × (l + 1) matrix whose rows are the feature vectors with an additional 1, l is the number of features, and y is a vector consisting of the corresponding desired responses:

$$ X = \begin{bmatrix} x_{1}^{T} \\ x_{2}^{T} \\ \vdots \\ x_{N}^{T} \end{bmatrix} = \begin{bmatrix} x_{11} & x_{12} & \ldots & x_{1l} & 1 \\ x_{21} & x_{22} & \ldots & x_{2l} & 1 \\ \vdots & \vdots & \ddots & \vdots & \vdots \\ x_{N1} & x_{N2} & \ldots & x_{Nl} & 1 \end{bmatrix} \quad \text{and} \quad y = \begin{bmatrix} y_{1} \\ y_{2} \\ \vdots \\ y_{N} \end{bmatrix} $$
(3)
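A minimal NumPy sketch of the closed-form solution in Eq. 2, using a least-squares solver rather than the explicit inverse for numerical stability (for our multi-class problem such a predictor would be applied in a one-vs-rest fashion, which the equations above do not spell out):

```python
import numpy as np

def ls_fit(X, y):
    """Solve Eq. 2, w = (X^T X)^(-1) X^T y, with the constant 1 appended."""
    Xe = np.hstack([X, np.ones((len(X), 1))])   # extended feature vectors
    w, *_ = np.linalg.lstsq(Xe, y, rcond=None)  # numerically stable solve
    return w

def ls_predict(X, w):
    Xe = np.hstack([X, np.ones((len(X), 1))])
    return Xe @ w                               # x^T w for every sample
```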

3.2 Perceptron

The perceptron cost function is defined as [8]:

$$ J(w) = \sum\limits_{i = 1}^{N} {y_{i} w^{T} x_{i} }, \quad y_{i} = \begin{cases} +1 & \text{if } w^{T} x_{i} > 0 \text{ but } x_{i} \in \omega_{2} \\ -1 & \text{if } w^{T} x_{i} < 0 \text{ but } x_{i} \in \omega_{1} \\ \;\;0 & \text{if } w^{T} x_{i} > 0 \text{ and } x_{i} \in \omega_{1} \\ \;\;0 & \text{if } w^{T} x_{i} < 0 \text{ and } x_{i} \in \omega_{2} \end{cases} $$
(4)

We can iteratively find the weight vector that minimizes the perceptron cost function using the gradient descent scheme [8, 9]:

$$ w_{t + 1} = w_{t} + \Delta w_{t} = w_{t} - \alpha \left. \frac{\partial J(w)}{\partial w} \right|_{w = w_{t}} = w_{t} - \alpha \sum\limits_{i = 1}^{N} {y_{i} x_{i} } $$
(5)

where $w_t$ is the weight vector estimate at the t-th iteration and $\alpha$ is the training rate, a small positive number.
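A sketch of this training loop, assuming binary labels in {+1, −1} (class ω1 mapped to +1) and feature vectors already extended with a constant 1:

```python
import numpy as np

def perceptron_train(X, labels, alpha=0.01, epochs=100):
    """Minimize the perceptron cost (Eq. 4) by gradient descent (Eq. 5)."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels, dtype=float)
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        # y_i = -label_i for misclassified samples, 0 otherwise (Eq. 4)
        misclassified = np.sign(X @ w) != labels
        if not misclassified.any():
            break                                  # cost has reached zero
        y = np.where(misclassified, -labels, 0.0)
        w -= alpha * (y[:, None] * X).sum(axis=0)  # update of Eq. 5
    return w
```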

3.3 SVM

SVM [10,11,12] maximizes the margin around the hyperplane separating the two classes. If the two classes are not linearly separable, it is not possible to find an empty band separating them, and each training sample satisfies one of the following constraints:

  • it falls outside the band and is correctly classified, i.e., $y_i(w^T x_i + w_0) > 1$,

  • it falls inside the band and is correctly classified, i.e., $0 \le y_i(w^T x_i + w_0) \le 1$, or

  • it is misclassified, i.e., $y_i(w^T x_i + w_0) < 0$.

We can summarize the three constraints above in one by introducing slack variables ($\xi_i$) [10]:

$$ y_{i} \left( w^{T} x_{i} + w_{0} \right) \ge 1 - \xi_{i}, \quad \begin{cases} \xi_{i} = 0 & \text{if } x_{i} \text{ is outside the band and correctly classified} \\ 0 < \xi_{i} \le 1 & \text{if } x_{i} \text{ is inside the band and correctly classified} \\ \xi_{i} > 1 & \text{if } x_{i} \text{ is misclassified} \end{cases} $$
(6)

The optimization task is now to maximize the margin (minimize the norm) while minimizing the slack variables [10]. The mathematical formulation for finding w and w0 of the hyperplane follows:

$$ \begin{aligned} & \text{minimize} && J(w, w_{0}, \xi) = \frac{1}{2}\left\| w \right\|^{2} + C\sum\limits_{i = 1}^{N} {\xi_{i}} = \frac{1}{2} w^{T} w + C\sum\limits_{i = 1}^{N} {\xi_{i}} && (7) \\ & \text{subject to} && y_{i} \left( w^{T} x_{i} + w_{0} \right) \ge 1 - \xi_{i}, \quad i = 1, 2, \ldots, N && (8) \\ & && \xi_{i} \ge 0, \quad i = 1, 2, \ldots, N && (9) \end{aligned} $$

The smoothing parameter C is a positive user-defined constant that controls the trade-off between the two competing terms in the cost function.
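As a sketch, a linear soft-margin SVM with C tuned by internal cross-validation; scikit-learn's LinearSVC handles our multi-class setting one-vs-rest, and the grid of C values below is illustrative, not the one used in the paper:

```python
from sklearn.model_selection import GridSearchCV
from sklearn.svm import LinearSVC

# C from Eq. 7 trades margin width against the total slack penalty.
grid = GridSearchCV(LinearSVC(), param_grid={"C": [0.01, 0.1, 1, 10, 100]}, cv=5)
grid.fit(X, y)
svm = grid.best_estimator_
```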

3.4 Decision Tree

Ordinary binary decision trees (OBDTs) split the feature space into hyperrectangles with sides parallel to the axes [13]. Nodes in an OBDT pose binary questions whose answers are either yes or no, and the answers determine the path to a leaf, which is equivalent to a response. Questions at nodes are of the form "is $x_k \le \alpha$?", where $x_k$ is the k-th feature and $\alpha$ is a threshold. To predict the response of a new, unlabeled sample, one answers the question at each node and traverses to the left or right child based on the answer until a leaf is reached.

The best question to ask at a node is the one which maximizes the impurity decrease (ΔI) [13]:

$$ \Delta I = I - \frac{{N_{Y} }}{N}I_{Y} - \frac{{N_{N} }}{N}I_{N} $$
(10)

where $I$ is the impurity of the ancestor node, $N$ is the number of training samples in the ancestor node, $N_Y$ and $N_N$ are the numbers of training samples in the descendant nodes corresponding to the answers "yes" and "no" respectively, and $I_Y$ and $I_N$ are the impurities of those descendant nodes. The entropy of the training samples at a node, given in Eq. 11, is a common definition of node impurity in classification tasks ($I_{classification}$) [13], where $M$ is the number of classes and $N(\omega_i)$ is the number of training samples from class $\omega_i$ at the node.

$$ I_{classification} = - \sum\limits_{i = 1}^{M} {\frac{{N\left( {\omega_{i} } \right)}}{N}log_{2} \frac{{N\left( {\omega_{i} } \right)}}{N}} $$
(11)

A node is declared a leaf if the maximum impurity decrease (ΔImax) over all candidate questions at that node is less than a user-defined threshold, although alternative stopping conditions have been used in the literature [13, 14]. For classification, the majority rule is commonly used to determine the response at a leaf [13].

The relative importance of the k-th feature (R(xk)) is the sum of the impurity decrease (∆I) over all internal nodes (υi, i = 1,…, J) for which xk was chosen as the splitting variable:

$$ R\left( {x_{k} } \right) = \sum\limits_{i = 1}^{J} {\Delta I\left( {\upsilon_{i} = x_{k} } \right)} $$
(12)
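A sketch with scikit-learn's CART implementation: entropy as the impurity (Eq. 11), min_impurity_decrease playing the role of the leaf threshold on ΔImax (the 0.055 below is the optimized value reported in Sect. 4), and feature_importances_ corresponding, up to normalization, to R(x_k) in Eq. 12:

```python
from sklearn.tree import DecisionTreeClassifier

tree = DecisionTreeClassifier(criterion="entropy", min_impurity_decrease=0.055)
tree.fit(X, y)

# Normalized impurity-decrease importances, cf. Eq. 12 and Tables 3 and 4.
for name, score in zip(features, tree.feature_importances_):
    print(f"{name}: {score:.2f}")
```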

4 Results

A 10-fold cross-validation of the decision tree resulted in a generalization accuracy of 20%; the threshold on the maximum impurity decrease (ΔImax) was optimized to 0.055 using an internal cross-validation among the training samples. This is a major improvement over a random classifier, whose accuracy would be 2.6% (1/38). Table 2 shows the recall and precision for each class; larger values of both metrics indicate higher accuracy. A small recall means the classifier fails to identify many of the true samples of a class, while a small precision means that only a small proportion of the samples the classifier assigns to a class truly belong to it. In other words, a small recall means the classifier is unfairly stingy in assigning samples to a class, while a small precision means it is unfairly generous. A zero value for both metrics means that the classifier assigned no sample to that class. Rows shown in bold in Table 2 highlight the classes for which the classifier performs more accurately, and rows marked with a * at the beginning of the class name indicate classes for which it performs less accurately. Classes 4, 9, and 13 achieve the highest recall and precision, which means the features selected in this study are well capable of distinguishing these classes from the others. This can also be inferred from the boxplots in Figs. 3, 4, 5, 6, 7 and 8.

Table 2. Recall and precision for different classes obtained from 10-fold cross validation of decision tree.
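The per-class recall and precision of Table 2 can be reproduced, in sketch form, from out-of-fold predictions, reusing the decision tree defined in the sketch of Sect. 3.4:

```python
from sklearn.metrics import classification_report
from sklearn.model_selection import cross_val_predict

# Out-of-fold predictions from 10-fold cross-validation (cf. Table 2).
y_pred = cross_val_predict(tree, X, y, cv=10)
print(classification_report(y, y_pred, zero_division=0))
```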

There is no meaningful correlation between recall and precision on the one hand and class size (see Fig. 1) on the other, which has also been observed by other researchers [15]. Inspection of the confusion matrix, however, shows that classes heavily represented among the training samples (e.g., classes 22 and 23) have a systematic tendency to absorb samples from other classes during classification; their precision nevertheless stays at a reasonable level because of their large size. For example, out of the 5,661 test samples, 826 were assigned to class 22 by the classifier while only 205 of them truly belong to this class, yet its recall and precision are 0.26 and 0.25, around the average over all classes. Similarly, 371 samples were assigned to class 23, though only 88 of them truly belong to it. The low recall for large classes such as 22 and 23 is surprising, because it means many samples that truly belong to these classes (around 75% of them) are wrongly assigned to other classes. This is mainly due to the large number of classes (38) and the overlap among classes in the feature space. While more training samples might help, finding features that are stronger in distinguishing among classes would certainly improve the accuracy. Moreover, a larger sample size would allow the application of more nonlinear classifiers, such as the multi-layer Perceptron, and kernel approaches, such as kernel SVM and non-parametric Bayesian methods.

Table 3 shows how useful each feature has been in developing the decision tree; this can help in filtering out useless features or combining less useful ones. Interestingly, the number of casualties by itself makes a 41% contribution to developing the tree. The numbers of casualties and fatalities together form 64% of the decision nodes in the tree, which means they are strong predictors of our classes, whereas the other four features together contribute only 36%.

Table 3. Relative importance of different features in developing the decision tree.

Based on the correlation among the last four variables (see Table 1) and their relatively lower importance in the decision tree classifier, we combined them into one feature using principal component analysis (PCA). Table 4 shows the relative importance of the new features in developing a decision tree. The new PCA-based feature is almost as good as the four original features combined.

Table 4. Relative importance of different features, after combining the last four features using PCA, in developing the decision tree.
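A sketch of this combination step, assuming the first two columns of the standardized feature matrix hold casualties and fatalities and the remaining four hold the correlated ordinal features:

```python
import numpy as np
from sklearn.decomposition import PCA

# Replace the four correlated features with their first principal component.
pc1 = PCA(n_components=1).fit_transform(X[:, 2:])
X_reduced = np.hstack([X[:, :2], pc1])   # casualties, fatalities, PC1
```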

We used the numbers of casualties and fatalities and the PCA-based feature to measure the accuracy of five different classifiers using 10-fold cross-validation. These accuracies are reported in Table 5. The hyper-parameters of each classifier were optimized using cross-validation among the training samples. While all classifiers outperform the random classifier, SVM is the least accurate and the decision tree is the most accurate. The fact that the only non-linear classifier, the decision tree, outperforms all the linear classifiers (SVM, least squares, and Perceptron) points to the complexity of the class distributions in the feature space, which are best separated by non-linear classifiers.

Table 5. Generalization accuracy of different classifiers obtained from 10-fold cross validation.
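A rough analogue of the Table 5 comparison, reusing X_reduced from the PCA sketch above (hyper-parameter tuning omitted; RidgeClassifier with alpha=0 reduces to the least-squares classifier):

```python
from sklearn.linear_model import Perceptron, RidgeClassifier
from sklearn.model_selection import cross_val_score
from sklearn.svm import LinearSVC
from sklearn.tree import DecisionTreeClassifier

# 10-fold cross-validated accuracy for each classifier (cf. Table 5).
for clf in [DecisionTreeClassifier(criterion="entropy"),
            LinearSVC(), RidgeClassifier(alpha=0.0), Perceptron()]:
    acc = cross_val_score(clf, X_reduced, y, cv=10).mean()
    print(f"{type(clf).__name__}: {acc:.3f}")
```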

5 Conclusions

This study was a first attempt to predict the VEO responsible for acts of violence based on human casualties and fatalities, level of coordination and expertise, importance of the targeted process, and the extent of the impact on that process. The first two features proved the best predictors, while the last four showed slight correlation with one another and less predictive power. While the decision tree, a non-linear classifier, outperformed the other, linear, classifiers, its accuracy did not reach above 20% in identifying the correct group among 38 groups. The inability of the classifiers to reach higher accuracies results from three shortcomings of our dataset: (a) the features are not predictive enough of the classes, (b) the training sample sizes of the different classes are imbalanced, and (c) the small number of training samples does not allow the application and proper training of non-linear classifiers. In future work, we intend to investigate additional predictors, including weapon type, economic damage, location, and time [16]; these features are mostly unknown at the time of this writing. Additionally, we plan to apply more flexible classifiers, such as deep networks [17] and kernel methods, as our dataset expands.