
1 Introduction

Pattern recognition is becoming ever more important, mainly due to the increasing need of different applications to extract meaningful information from their data. The problem becomes harder as data grow fast in both size and complexity. Humans have an innate ability to recognize patterns, but this ability is rather difficult to replicate on computers. Several techniques have been developed to address this issue, the most popular being Artificial Neural Networks (ANNs) [6] and Support Vector Machines (SVMs) [3]. Recently, a new framework for the design of graph-based classifiers, named Optimum-Path Forest (OPF), has been introduced in the scientific community. This framework comprises supervised [10,11,12], semi-supervised [1, 2] and unsupervised [14] learning algorithms. Among its main advantages, some OPF variants are parameterless and make no assumptions about the separability of samples [7].

We refer to OPF as a single classifier in this paper, but it is in fact a framework for the design of graph-based classifiers. This means the user can design his/her own optimum-path forest-driven classifier by configuring three main modules: (i) the adjacency relation, (ii) the methodology to estimate prototypes, and (iii) the path-cost function. Since OPF models pattern recognition as a graph partition task, it requires an adjacency relation to connect nodes (i.e. feature vectors extracted from dataset samples). Further, OPF rules a competition process among prototype samples, which are the most representative samples of each class; therefore, a careful procedure to estimate them is advisable. Finally, in order to conquer samples, prototypes must offer them rewards, which are encoded by the path-cost function. In this paper, we consider the OPF classifier proposed by Papa et al. [11, 12], which employs a fully connected graph and a path-cost function that computes the maximum arc weight along a path. For the sake of clarity, we shall refer to this version simply as OPF.

Although OPF has obtained recognition results comparable to, or even more accurate than, SVMs and ANNs in a number of different applications, and has usually been much faster to train, it can still be time-consuming for very large datasets. Even though OPF is parameterless, its training phase takes \(\theta (n^2)\) operations, where n stands for the number of training samples. Strictly speaking, this is not that bad, since SVMs usually require a considerably higher computational load. However, there is still room for improvement, and that is the main contribution of this paper: to introduce a different data structure that allows OPF parallelization. As a matter of fact, the proposed approach produces results equivalent to those obtained by the original OPF classifier in terms of accuracy, while being up to five times faster on simple personal-computer hardware.

The remainder of this paper is organized as follows. Section 2 reviews OPF theoretical background, and Sect. 3 presents the modifications that led to the new parallel training algorithm. Section 4 discusses the experiments, and Sect. 5 states conclusions and future works.

2 Supervised Classification Based on Optimum-Path Forest

Let \(\mathcal{Z}\) be a dataset whose correct labels are given by a function \(\lambda (x)\), for each sample \(x\in \mathcal{Z}\). Thus, \(\mathcal Z\) can be partitioned into training (\(\mathcal{Z}_1\)), validation (\(\mathcal{Z}_2\)) and testing (\(\mathcal{Z}_3\)) sets. Also, we can derive a graph \(\mathcal{G}_1=(\mathcal{V}_1,\mathcal{A}_1)\) from the training set, where \(\mathcal{A}_1\) stands for an adjacency relation known as the complete graph, i.e. a fully connected graph in which each pair of samples in \(\mathcal{Z}_1\) is connected by an edge. Additionally, each node \(\mathbf v ^1_i\in \mathcal{V}_1\) concerns the feature vector extracted from sample \(x^1_i\in \mathcal{Z}_1\). All arcs are weighted by the distance between their corresponding graph nodes. A similar definition can also be applied to the validation and test sets.

The OPF proposed by Papa et al. [12] comprises two distinct phases: (i) training and (ii) testing. The former step is based upon \(\mathcal{Z}_1\), while the test phase aims at assessing the effectiveness of the classifier learned during the previous phase over the testing set \(\mathcal{Z}_3\). Additionally, a learning algorithm was proposed to improve the quality of the samples in \(\mathcal{Z}_1\) by means of an additional set \(\mathcal{Z}_2\). Roughly speaking, the idea is to train an OPF classifier over \(\mathcal{Z}_1\) and then classify \(\mathcal{Z}_2\). Further, we replace non-prototype samples in \(\mathcal{Z}_1\) by misclassified samples in \(\mathcal{Z}_2\), and the very same process is executed once again (i.e. training over \(\mathcal{Z}_1\) and classification over \(\mathcal{Z}_2\)). The above procedure is repeated until the accuracy between consecutive iterations does not change.

2.1 Training

The training step aims at building the optimum-path forest upon the graph \(\mathcal{G}_1\) derived from \(\mathcal{Z}_1\). Essentially, the forest is the result of a competition process among prototype samples that ends up partitioning \(\mathcal{G}_1\). Let \(\mathcal{S}\subseteq \mathcal{Z}_1\) be a set of prototypes, which can be chosen at random or using some other specific heuristic. Papa et al. [12] proposed to find the set of prototypes that minimizes the classification error over \(\mathcal{Z}_1\), denoted \(\mathcal{S}^*\subseteq \mathcal{Z}_1\). Such a set can be found by computing a Minimum Spanning Tree \(\mathcal M\) from \(\mathcal{G}_1\), and then marking as prototypes each pair of samples \((x_1, x_2)\), adjacent in \(\mathcal M\), such that \(\lambda (x_1)\ne \lambda (x_2)\).
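The MST-based prototype estimation above can be sketched as follows (an illustrative Python sketch under our own naming, not the LibOPF implementation): Prim's algorithm grows the MST, and both endpoints of every tree edge connecting samples with different labels are marked as prototypes.

```python
# Illustrative sketch of prototype estimation via an MST (not LibOPF code):
# both endpoints of any MST edge linking different classes become prototypes.
import math

def find_prototypes(points, labels):
    n = len(points)
    dist = lambda a, b: math.dist(points[a], points[b])
    in_tree = [False] * n
    cost = [math.inf] * n   # cheapest known connection to the growing tree
    pred = [-1] * n         # MST predecessor of each node
    cost[0] = 0.0
    prototypes = set()
    for _ in range(n):
        # Prim's step: pick the cheapest node not yet in the tree
        u = min((i for i in range(n) if not in_tree[i]), key=lambda i: cost[i])
        in_tree[u] = True
        if pred[u] != -1 and labels[u] != labels[pred[u]]:
            prototypes.update({u, pred[u]})   # boundary pair -> prototypes
        for v in range(n):
            if not in_tree[v] and dist(u, v) < cost[v]:
                cost[v] = dist(u, v)
                pred[v] = u
    return prototypes
```

For instance, for four collinear points forming two well-separated classes, only the two samples facing the class boundary are selected.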

Further, the competition process takes place in \(\mathcal{Z}_1\), where nodes in \(\mathcal{S}^*\) try to conquer the remaining samples in \(\mathcal{Z}_1\setminus \mathcal{S}^*\). Basically, such a process is based on a reward-compensation procedure, in which the prototype offering the minimum cost is the one that conquers the sample. The reward is computed based on a path-cost function, which should be smooth according to Falcão et al. [5]. Therefore, Papa et al. [12] proposed to use \(f_{max}\) as the path-cost function, defined as follows:

$$\begin{aligned} f_{max}(\langle \mathbf s \rangle ) = {\left\{ \begin{array}{ll} 0 &{} \text {if } \mathbf s \in \mathcal{S}^*,\\ +\infty &{} \text {otherwise,} \end{array}\right. } \ \ \ \ \ f_{max}(\pi _s \cdot (\mathbf s ,\mathbf t )) = \max \{f_{max}(\pi _s ), d(\mathbf s ,\mathbf t )\}, \end{aligned}$$
(1)

where \(\pi _s \cdot (\mathbf s ,\mathbf t )\) stands for the concatenation of path \(\pi _s\) and arc \((\mathbf s ,\mathbf t )\in \mathcal{A}_1\), and \(d(\mathbf s ,\mathbf t )\) denotes the distance between nodes \(\mathbf s \) and \(\mathbf t \). Also, a path \(\pi _s\) is a sequence of adjacent and distinct nodes in \(\mathcal{G}_1\) with terminus at node \(\mathbf s \in \mathcal{Z}_1\).

In short, by computing Eq. 1 for every sample \(\mathbf s \in \mathcal{Z}_1\), we obtain a collection of optimum-path trees (OPTs) rooted at \(\mathcal{S}^*\), which together form an optimum-path forest. If a sample belongs to a given OPT, it is more strongly connected to that tree than to any other in \(\mathcal{G}_1\).
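The competition process can be sketched with a minimal sequential Python snippet (our own illustrative code, not the LibOPF implementation): prototypes start with cost 0, all other samples with infinity, and samples are conquered in nondecreasing order of optimum-path cost under \(f_{max}\).

```python
# Minimal sequential sketch of the OPF training competition (illustrative only).
import math

def opf_train(points, labels, prototypes):
    """Prototypes start with cost 0, everything else with infinity; samples
    leave the priority queue in nondecreasing order of optimum-path cost."""
    n = len(points)
    cost = [0.0 if i in prototypes else math.inf for i in range(n)]
    out = [labels[i] if i in prototypes else None for i in range(n)]
    done = [False] * n
    for _ in range(n):
        # next sample to leave the queue: minimum-cost unprocessed node
        s = min((i for i in range(n) if not done[i]), key=lambda i: cost[i])
        done[s] = True
        for t in range(n):                     # complete graph: try every arc
            if not done[t]:
                c = max(cost[s], math.dist(points[s], points[t]))  # f_max
                if c < cost[t]:
                    cost[t], out[t] = c, out[s]   # s conquers t
    return cost, out
```

The returned costs and propagated labels are exactly what the test phase needs, as described next.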

2.2 Testing

In the testing step, each sample \(\mathbf t \in \mathcal{Z}_3\) is classified individually as follows: \(\mathbf t \) is connected to all training nodes of the optimum-path forest learned in the training phase, and we evaluate the node \(\mathbf s ^*\in \mathcal{Z}_1\) that conquers \(\mathbf t \), i.e. the one that satisfies the following equation:

$$\begin{aligned} C(\mathbf t ) = \min _{\forall \mathbf s \in \mathcal{Z}_1}\{\max \{C(\mathbf s ), d(\mathbf s ,\mathbf t )\}\}, \end{aligned}$$
(2)

where \(C(\mathbf s )\) stands for the optimum-path cost of \(\mathbf s \) computed during training.

The classification step simply assigns the label of \(\mathbf s ^*\) as the label of \(\mathbf t \). Notice that a similar procedure to classify \(\mathcal{Z}_2\) can be employed, too.
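The test phase can be sketched as follows (an illustrative Python snippet with our own naming, not the LibOPF implementation): the test sample is connected to every training node, and it receives the label of the node minimizing the maximum between the node's training cost and the connecting arc weight.

```python
# Illustrative sketch of OPF classification (Eq. 2), not LibOPF code.
import math

def opf_classify(point, train_points, train_costs, train_labels):
    """Connect the test sample to every training node and take the label of
    the node s* minimizing max(C(s), d(s, t))."""
    best_cost, best_label = math.inf, None
    for s, p in enumerate(train_points):
        c = max(train_costs[s], math.dist(p, point))
        if c < best_cost:
            best_cost, best_label = c, train_labels[s]
    return best_label
```

Note that classification visits all training nodes, which motivates the spatial indexing discussed in the conclusions.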

3 Parallel-Driven Optimum-Path Forest Training

In this section, we present the proposed approach based on parallel programming to speed up the naïve OPF training algorithm, hereinafter called POPF. Since standard OPF makes use of a priority queue implemented as a binary heap, it does not support concurrent accesses. Therefore, POPF uses a simpler data structure along with a slightly different (parallel) training process, which is based on three main observations, as discussed below.

The first observation concerns the optimum-path computation process for each \(\mathbf t \in \mathcal{Z}_1\), which is independent of the other samples. On the other hand, costs need to be updated in a data structure so that a new sample can be selected in the next iteration, in order to expand the optimum paths computed so far. For this purpose, LibOPF [13] uses a binary heap, as suggested in [5]. However, such a data structure is not prepared for concurrent updates, i.e. if one attempts to compute \(f_{max}(\mathbf t )\) for each \(\mathbf t \in \mathcal{Z}_1\) in parallel, a mutex would be required for each update in the heap, and this approach would not scale well as the number of threads increases. Furthermore, this data structure introduces a \(\mathcal{O}(\log (n))\) overhead in each update, where \(n=|\mathcal{Z}_1|\).

The second observation concerns the graph, which is fully connected, implying that, at each iteration, all nodes need to be explored. Therefore, the computation of \(f_{max}\) for all \(s \in \mathcal{Z}_1\) takes \(\theta (n^2)\) operations in total. In order to cope with such quadratic complexity, we can implement the priority queue as a standard array, exploring the set of nodes in parallel and performing a parallel linear search. At each iteration, each thread \(\delta _i, \forall i = 1, \ldots , m\), explores a subset \(\mathcal{W}_{(s,i)}\), such that \(\mathcal{W}_s = \mathcal{W}_{(s,1)} \cup \cdots \cup \mathcal{W}_{(s,m)}\) is the set of neighbors of s, thus performing two tasks: (1) updating the cost of each \(t\in \mathcal{W}_{(s,i)}\) according to \(f_{max}\) using arc \((\mathbf s ,\mathbf t )\), and (2) computing the node \(s^{(*,i)} \in \mathcal{W}_{(s,i)}\) with minimum cost in \(\mathcal{W}_{(s,i)}\). Afterwards, the main thread finds the node \(s^*\) with minimum cost among all \(s^{(*,i)},\forall i=1,\cdots ,m\). Such node \(s^*\) will be the first to leave the priority queue in the next iteration. Therefore, by using m threads, the overall time complexity of the training algorithm is reduced to \(\theta (n^2/m)\).
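The per-thread partition and min-reduction can be sketched as follows. This is an illustrative Python snippet with our own naming; the paper's actual implementation uses C with OpenMP, and Python threads are used here only to show the structure of the parallel step.

```python
# Illustrative sketch of the parallel linear search over the array-based
# priority queue (the real implementation is C/OpenMP, not Python threads).
from concurrent.futures import ThreadPoolExecutor
import math

def parallel_argmin(cost, done, m=4):
    """Each of m workers scans one slice of the cost array; the main thread
    then reduces the m partial winners to the global minimum-cost node."""
    n = len(cost)
    chunks = [range(i, n, m) for i in range(m)]   # static partition of nodes

    def local_min(chunk):
        best, best_c = -1, math.inf
        for t in chunk:
            if not done[t] and cost[t] < best_c:
                best, best_c = t, cost[t]
        return best, best_c

    with ThreadPoolExecutor(max_workers=m) as pool:
        partial = list(pool.map(local_min, chunks))
    return min(partial, key=lambda x: x[1])[0]    # reduction on the main thread
```

The cost updates for each slice (task 1 in the text) follow the same partitioning and can run in the same parallel region.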

Finally, the third observation is related to Prim's algorithm, which is used to compute the Minimum Spanning Tree over \(\mathcal{Z}_1\). As a matter of fact, we can use the very same OPF algorithm with a different path-cost function to compute the MST. Therefore, the aforementioned ideas can also be applied to the MST computation, taking advantage of parallelism in all steps of the training process.
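This interchangeability of path-cost functions can be illustrated with a small sketch (our own Python code, not the LibOPF implementation): plugging \(f_{max}\) into a generic Dijkstra-like competition loop yields OPF training, while keeping only the arc weight and discarding the accumulated cost yields Prim's algorithm.

```python
# Illustrative sketch: the same competition loop computes either the OPF
# (path cost f_max) or Prim's MST (path cost = arc weight alone).
import math

def competition(points, seeds, path_cost):
    """Generic loop; path_cost(cost_s, d) extends a path ending at s by an
    arc of weight d. Returns each node's predecessor in the resulting forest."""
    n = len(points)
    cost = [0.0 if i in seeds else math.inf for i in range(n)]
    pred = [-1] * n
    done = [False] * n
    for _ in range(n):
        s = min((i for i in range(n) if not done[i]), key=lambda i: cost[i])
        done[s] = True
        for t in range(n):
            if not done[t]:
                c = path_cost(cost[s], math.dist(points[s], points[t]))
                if c < cost[t]:
                    cost[t], pred[t] = c, s
    return pred

# f_max gives OPF training; ignoring the accumulated cost gives Prim's MST:
opf_pred  = lambda pts: competition(pts, {0}, lambda cs, d: max(cs, d))
prim_pred = lambda pts: competition(pts, {0}, lambda cs, d: d)
```

Since both variants share the same loop structure, the parallel linear search described above accelerates the MST computation as well.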

Algorithm 1 summarizes the ideas presented in this section. Note that even though parallelization takes place only during the search for the best predecessor, it is better to start all threads once at the beginning of the algorithm. The proposed approach was efficiently implemented using OpenMP [4], a well-known API for shared-memory parallel programming. The OpenMP pragmas used in the implementation are included as comments.

Algorithm 1. POPF training algorithm.

4 Experiments and Results

In this section, we present the methodology used to assess the robustness of the proposed approach, as well as the experimental results. Table 1 presents the description of the datasets used in this work, which were taken from the UCI Machine Learning Repository [8]. We intentionally chose datasets with numeric features, to avoid extra pre-processing, and with different orders of magnitude, to better assess the scalability of our approach.

Table 1. Description of the datasets and percentages used for \(\mathcal{Z}_1\), \(\mathcal{Z}_2\) and \(\mathcal{Z}_3\).

We compared POPF against the naïve OPF using a computer equipped with a 3.1 GHz Intel Core i7 processor and 8 GB of RAM, running Linux 3.16. The programs were compiled with GCC 5.0, which implements the OpenMP 4 specification. Also, we varied the number of threads used by POPF according to the maximum concurrency allowed by the processor. For each experiment, we performed a hold-out partition of the dataset over 10 executions, in order to compute mean accuracy and computational load.

Table 2 presents the results regarding execution time, number of learning iterations and classification accuracy for \(\mathcal{Z}_3\) – as defined by Papa et al. in [12] – where POPF-m stands for POPF executed with m threads. Clearly, POPF maintained OPF accuracy for every number of threads, meaning that the classifier obtained through the proposed approach preserves the properties of the original one. Only a slight variation concerning the MiniBooNE dataset can be observed. This is explained by the fact that same-cost samples may be stored in a different order in \(\mathcal{Z}_1'\), which changes the evaluation order during the classification of \(\mathcal{Z}_3\) and may assign a different class when ties occur.

Table 2. Comparison against OPF and POPF with different number of threads.

In Table 2 we also include parallel performance measures: speedup (S) – measuring gain in running time – and efficiency (E) – measuring thread utilization. They are defined as follows [9]:

$$\begin{aligned} S = \frac{T_{s}}{T_{p}} \ \ \ \ \ \text {and} \ \ \ \ \ E = \frac{S}{m}=\frac{T_{s}}{m \cdot T_{p}}, \end{aligned}$$
(3)

where \(T_s\) and \(T_p\) stand for the execution time of traditional and parallel OPF, respectively, and m denotes the number of threads.
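Equation 3 can be worked through with a small example (the numbers below are illustrative, not taken from Table 2):

```python
# Worked example of Eq. 3; the figures are illustrative only.
def speedup(T_s, T_p):
    """Gain in running time of the parallel version over the sequential one."""
    return T_s / T_p

def efficiency(T_s, T_p, m):
    """Thread utilization: speedup divided by the number of threads."""
    return speedup(T_s, T_p) / m

# e.g. a sequential run of 100 s reduced to 20 s with 8 threads gives
# S = 5.0 and E = 0.625, i.e. 62.5% thread utilization.
```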

We can observe that the maximum speedup is obtained using 8 threads, being about five times faster than traditional OPF. Furthermore, the speedup improves as the size of the dataset increases. Another noteworthy observation is that, for the largest dataset, the efficiency obtained with 2 threads was greater than 100%. This confirms that POPF is considerably more efficient than traditional OPF, not only because of the parallel implementation, but also due to its asymptotic improvement. Figure 1 presents charts for S and E.

Regarding the overall parallel efficiency, it is important to stress the efficiency results obtained for both 4 and 8 threads. On one hand, we obtained an efficiency between 77% and 87% considering 4 threads, which is an outstanding result for any parallel implementation. On the other hand, the efficiency considering 8 threads was between 57% and 66%, which is a good thread utilization considering the fact that the processor used has only 4 physical cores (implementing 8 threads with HyperThreading\(^{\textregistered }\) technology).

Fig. 1. Parallel performance measures for POPF learning process: (a) speedup (S); and (b) efficiency (E).

5 Conclusions and Future Work

In this work, we parallelized the OPF training algorithm and demonstrated its efficiency in classification tasks. The new approach is based on three important observations: (i) the optimum-path computation processes for the training samples are independent of each other; (ii) the fully connected training graph allows us to replace the binary heap with a simple array (suitable for parallelization); and (iii) the computation of the MST during the training phase can also be performed in parallel. These changes reduce the asymptotic complexity of the implementation and also make the parallelization feasible.

We have observed that POPF preserves the accuracy of the original algorithm, while performing the learning phase at least five times faster on commodity hardware. Thus, an OPF with hundreds of thousands of nodes can be computed in less than an hour. As such, POPF makes it possible to classify very large datasets under timing restrictions, and it brings closer the possibility of nearly real-time classification of reasonably sized datasets even on a single computer or mobile device.

However, such a real-time implementation still requires improvements in the classification algorithm. Thus, we are considering the use of spatial data structures to index the optimum-path forest obtained during training, so that fewer nodes are considered during classification, thus improving its running time.