Parallel pattern classification utilizing GPU-based kernelized Slackmin algorithm
Introduction
Nowadays, there is growing interest in developing smart devices and advanced services that incorporate some form of intelligence. Intelligent capabilities in human–machine interaction are embodied by incorporating machine learning techniques and algorithms that enable a device to interact with its surrounding environment.
A crucial part of a modern intelligent system is the classification module (classifier), which is responsible for categorizing (classifying) unknown data into specific predefined categories (classes). A supervised classifier is initially trained on a set of data with known class labels (training data) and is then used to classify data with unknown class information (testing data). Although an increase in the size of the training set does not always improve classification performance, it is commonly accepted that the more data used for training, the greater the opportunity a classifier has to learn.
With the rapid growth of data exchanged and transmitted over the internet, in conjunction with the widespread use of digital devices in everyday life, there is a need to process very large amounts of data. In the fields of machine learning and computational intelligence, the classification of massive data has motivated both novel classification schemes and new high-performance computing architectures. In this direction, Support Vector Machines (SVMs) [6], Deep Belief Networks (DBNs) [13] and Convolutional Neural Networks (CNNs) [12], among others, have been proposed as classification methods capable of handling large datasets, although their convergence rate can be quite low. For this reason, many parallel approaches [29], [32], e.g., GPU [1], [19], [25], MPI/OpenMP [31], and MapReduce [35], have been proposed in recent years to accelerate the training stage of these methods.
Recently, the Slackmin algorithm and several variants were proposed [11], [15], [33], [34] in an attempt to develop a simple yet efficient classification model for medium-scale data. The algorithm has shown competitive classification performance with low complexity and a fast convergence rate. To improve the medium-scale data processing of the Slackmin algorithm, the authors proposed [24] a parallel implementation of its linear case, called cuLSlackmin, using the CUDA (Compute Unified Device Architecture) architecture of a GPU (Graphics Processing Unit).
Although the linear Slackmin algorithm and its parallel version cuLSlackmin have shown satisfactory performance, the kernelized form of Slackmin offers useful properties and enhanced classification capabilities. This work proposes an efficient parallel implementation of the kernelized Slackmin algorithm as a way to overcome the classification deficiencies of the cuLSlackmin algorithm and the low convergence rate of the serial kernelized Slackmin algorithm. The proposed algorithm, called cuKSlackmin, inherits the data representation capabilities of the underlying kernel functions and exploits the high-performance computing capabilities of a GPU. The resulting parallel algorithm takes advantage of the GPU resources to train on large amounts of data, so the only limitation stems from the computing capabilities of the GPU card used.
It is worth noting that the proposed algorithm is limited by the amount of shared memory available on the GPU hardware used. More precisely, the number of support vectors used by the algorithm determines the amount of shared memory it requires, and thus affects not only its accuracy but also its training speed.
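To make this dependency concrete, the sketch below (hypothetical code, not taken from the paper; the names `stageAlphas` and `nSV` are ours) stages one coefficient per support vector in dynamically sized shared memory. The per-block request of `nSV * sizeof(float)` bytes must stay within the device limit reported by `cudaDeviceProp::sharedMemPerBlock`, which is what bounds the usable number of support vectors.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: each block stages one coefficient per support vector
// in dynamically sized shared memory, so the per-block request is
// nSV * sizeof(float) bytes and must fit the card's shared-memory limit.
__global__ void stageAlphas(const float* alpha, float* out, int nSV) {
    extern __shared__ float sAlpha[];              // sized at launch time
    for (int k = threadIdx.x; k < nSV; k += blockDim.x)
        sAlpha[k] = alpha[k];                      // cooperative copy to shared memory
    __syncthreads();
    // ... per-thread work reading sAlpha[] would go here ...
    if (threadIdx.x == 0) out[blockIdx.x] = sAlpha[nSV - 1];
}

int main() {
    int nSV = 4096;                                // 16 KB of shared memory requested
    size_t smem = nSV * sizeof(float);

    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    printf("requested %zu B, device limit %zu B per block\n",
           smem, prop.sharedMemPerBlock);

    float *dAlpha, *dOut;
    cudaMalloc(&dAlpha, smem);
    cudaMalloc(&dOut, sizeof(float));
    cudaMemset(dAlpha, 0, smem);
    // The third launch parameter is the dynamic shared-memory size in bytes;
    // the launch fails if it exceeds the per-block limit of the device.
    stageAlphas<<<1, 256, smem>>>(dAlpha, dOut, nSV);
    cudaDeviceSynchronize();
    cudaFree(dAlpha); cudaFree(dOut);
    return 0;
}
```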
The rest of the paper is organized as follows: Section 2 describes in detail both the linear and the kernelized versions of the Slackmin algorithm. Section 3 highlights the main principles of GPU programming, with emphasis on the GPU implementation of machine learning algorithms. Section 4 presents the proposed GPU-based kernelized Slackmin algorithm, while Section 5 provides a thorough experimental study of its performance. Finally, Section 6 summarizes the main conclusions drawn from the conducted experiments and outlines future work on extending the proposed algorithm to more demanding classification problems.
Section snippets
Slackmin Algorithm
A short description of the Slackmin algorithm along with some basic definitions is provided in this section; more theoretical details can be found in [11], [15].
A binary classification problem is defined as the task of data categorization into two predefined classes. The data is represented by a set of feature vectors and the corresponding class labels as follows:
$$\mathcal{D}=\left\{\left(\mathbf{x}_{i}, y_{i}\right)\right\}_{i=1}^{N}, \quad \mathbf{x}_{i} \in \mathbb{R}^{d}, \quad y_{i} \in\{-1,+1\} \tag{1}$$

In Eq. (1), $\mathbf{x}_i$ is the $i$th feature vector of dimension $d$ and $y_i$ the corresponding class label.
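For orientation, SVM-like kernel classifiers of this family typically produce decisions of the following form; this is a generic sketch rather than the exact Slackmin formulation (which is given in [11], [15]), with $\alpha_i$ denoting learned coefficients, $b$ a bias term, and $k(\cdot,\cdot)$ a kernel function such as the RBF kernel with width parameter $\gamma$.

```latex
% Generic kernelized decision function of SVM-like classifiers
% (illustrative; the exact Slackmin formulation is given in [11], [15]):
f(\mathbf{x}) = \operatorname{sign}\!\left( \sum_{i=1}^{N} \alpha_i \, y_i \, k(\mathbf{x}_i, \mathbf{x}) + b \right),
\qquad
k(\mathbf{x}_i, \mathbf{x}) = \exp\!\left( -\gamma \, \lVert \mathbf{x}_i - \mathbf{x} \rVert^2 \right)
```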
GPU programming
The main contribution of this manuscript is the computational enhancement of the kernelized Slackmin algorithm by utilizing the parallel programming framework provided by the CUDA architecture of a GPU. Therefore, it is constructive to briefly discuss some fundamental concepts of GPU programming and its impact on machine learning.
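As a concrete illustration of these concepts (an illustrative example, not code from the paper), the minimal CUDA program below shows the standard pattern: data is copied to device memory, a grid of thread blocks is launched, and each thread derives its global index from the block/thread hierarchy.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Each thread computes one element of c = a + b; the global index is
// derived from the CUDA block/thread hierarchy.
__global__ void vecAdd(const float* a, const float* b, float* c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];   // guard against out-of-range threads
}

int main() {
    const int n = 1 << 20;
    size_t bytes = n * sizeof(float);
    float *ha = (float*)malloc(bytes), *hb = (float*)malloc(bytes), *hc = (float*)malloc(bytes);
    for (int i = 0; i < n; ++i) { ha[i] = 1.0f; hb[i] = 2.0f; }

    float *da, *db, *dc;
    cudaMalloc(&da, bytes); cudaMalloc(&db, bytes); cudaMalloc(&dc, bytes);
    cudaMemcpy(da, ha, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(db, hb, bytes, cudaMemcpyHostToDevice);

    // Launch enough blocks of 256 threads to cover all n elements.
    int threads = 256, blocks = (n + threads - 1) / threads;
    vecAdd<<<blocks, threads>>>(da, db, dc, n);
    cudaMemcpy(hc, dc, bytes, cudaMemcpyDeviceToHost);

    printf("c[0] = %f\n", hc[0]);    // expected: 3.0
    cudaFree(da); cudaFree(db); cudaFree(dc);
    free(ha); free(hb); free(hc);
    return 0;
}
```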
GPU-based kernelized Slackmin Algorithm
The promising results of cuLSlackmin [24] have motivated the authors to proceed with the parallelization of the kernelized Slackmin algorithm (Section 2.2) in this section, in order to deal with highly nonlinear classification problems as well as large amounts of data.
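A minimal sketch of the kind of computation such a parallelization targets, evaluating the kernel (Gram) matrix of the training data on the GPU, is given below. This is an illustrative example assuming an RBF kernel and row-major storage, not the authors' implementation; the names `rbfGram`, `X` and `gamma` are ours.

```cuda
#include <cmath>
#include <cstdio>
#include <cuda_runtime.h>

// Illustrative sketch (not the authors' implementation): each thread computes
// one entry K[i][j] = exp(-gamma * ||x_i - x_j||^2) of the n x n Gram matrix
// for a row-major n x d data matrix X. A production version would tile X
// through shared memory so each block reuses the feature vectors it loads.
__global__ void rbfGram(const float* X, float* K, int n, int d, float gamma) {
    int i = blockIdx.y * blockDim.y + threadIdx.y;  // row index
    int j = blockIdx.x * blockDim.x + threadIdx.x;  // column index
    if (i >= n || j >= n) return;
    float dist2 = 0.0f;
    for (int k = 0; k < d; ++k) {
        float diff = X[i * d + k] - X[j * d + k];
        dist2 += diff * diff;
    }
    K[i * n + j] = expf(-gamma * dist2);
}

int main() {
    const int n = 512, d = 16;
    float *dX, *dK;
    cudaMalloc(&dX, n * d * sizeof(float));
    cudaMalloc(&dK, n * n * sizeof(float));
    cudaMemset(dX, 0, n * d * sizeof(float));       // dummy data for the sketch

    dim3 block(16, 16);
    dim3 grid((n + block.x - 1) / block.x, (n + block.y - 1) / block.y);
    rbfGram<<<grid, block>>>(dX, dK, n, d, 0.5f);
    cudaDeviceSynchronize();

    float k00;
    cudaMemcpy(&k00, dK, sizeof(float), cudaMemcpyDeviceToHost);
    printf("K[0][0] = %f\n", k00);                  // expected: 1.0 (zero distance)
    cudaFree(dX); cudaFree(dK);
    return 0;
}
```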
Experimental study
In order to study the convergence rate of the proposed cuKSlackmin algorithm as well as its classification performance, a set of experiments has been conducted. In these experiments, a low-cost, mid-range Nvidia GT440 GPU card with the specifications listed in Table 1 is used. The card is mounted in a desktop computer serving as the host, equipped with an Intel i5-750 2.67 GHz 64-bit CPU (4 cores, 4 threads, L2 cache: 4×256 kB, L3 cache: 8 MB shared), 4 GB RAM and Windows 8.1 OS, whereas the C/C++
Conclusion and future work
A parallel implementation of the kernelized Slackmin algorithm was presented in the previous sections. The proposed cuKSlackmin algorithm utilizes the parallel computing capabilities of a GPU card following the CUDA programming framework. The efficiency of the introduced cuKSlackmin algorithm was examined using four benchmark pattern classification datasets and its performance was compared with that of some established serial and parallel SVM-like algorithms. The experimental results
Acknowledgments
This research has been co-financed by the European Union (European Social Fund ESF) and Greek national funds through the Operational Program “Education and Lifelong Learning” of the National Strategic Reference Framework (NSRF)—Research Funding Program: THALES (MIS 380292). Investing in knowledge society through the European Social Fund.
References (37)
- et al., Efficient binary classification through energy minimisation of slack variables, Neurocomputing (2015)
- et al., Self adaptable multithreaded object detection on embedded multicore systems, J. Parallel Distrib. Comput. (2015)
- et al., Parallel multitask cross validation for support vector machine using GPU, J. Parallel Distrib. Comput. (2013)
- Parallel approaches to machine learning–a comprehensive survey, J. Parallel Distrib. Comput. (2013)
- A. Athanasopoulos, A. Dimou, V. Mezaris, I. Kompatsiaris, GPU acceleration for support vector machines, in: 12th...
- et al., Fast kernel classifiers with online and active learning, J. Mach. Learn. Res. (2005)
- B. Catanzaro, N. Sundaram, K. Keutzer, Fast support vector machine training and classification on graphics processors,...
- et al., LIBSVM: a library for support vector machines, ACM Trans. Intell. Syst. Technol. (2011)
- Training a support vector machine in the primal, Neural Comput. (2007)
- et al., Support-vector networks, Mach. Learn. (1995)
- OpenMP: an industry standard API for shared-memory programming, IEEE Comput. Sci. Eng.
- MapReduce: simplified data processing on large clusters, Commun. ACM
- Neocognitron: a self-organizing neural network model for a mechanism of pattern recognition unaffected by shift in position, Biol. Cybernet.
- A fast learning algorithm for deep belief nets, Neural Comput.
- A comparison of methods for multiclass support vector machines, IEEE Trans. Neural Netw.
G.A. Papakostas received the diploma in Electrical and Computer Engineering in 1999 and the M.Sc. and Ph.D. degrees in Electrical and Computer Engineering in 2002 and 2007, respectively, from the Democritus University of Thrace (DUTH), Greece. From 2007 to 2010 he served as an Adjunct Lecturer at the Department of Production Engineering and Management of DUTH. Dr. Papakostas currently serves as a full Professor at the Department of Computer and Informatics Engineering. He has (co)authored more than 80 publications in indexed journals, international conferences and book chapters. His research interests include pattern recognition, computer/machine vision, computational intelligence, machine learning, feature extraction, evolutionary optimization, parallel and distributed computing, signal and image processing. Dr. Papakostas has served as a reviewer for numerous journals and conferences and he is a member of the IAENG, MIR Labs, EUCogIII and the Technical Chamber of Greece (TEE).
K.I. Diamantaras received the Diploma from the National Technical University of Athens, Greece and the Ph.D. degree in Electrical Engineering from Princeton University in 1992. Subsequently, he joined Siemens Corporate Research, Princeton, NJ, as a Post-Doctoral Researcher and then the Department of Electrical and Computer Engineering, Aristotelian University of Thessaloniki, Greece. Since 1998, he has been with the Department of Information Technology, Technological Education Institute (T.E.I.) of Thessaloniki, where he currently holds the position of Professor. His research interests include machine learning, signal processing, and image processing. He is a Senior Member of the IEEE. He is the co-author of the book "Principal Component Neural Networks: Theory and Applications", Wiley, New York, 1996. He currently serves as an editor for the IEEE Transactions on Signal Processing, the IEEE Signal Processing Letters and the Journal of Signal Processing Systems (Springer). In the past, he was also an Associate Editor for the IEEE Transactions on Neural Networks. In 1997, he received the IEEE Best Paper Award in the area of Neural Networks for Signal Processing for the paper "Adaptive Principal Component Extraction (APEX) and Applications". He has been a member of the technical committee for various machine learning, signal processing and neural networks conferences.
T. Papadimitriou was born in Thessaloniki, Greece, in 1972. He received the Diploma degree in mathematics from the Aristotle University of Thessaloniki, Greece, and the D.E.A. A.R.A.V.I.S (Automatique, Robotique, Algorithmique, Vision, Image, Signale) degree from the University of Nice-Sophia Antipolis, France, both in 1996, and the Ph.D. degree from the Aristotle University of Thessaloniki in 2000. In 2001, he joined the Department of Economics, Democritus University of Thrace, Komotini, Greece, where he served as a lecturer (2002–2008) and assistant professor (2008–2013). Currently, he holds the position of Associate Professor in the same department. Dr. Papadimitriou has co-authored more than 80 journal papers, conference papers and book chapters combined. He has served as a reviewer for various publications and as a member of scientific committees for conferences and workshops. His current research interests include complex networks, machine learning, and data analysis.