
Pattern Recognition Letters

Volume 32, Issue 3, 1 February 2011, Pages 540-545

New separating hyperplane method with application to the optimisation of direct marketing campaigns

https://doi.org/10.1016/j.patrec.2010.11.007

Abstract

In this article we present a new class of separating hyperplane methods for the binary classification task. Our hyperplanes have a very low Vapnik–Chervonenkis dimension, so they generalise well. Geometrically, our approach is based on searching for a proper pair of observations from different classes of the explained variable. Once this pair is found, the discriminant hyperplane becomes orthogonal to the line connecting these observations. This method allows the direct optimisation of any prediction criterion, not necessarily the fraction of correctly classified observations. Models generated by this technique have low computational complexity and allow fast classification. We illustrate the performance of our method by applying it to the problem of the optimisation of direct marketing campaigns, where the natural measure of prediction performance is the lift curve.

Research Highlights

► Separating hyperplanes with low Vapnik–Chervonenkis dimension are constructed.
► They allow fast classification of new observations.
► They allow direct optimisation of any classification criterion.
► The performance is illustrated on a direct marketing campaign from mobile telephony.

Introduction

Using concepts of the Vapnik–Chervonenkis theory (Vapnik, 1998), the performance of a classifier depends on the error rate on the train sample and on the Vapnik–Chervonenkis (VC) dimension of the classifier. The VC dimension of a classifier is defined as the maximal number of points it can shatter. In order to generalise well and achieve a low misclassification rate on the test sample, the classifier should have a low misclassification rate on the train sample and a low VC dimension. This idea is the basis of the support vector machines (svm). The functional form of the separating hyperplane generated by svm is a sum of dot products of feature vectors of some observations in the sample. These observations are called support vectors. The dot product can also be expressed in terms of kernel functions, allowing a non-linear classification function. This is known as the kernel trick.
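For reference, the decision function of such a classifier can be written in the standard dual, kernelised form shown below, where the α_i are the non-negative multipliers obtained during training; only observations with α_i > 0 enter the sum, and these are exactly the support vectors.

```latex
f(x) = \operatorname{sign}\!\left( \sum_{i=1}^{n} \alpha_i \, y_i \, K(x_i, x) + b \right),
\qquad \alpha_i \ge 0 .
```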

The above idea of Vapnik (1998) was then extended in numerous ways and there is a huge number of articles about svm. Many techniques have been developed to reduce the complexity (measured by the number of support vectors) of the generated svm model. For example, Zhan and Shen (2005) proposed a four-step algorithm which excludes the support vectors that make the separating hyperplane highly convoluted and then further approximates the separating hyperplane with a subset of the support vectors. Keerthi et al. (2006) used a variant of the forward selection method to determine which observations should form the basis functions; their algorithm stops when the desired number of support vectors is included. Renjifo et al. (2008) also addressed this problem: their technique, the incremental asymmetric proximal support vector machine, employs a greedy search across the training data to select the basis vectors of the classifier. They also point out that svms with a small number of support vectors are preferable in applications where the prediction of new observations must be done at run time using limited resources (computational power and memory), for example on embedded devices. The time and memory needed to compute the prediction for a new observation grow with the number of support vectors, so there is a need to construct svms with a limited number of support vectors. In their scenario, the training of the svm model may be done off-line with unlimited resources; only the resources for prediction are restricted. See also Renjifo et al. (2008) for a further review of methods that lead to a reduced number of support vectors.

In our empirical example we use datasets from the UCI Machine Learning Repository (Asuncion and Newman, 2007) and data about the clients of a Polish telecommunication company. In telecommunication companies, predictive models are usually implemented in the database. The databases are very large: each record represents one client and there are millions of clients. The number of features describing a client's behaviour is also large. So in this case, too, it is important to implement the scoring mechanism efficiently, without placing an unnecessary burden on the database.

In this article we present a class of classifiers which are also expressed in terms of dot products of some observations from the train sample. Our approach is constructed with an extremely small number of “support vectors”: only two. So the computational complexity related to the prediction of new observations is very small.

Our hyperplanes also allow the use of the kernel trick.

Our approach has an additional feature that makes it extremely useful in some applications: it allows the direct optimisation of any prediction criterion. Usually the performance of a model is measured by the fraction of correctly classified observations. However, in some applications, especially in direct marketing and credit scoring, the main issue is the ability of the method to sort observations with respect to the probability of class membership. For example, if there are only 1% of observations from the first class and 99% from the second class in the sample, then the model that classifies all observations to the majority class has a fraction of correctly classified observations equal to 99%, but this model is useless. In direct marketing we are usually interested in separating a small group of clients with the maximal probability of belonging to the minority class. These clients then form the target group of the marketing campaign. This is usually done by sorting observations with respect to the estimated probability; the observations from the top decile of this score form the target group. So the performance of the model can be measured by the fraction of observations from the minority class in the top quantile (say, the top 0.1) of the score. If we want to compare a few models, we may sort the sample by the score of each model and then compare the fractions of observations from the minority class in each top decile. Sometimes this measure is expressed graphically as a lift curve: the OX axis represents the order of the quantile of the score and the OY axis represents the fraction of observations from a certain class in this quantile. Usually, when building a model, we have little control over this measure. For example, logistic regression is based on the maximisation of the likelihood. This criterion does not necessarily lead to good performance in terms of the lift curve, because the likelihood is based on the average score across the sample, whereas we are interested only in the right tail of the distribution. Our approach can directly maximise this type of prediction criterion.
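To make this criterion concrete, the short sketch below (our own Python illustration; the array names y and score are hypothetical) computes the fraction of minority-class observations in the top quantile of a model score, which is the quantity the lift curve displays for the first quantile.

```python
import numpy as np

def top_quantile_fraction(y, score, q=0.1):
    """Fraction of minority-class (y == 1) observations among the top q of the score.

    y     : array of class labels, with 1 marking the minority / target class
    score : array of model scores, higher meaning more likely to belong to class 1
    q     : size of the top quantile (0.1 corresponds to the top decile)
    """
    y = np.asarray(y)
    score = np.asarray(score)
    n_top = max(1, int(np.floor(q * len(y))))   # number of observations in the top quantile
    top = np.argsort(-score)[:n_top]            # indices of the highest-scoring observations
    return float(np.mean(y[top] == 1))          # fraction of minority-class observations among them

# Comparing two models on the same sample then amounts to comparing
# top_quantile_fraction(y, score_model_a) with top_quantile_fraction(y, score_model_b).
```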

This article is organised as follows. In the next section we derive our class of separating hyperplanes and describe its statistical properties. Then we discuss numerical issues concerning its computer implementation. In the fourth section we illustrate its performance using data about the direct marketing campaign of one of the Polish telecommunication companies and using UCI Machine Learning Repository datasets. The last section concludes the article.

Section snippets

The classifier

We introduce the following class of classifiers:

d(x) = sign(y_i x_i^T x + y_j x_j^T x - b),   where b = y_i x_i^T x_k + y_j x_j^T x_k and y_i y_j = -1.

The labels of the class y are +1 and −1. Our hyperplanes are expressed in terms of the dual form of the classical svm. In order to find the optimal decision rule d it is necessary to find i, j and k. Of course, the dot product x_i^T x can be replaced by a kernel function K(x_i, x) to allow non-linear classification.
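A minimal sketch of how this decision rule could be evaluated for a new observation is given below (Python with NumPy; the function and variable names are our own, and the indices i, j, k are assumed to have already been selected). Replacing the default dot product by a kernel function gives the non-linear variant mentioned above.

```python
import numpy as np

def hyperplane_score(x_new, X, y, i, j, k, kernel=np.dot):
    """Value of y_i K(x_i, x) + y_j K(x_j, x) - b for a new observation x_new,
    with b = y_i K(x_i, x_k) + y_j K(x_j, x_k) and y_i * y_j = -1.

    X : (n, p) array of training feature vectors
    y : array of training labels in {-1, +1}
    i, j, k : indices of the training observations defining the hyperplane
    kernel : np.dot by default; any kernel function K(u, v) may be substituted
    """
    b = y[i] * kernel(X[i], X[k]) + y[j] * kernel(X[j], X[k])
    return y[i] * kernel(X[i], x_new) + y[j] * kernel(X[j], x_new) - b

def classify(x_new, X, y, i, j, k, kernel=np.dot):
    """Predicted class label, +1 or -1."""
    return 1 if hyperplane_score(x_new, X, y, i, j, k, kernel) > 0 else -1
```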

The geometrical interpretation is as follows: we consider only separating hyperplanes that are orthogonal to the line connecting a pair of observations from different classes, with the intercept fixed by a third observation x_k.

Implementation

We show how to modify this algorithm to reduce the computational burden associated with the number of features.

Note that for i = 1, …, n_1, j = 1, …, n_{-1}, k = 1, …, n we have

a^T x_k = (x_{1i} - x_{-1j})^T x_k = x_{1i}^T x_k - x_{-1j}^T x_k.

Let us consider the n × k matrix X whose ith row consists of the vector of features of the ith observation, X_{i·} = x_i^T. Then

A = X X^T = [a_{ij}],   a_{ij} = x_i^T x_j,

so

a^T x_k = A[1i, k] - A[-1j, k].

So in order to calculate the scores a^T x it is sufficient to calculate the matrix A at the beginning of the computation and use its elements in the inner loop.
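A sketch of this precomputation in NumPy (the names are our own) is given below: the Gram matrix A = X X^T is built once before the search over (i, j, k), and every score a^T x_k is then obtained as a difference of two of its entries, so a single evaluation no longer depends on the number of features.

```python
import numpy as np

def gram_matrix(X):
    """A = X X^T, so that A[i, k] = x_i^T x_k."""
    return X @ X.T

def score_from_gram(A, i, j, k):
    """a^T x_k for a = x_i - x_j, where x_i is the chosen class +1 observation
    and x_j the chosen class -1 observation; no feature vectors are touched."""
    return A[i, k] - A[j, k]

# Usage sketch inside the search: all scores a^T x_k for a fixed pair (i, j)
# A = gram_matrix(X)            # computed once, at the beginning
# scores = A[i, :] - A[j, :]    # a^T x_k for k = 1, ..., n in one vectorised step
```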

Methods

In our experiment we used our separating hyperplane algorithm and the following methods as benchmarks: regularised linear discriminant analysis (an example of a linear classifier), standard svm (the implementation of Chapelle, 2007) and svm with reduced complexity (the implementation of Keerthi et al., 2006). We did not use the kernel trick. As a measure of model performance we used the fraction of correctly classified observations and the lift criterion: we measured P(Y = 1) in the top quantile of the score.

Conclusions

In this article we presented a new class of separating hyperplanes with a very simple form. Our approach has a very low VC dimension and is a good classification tool for selecting target groups for marketing campaigns. It has no additional parameters and needs no tuning in the linear case. It can also be easily implemented. Our implementation is very simple, and it can be easily parallelised, which may be crucial for large datasets.

