Evaluation of k-Nearest Neighbor classifier performance for direct marketing

https://doi.org/10.1016/j.eswa.2009.04.055

Abstract

Text data mining is a process of exploratory data analysis. Classification maps data into predefined groups or classes; it is often referred to as supervised learning because the classes are determined before the data are examined. This paper describes a proposed k-Nearest Neighbor classifier that performs comparative cross-validation against the existing k-Nearest Neighbor classifier. The feasibility and benefits of the proposed approach are demonstrated on a data mining problem: direct marketing, which has become an important application field of data mining. Comparative cross-validation estimates accuracy by either stratified k-fold cross-validation or an equivalent repeated random subsampling. While the proposed method may have a high bias, its performance (accuracy estimation, in our case) may be poor due to a high variance. Thus the accuracy of the proposed k-Nearest Neighbor classifier was lower than that of the existing k-Nearest Neighbor classifier, and the smaller the improvement in runtime, the larger the improvement in precision and recall. With the proposed method we determined both classification accuracy and prediction accuracy; the prediction accuracy is comparatively high.

Introduction

Direct marketing has become an important application field for data mining. In direct marketing (Madeira & Sousa, 2002), companies or organizations try to establish and maintain a direct relationship with their customers in order to target them individually with specific product offers or fund-raising appeals. Large databases of customer and market data are maintained for this purpose. The customers or clients to be targeted in a specific campaign are selected from the database using different types of information, such as demographic data and information on the customer's personal characteristics such as profession, age and purchase history (Bauer, 1988).

Classification is the problem of automatically assigning an object to one of several predefined categories based on the attributes of the object. It has been recognized as one of the core tasks in data mining, a field concerned with the extraction of knowledge or patterns from databases through the building of predictive or descriptive models (Sousa, Kaymak, & Madeira, 2002). For example, insurance companies might want to classify a customer (Lee & Cho, 2007) as being either high-risk or low-risk using various attributes of the customer, such as credit history, annual income, and age. More examples of the problem include hand-written digit recognition, text classification, intrusion detection, and credit card fraud detection.

The problem is usually formulated as follows: a training set of objects (also called instances or records) with their attributes (also called features), as well as the categories or classes to which these objects belong, is given. The attributes of each record can be either categorical or continuous. The task is then to build a classifier that uses the training set to construct a model that predicts the class of a new record, given the attributes of the new record. Because a training dataset is given, the classification problem is also known as supervised induction or supervised learning.

Classification (Dietterich, 1998) is one of the primary data mining tasks. The input to a classification system consists of example tuples, called the training set, with each tuple having several attributes. Attributes can be continuous, coming from an ordered domain, or categorical, coming from an unordered domain. A special class attribute indicates the label or category to which an example belongs. The goal of classification is to induce a model from the training set, which can be used to predict the class of a new tuple.
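To make this formulation concrete, the sketch below induces a toy model from such a training set; the dataset, attribute names, and the simple one-attribute threshold learner are illustrative assumptions, not taken from the paper.

```python
# Hypothetical training set: each tuple has continuous and categorical
# attributes plus a special class attribute ('risk').
training_set = [
    {'age': 25, 'income': 28000, 'profession': 'student',  'risk': 'high'},
    {'age': 45, 'income': 90000, 'profession': 'engineer', 'risk': 'low'},
    {'age': 52, 'income': 75000, 'profession': 'teacher',  'risk': 'low'},
    {'age': 23, 'income': 21000, 'profession': 'student',  'risk': 'high'},
]

def induce_threshold_model(records, attribute, label):
    """Induce a trivial one-attribute model for a two-class problem:
    the decision threshold is the midpoint between the class means
    of a single continuous attribute."""
    groups = {}
    for r in records:
        groups.setdefault(r[label], []).append(r[attribute])
    means = {c: sum(v) / len(v) for c, v in groups.items()}
    (c1, m1), (c2, m2) = sorted(means.items(), key=lambda kv: kv[1])
    threshold = (m1 + m2) / 2
    # records below the threshold get the low-mean class, others the high-mean class
    return lambda r: c1 if r[attribute] < threshold else c2

model = induce_threshold_model(training_set, 'income', 'risk')
print(model({'age': 30, 'income': 95000, 'profession': 'doctor'}))  # -> low
```

The induced model then predicts the class attribute of a new tuple it has never seen, which is exactly the supervised-learning setting described above.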

This paper presents k-Nearest Neighbor classifiers for direct marketing (Blake & Merz, 1998). The proposed k-NN algorithm is based on comparative cross-validation. The rest of the paper is organized as follows. In Section 1 we describe direct marketing and the classification methodology. In Section 2 we describe the state of the art. In Section 3 we describe the proposed k-Nearest Neighbor classifier using comparative cross-validation. In Section 4 we provide the performance evaluation, and in Section 5 we provide the experimental results. We conclude with a summary in Section 6.


State of the art

In this section, the state of the art concerning comparative cross-validation of the k-NN algorithm is surveyed. The results of this survey motivate a new approach.

Existing k-Nearest Neighbor algorithm (Ek-NN)

  Input:
    T   // training data
    K   // number of neighbors
    t   // input tuple to classify
  Output:
    c   // class to which t is assigned

  KNN algorithm:
    N = ∅;                 // find set of neighbors, N, for t
    for each d ∈ T do
      if |N| < K then
        N = N ∪ {d}
      else if ∃ u ∈ N such that sim(t, u) ≤ sim(t, d) then
        begin
          N = N − {u};
          N = N ∪ {d};
        end
    // find class for classification
    c = class to which the most u ∈ N are classified;
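The Ek-NN steps above can be sketched as runnable Python. The similarity measure (negative Euclidean distance, so that larger means more similar) and the toy training data are assumptions for illustration.

```python
import math

def knn_classify(T, K, t, sim):
    """Classify tuple t from training data T using its K most similar neighbors.

    T   -- list of (attributes, class_label) pairs
    K   -- number of neighbors
    t   -- attribute vector to classify
    sim -- similarity function sim(a, b); larger means more similar
    """
    N = []  # current set of candidate neighbors
    for d in T:
        if len(N) < K:
            N.append(d)  # fill up to K neighbors first
        else:
            # replace the least similar neighbor u if d is at least as similar to t
            u = min(N, key=lambda n: sim(t, n[0]))
            if sim(t, u[0]) <= sim(t, d[0]):
                N.remove(u)
                N.append(d)
    # majority vote over the neighbors' class labels
    classes = [label for _, label in N]
    return max(set(classes), key=classes.count)

# negative Euclidean distance as a similarity measure (assumption)
def sim(a, b):
    return -math.dist(a, b)

training = [((1.0, 1.0), 'low-risk'), ((1.2, 0.9), 'low-risk'),
            ((8.0, 8.5), 'high-risk'), ((7.8, 8.1), 'high-risk'),
            ((8.2, 7.9), 'high-risk')]
print(knn_classify(training, 3, (8.0, 8.0), sim))  # -> high-risk
```

Note that this linear scan mirrors the pseudocode exactly; production k-NN implementations typically use index structures (e.g. k-d trees, cf. Friedman et al., 1977) to avoid comparing against every training tuple.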

Proposed k-Nearest Neighbor algorithm (Pk-NN)

The proposed k-Nearest Neighbor classifier (Pk-NN) performs comparative cross-validation against the existing k-Nearest Neighbor classifier (Ek-NN).
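The snippet does not detail the proposed procedure, so the following is only a generic sketch of accuracy estimation by stratified k-fold cross-validation, one of the two estimators named in the abstract; the `classify` callback, the round-robin fold-dealing scheme, and the toy data are assumptions.

```python
import random
from collections import defaultdict

def stratified_k_folds(records, label_of, k=10, seed=0):
    """Split records into k folds, preserving the class proportions
    of the full dataset in every fold (stratification)."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for r in records:
        by_class[label_of(r)].append(r)
    folds = [[] for _ in range(k)]
    for members in by_class.values():
        rng.shuffle(members)
        for i, r in enumerate(members):
            folds[i % k].append(r)  # deal each class round-robin across folds
    return folds

def cross_validated_accuracy(records, label_of, classify, k=10):
    """Estimate accuracy: train on k-1 folds, test on the held-out fold,
    and average over all k rotations."""
    folds = stratified_k_folds(records, label_of, k)
    correct = total = 0
    for i in range(k):
        test = folds[i]
        train = [r for j, f in enumerate(folds) if j != i for r in f]
        for r in test:
            correct += classify(train, r) == label_of(r)
            total += 1
    return correct / total

# toy usage: two well-separated classes, a nearest-value classifier
data = [(i, 'a') for i in range(10)] + [(i + 100, 'b') for i in range(10)]
def nearest_value(train, r):
    return min(train, key=lambda d: abs(d[0] - r[0]))[1]
print(cross_validated_accuracy(data, lambda r: r[1], nearest_value, k=5))  # -> 1.0
```

Stratification keeps rare classes represented in every fold, which matters in direct marketing where respondents are typically a small minority.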

Performance evaluation

In this section, a detailed performance evaluation of the proposed k-Nearest Neighbor algorithm is presented.

Experimental results

In this section we demonstrate the properties and advantages of our approach by means of a direct marketing dataset, and we also present the performance of Pk-NN. The performance of classification algorithms is usually examined by evaluating the accuracy of the classification (see Table 3). However, since classification (Cover & Hart, 1967) is often a fuzzy problem, the correct answer may depend on the user. Traditional algorithm (Jovanovic, Milutinovic, & Obradovic, 2002) evaluation
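Precision and recall for a campaign's positive class are computed from counts of true positives, false positives, and false negatives; the sketch below uses hypothetical labels ('yes' standing for a responding customer), not the paper's data.

```python
def precision_recall(actual, predicted, positive):
    """Compute precision and recall for the given positive class
    (e.g. the responding customers in a direct-marketing campaign)."""
    tp = sum(a == positive and p == positive for a, p in zip(actual, predicted))
    fp = sum(a != positive and p == positive for a, p in zip(actual, predicted))
    fn = sum(a == positive and p != positive for a, p in zip(actual, predicted))
    precision = tp / (tp + fp) if tp + fp else 0.0  # how many flagged are real
    recall = tp / (tp + fn) if tp + fn else 0.0     # how many real are flagged
    return precision, recall

actual    = ['yes', 'yes', 'no', 'no', 'yes', 'no']
predicted = ['yes', 'no',  'no', 'yes', 'yes', 'no']
print(precision_recall(actual, predicted, 'yes'))  # precision = recall = 2/3 here
```

Reporting precision and recall alongside accuracy is particularly informative for direct marketing, where the non-respondent class usually dominates and plain accuracy can be misleading.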

Conclusion

Classification is an important problem in data mining. In this work we developed a text mining classifier using the k-Nearest Neighbor algorithm to measure the training time, classification accuracy, precision and recall on a direct marketing dataset. First, we applied our text mining algorithm, including text mining techniques based on classification, to one dataset. After that, we employed the existing k-Nearest Neighbor algorithm to deal with the measurement of training

Acknowledgements

The authors gratefully acknowledge the authorities of Annamalai University for the facilities offered and encouragement to carry out this work. This work is supported in part by a Career Award for Young Teachers (CAYT) grant to the first author from the All India Council for Technical Education, New Delhi. The authors would also like to thank the reviewers for their valuable remarks.

References (28)

  • Bauer, C. L. (1988). A direct mail customer purchase model. Journal of Direct Marketing.
  • Lee, H., et al. (2007). Focusing on non-respondents: Response modeling with novelty detectors. Expert Systems with Applications.
  • Blake, C., & Merz, C. (1998). UCI repository of machine learning databases....
  • Cover, T., et al. (1967). Nearest neighbor pattern classification. IEEE Transactions on Information Theory.
  • Dasarathy, B. (1991). Nearest neighbor pattern classification techniques.
  • Deerwester, S., et al. (1990). Indexing by latent semantic analysis. Journal of the American Society for Information Science.
  • Dietterich, T. (1998). Approximate statistical tests for comparing supervised classification learning algorithms. Neural Computation.
  • Dunham, M. H. (2003). Data mining – Introductory and advanced topics (pp. 90–92). Pearson...
  • Friedman, J., et al. (1977). An algorithm for finding best matches in logarithmic expected time. ACM Transactions on Mathematical Software.
  • Han, J., & Kamber, M. (2003). Data mining – Concepts and techniques (pp. 359–365)....
  • Hotta, S., Kiyasu, S., & Miyahara, S. (2004). Pattern recognition using average patterns of categorical k-nearest...
  • Ishii, N., Tsuchiya, E., Bao, Y., & Yamaguchi, N. (2005). Combining classification improvements by ensemble...
  • Jacobs, C., Finkelstein, A., & Salesin, D. (1995). Fast multiresolution image querying. In Proceedings of SIGGRAPH 95...
  • Jovanovic, N., Milutinovic, V., & Obradovic, Z. (2002). Foundations of predictive data...