Evaluation of k-Nearest Neighbor classifier performance for direct marketing
Introduction
Direct marketing has become an important application field for data mining. In direct marketing (Madeira & Sousa, 2002), companies or organizations try to establish and maintain a direct relationship with their customers in order to target them individually with specific product offers or fund-raising appeals. Large databases of customer and market data are maintained for this purpose. The customers or clients to be targeted in a specific campaign are selected from the database on the basis of different types of information, such as demographic data and the customer's personal characteristics, e.g. profession, age and purchase history (Bauer, 1988).
Classification is the problem of automatically assigning an object to one of several predefined categories based on the attributes of the object. It has been recognized as one of the core tasks in data mining, a field concerned with the extraction of knowledge or patterns from databases by building predictive or descriptive models (Sousa, Kaymak, & Madeira, 2002). For example, an insurance company might want to classify a customer (Lee & Cho, 2007) as either high-risk or low-risk using various attributes of the customer, such as credit history, annual income, and age. Further examples of the problem include hand-written digit recognition, text classification, intrusion detection, and credit card fraud detection.
The problem is usually formulated as follows: a training set of objects (also called instances or records) and their attributes (also called features), together with the categories or classes to which these objects belong, is given. The attributes of each record can be either categorical or continuous. The task is then to build a classifier that uses the training set to construct a model for predicting the class of a new record, given the attributes of that record. Because a training dataset is given, the classification problem is also known as supervised induction or supervised learning.
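The formulation above can be made concrete with a toy direct-marketing training set in Python (the attribute names, values, and class labels below are illustrative, not taken from the paper's dataset):

```python
# Each record is a pair (attributes, class label). Attributes mix
# categorical values (profession) and continuous values (age, purchases),
# as described in the formulation above.
training_set = [
    ({"profession": "teacher",  "age": 34, "purchases": 12}, "respond"),
    ({"profession": "engineer", "age": 45, "purchases": 2},  "no_respond"),
    ({"profession": "teacher",  "age": 29, "purchases": 9},  "respond"),
]

# A new record whose class is unknown; a classifier induced from
# training_set would predict its class from these attributes.
new_record = {"profession": "engineer", "age": 41, "purchases": 3}
```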
Classification (Dietterich, 1998) is one of the primary data mining tasks. The input to a classification system consists of example tuples, called the training set, with each tuple having several attributes. Attributes can be continuous, coming from an ordered domain, or categorical, coming from an unordered domain. A special class attribute indicates the label or category to which an example belongs. The goal of classification is to induce a model from the training set that can be used to predict the class of a new tuple.
This paper presents k-Nearest Neighbor classifiers for direct marketing, evaluated on a benchmark dataset (Blake & Merz, 1998). The proposed k-NN algorithm is based on comparative cross-validation. The rest of the paper is organized as follows. Section 1 describes direct marketing and the classification methodology. Section 2 reviews the state of the art. Section 3 describes the proposed k-Nearest Neighbor classifier using comparative cross-validation. Section 4 provides the performance evaluation, and Section 5 presents the experimental results. Section 6 concludes with a summary.
State of the art
In this section, the state of the art concerning comparative cross-validation of the k-NN algorithm is surveyed. The results of this survey motivate a new approach.
Existing k-Nearest Neighbor algorithm (Ek-NN)
Input:
T // training data
K // number of neighbors
t // input tuple to classify
Output:
c // class to which t is assigned

KNN algorithm:
N = ∅; // find set N of K neighbors for t
for each d ∈ T do
  if |N| < K then
    N = N ∪ {d}
  else if ∃u ∈ N such that sim(t, u) ⩽ sim(t, d) then
    begin
      N = N − {u};
      N = N ∪ {d};
    end
// find class for classification
c = class to which the most u ∈ N are classified;
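The neighbor-maintenance loop above can be sketched in Python. This is a minimal illustration, not the paper's implementation: the function names are ours, and cosine similarity stands in for the unspecified sim(·,·):

```python
import math

def cosine_sim(a, b):
    """Cosine similarity between two non-zero numeric attribute vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def knn_classify(train, t, k):
    """train: list of (attribute_vector, class_label) pairs; t: vector to classify."""
    neighbors = []  # list of (record, label), kept at size <= k
    for d, label in train:
        if len(neighbors) < k:
            neighbors.append((d, label))
        else:
            # Find the neighbor u least similar to t; replace it with d
            # if sim(t, u) <= sim(t, d), as in the pseudocode above.
            u = min(neighbors, key=lambda n: cosine_sim(t, n[0]))
            if cosine_sim(t, u[0]) <= cosine_sim(t, d):
                neighbors.remove(u)
                neighbors.append((d, label))
    # Assign the majority class among the k retained neighbors.
    labels = [label for _, label in neighbors]
    return max(set(labels), key=labels.count)
```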
Proposed k-Nearest Neighbor algorithm (Pk-NN)
This paper describes the proposed k-Nearest Neighbor classifier, which performs comparative cross-validation on the existing k-Nearest Neighbor classifier.
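The comparative cross-validation procedure is not spelled out in this snippet. As a hedged sketch, standard k-fold cross-validation for estimating a classifier's accuracy (the function name and the `classify_fn(train, attrs)` interface are our assumptions) can be written as:

```python
import random

def k_fold_cross_validate(data, classify_fn, k_folds=10, seed=0):
    """Estimate accuracy of classify_fn via k-fold cross-validation.

    data: list of (attributes, label) pairs.
    classify_fn(train, attrs): returns a predicted label for attrs
    given a training list of (attributes, label) pairs.
    """
    data = data[:]
    random.Random(seed).shuffle(data)
    # Split the shuffled data into k roughly equal folds.
    folds = [data[i::k_folds] for i in range(k_folds)]
    correct = total = 0
    for i in range(k_folds):
        test = folds[i]
        train = [r for j, fold in enumerate(folds) if j != i for r in fold]
        for attrs, label in test:
            if classify_fn(train, attrs) == label:
                correct += 1
            total += 1
    return correct / total
```

Two classifiers can then be compared by evaluating both with the same folds (same seed) and comparing the resulting accuracy estimates.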
Performance evaluation
In this section a detailed performance evaluation of the proposed k-Nearest Neighbor algorithm is carried out.
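The evaluation measures named in this paper (classification accuracy, precision and recall) can be computed from prediction counts for a binary direct-marketing task. A minimal sketch, with illustrative label names:

```python
def binary_metrics(y_true, y_pred, positive="respond"):
    """Accuracy, precision and recall for a binary classification task."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p != positive)
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0  # fraction of flagged that responded
    recall = tp / (tp + fn) if tp + fn else 0.0     # fraction of responders found
    return accuracy, precision, recall
```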
Experimental results
In this section we demonstrate the properties and advantages of our approach on a direct marketing dataset, and we also present the performance of Pk-NN. The performance of classification algorithms is usually examined by evaluating the accuracy of the classification (see Table 3). However, since classification (Cover & Hart, 1967) is often a fuzzy problem, the correct answer may depend on the user. Traditional algorithm (Jovanovic, Milutinovic, & Obradovic, 2002) evaluation…
Conclusion
Classification is an important problem in data mining. In this work we developed a text mining classifier using the k-Nearest Neighbor algorithm to measure training time, classification accuracy, precision and recall on a direct marketing dataset. First, we applied our text mining algorithm, including text mining techniques based on classification, to one dataset. After that, we employed the existing k-Nearest Neighbor algorithm to deal with the measurement of training…
Acknowledgements
The authors gratefully acknowledge the authorities of Annamalai University for the facilities offered and for their encouragement to carry out this work. This work is supported in part by a Career Award for Young Teachers (CAYT) grant from the All India Council for Technical Education, New Delhi, received by the first author. The authors would also like to thank the reviewers for their valuable remarks.
References (28)
- Bauer, C. L. (1988). A direct mail customer purchase model. Journal of Direct Marketing.
- Lee, H., & Cho, S. (2007). Focusing on non-respondents: Response modeling with novelty detectors. Expert Systems with Applications.
- Blake, C., & Merz, C. (1998). UCI repository of machine learning databases.
- Cover, T., & Hart, P. (1967). Nearest neighbor pattern classification. IEEE Transactions on Information Theory.
- Dasarathy, B. V. (1991). Nearest neighbor pattern classification techniques.
- Deerwester, S., et al. (1990). Indexing by latent semantic analysis. Journal of the American Society for Information Science.
- Dietterich, T. G. (1998). Approximate statistical tests for comparing supervised classification learning algorithms. Neural Computation.
- Dunham, M. H. (2003). Data mining – Introductory and advanced topics (pp. 90–92). Pearson.
- Friedman, J. H., Bentley, J. L., & Finkel, R. A. (1977). An algorithm for finding best matches in logarithmic expected time. ACM Transactions on Mathematical Software.
- Han, J., & Kamber, M. (2003). Data mining – Concepts and techniques (pp. 359–365).