The impact of preprocessing on data mining: An evaluation of classifier sensitivity in direct marketing

https://doi.org/10.1016/j.ejor.2005.07.023

Abstract

Corporate data mining faces the challenge of systematic knowledge discovery in large data streams to support managerial decision making. While research in operations research, direct marketing and machine learning focuses on the analysis and design of data mining algorithms, the interaction of data mining with the preceding phase of data preprocessing has not been investigated in detail. This paper investigates the influence of different preprocessing techniques of attribute scaling, sampling, coding of categorical as well as coding of continuous attributes on the classifier performance of decision trees, neural networks and support vector machines. The impact of different preprocessing choices is assessed on a real world dataset from direct marketing using a multifactorial analysis of variance on various performance metrics and method parameterisations. Our case-based analysis provides empirical evidence that data preprocessing has a significant impact on predictive accuracy, with certain schemes proving inferior to competitive approaches. In addition, it is found that (1) selected methods prove almost as sensitive to different data representations as to method parameterisations, indicating the potential for increased performance through effective preprocessing; (2) the impact of preprocessing schemes varies by method, indicating different ‘best practice’ setups to facilitate superior results of a particular method; (3) algorithmic sensitivity towards preprocessing is consequently an important criterion in method evaluation and selection which needs to be considered together with traditional metrics of predictive power and computational efficiency in predictive data mining.

Introduction

In competitive consumer markets, data mining faces the growing challenge of systematic knowledge discovery in large datasets to achieve operational, tactical and strategic competitive advantages. As a consequence, the support of corporate decision making through data mining has received increasing interest and importance in operational research and industry. As an example, direct marketing campaigns aiming to sell products by means of catalogues or mail offers [1] are restricted to contacting a certain number of customers due to budget constraints. The objective of data mining is to select the customer subset most likely to respond in a mailing campaign, predicting the occurrence or probability of purchase incident, purchase amount or interpurchase time for each customer [2], [3] based upon observable customer attributes of varying scale. Traditionally, response modelling has utilised transactional data consisting of continuous variables to predict purchase incident, focusing on the recency of the last purchase, the frequency of purchases and the overall monetary purchase amount, referred to as recency, frequency and monetary value (RFM) analysis [2]. The continuous scale of these attributes together with their limited number has facilitated the use of conventional statistical methods, such as logistic regression.

Recently, progress in computational and storage capacity has enabled the accumulation of ordinal, nominal, binary and unary demographic and psychographic customer centric data, inducing large, rich datasets of heterogeneous scales. On the one hand, this has advanced the application of data driven methods like decision trees (DT) [4], artificial neural networks (NN) [2], [5], [6], and support vector machines (SVM) [7], capable of mining large datasets. On the other hand, the enhanced data has created particular challenges in transforming attributes of different scales into a mathematically feasible and computationally suitable format. Essentially, each customer attribute may require special treatment for each algorithm, such as discretisation of numerical features, rescaling of ordinal features and encoding of categorical ones. Applying a variety of different methods, the phase of data preprocessing (DPP) represents a complex prerequisite for data mining in the process of knowledge discovery in databases [8].
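The three kinds of attribute treatment named above (rescaling, encoding of categorical features and discretisation of numerical features) can be illustrated with a minimal sketch. It assumes min-max scaling, one-of-N (one-hot) coding and equal-width binning as representative variants; the function names are hypothetical and the excerpt does not state that these are the exact schemes evaluated in the paper.

```python
from typing import List, Sequence


def min_max_scale(values: Sequence[float]) -> List[float]:
    """Linearly rescale a numerical attribute to the interval [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant attribute carries no information; map it to 0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def one_hot(values: Sequence[str]) -> List[List[int]]:
    """Encode a categorical attribute as one binary column per category."""
    categories = sorted(set(values))
    index = {c: i for i, c in enumerate(categories)}
    return [
        [1 if index[v] == j else 0 for j in range(len(categories))]
        for v in values
    ]


def equal_width_bins(values: Sequence[float], n_bins: int) -> List[int]:
    """Discretise a continuous attribute into n_bins equal-width intervals."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0 for _ in values]
    width = (hi - lo) / n_bins
    # The maximum value would fall into bin n_bins, so clamp it to the last bin.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]
```

Each function maps one raw attribute column to a representation a classifier can consume, which is the sense in which one attribute may need a different treatment for each algorithm.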

Aiming to maximise the predictive accuracy of data mining, research in management science and machine learning is largely devoted to enhancing competing classifiers and the effective tuning of algorithm parameters. Classification algorithms are routinely tested in extensive benchmark experiments, evaluating the impact on predictive accuracy and computational efficiency, using preprocessed datasets; e.g. [9], [10], [11]. In contrast to this, research in DPP focuses on the development of algorithms for particular DPP tasks. While feature selection [12], [13], [14], resampling [15], [16] and the discretisation of continuous attributes [17], [18] are analysed in some detail, few publications investigate the impact of data projection for categorical attributes and scaling [19], [20]. More importantly, interactions on predictive accuracy in data mining have not been analysed in detail, especially not within the domain of corporate direct marketing.

To narrow this gap in research and practice, we seek to investigate the potential of DPP in a real world scenario of response modelling, predicting purchase incident to identify those customers most likely to respond to a mailing campaign in the publishing industry. We analyse the impact of different DPP schemes across a selection of established data mining methods. Due to the questionable usefulness of traditional statistical techniques in large scale data mining settings [21], [22] and mixed scaling levels of customer attributes, we confine our analysis to data driven methods of C4.5 DT, NN and SVM.

The remainder of the paper is organised as follows: We begin with a short overview of the classification methods of DT, NN and SVM used. Next, the task of DPP for competing methods for scaling, sampling and coding is discussed in Section 3. Conducting a structured literature review, we exemplify that the influence of DPP is widely overlooked to motivate our further analysis. This is followed by the case study setup of purchase incident modelling for direct marketing in Section 4 and the experimental results providing empirical evidence for the significant impact of DPP on classification performance in Section 5. Conclusions are given in Section 6.

Multilayer perceptrons

NN represent a class of statistical methods capable of universal function approximation, learning non-linear relationships between independent and dependent variables directly from the data without previous assumptions about the statistical distributions [23]. Multilayer perceptrons (MLP) represent a prominent class of NN [24], [25], [26], implementing a paradigm of supervised learning methods which is routinely used in academic and empirical classification and data mining tasks [27], [28], [29]

Current research in data preprocessing

The application of each data mining algorithm requires the presence of data in a mathematically feasible format, achieved through DPP. Consequently, DPP represents a prerequisite phase for data mining in the process of knowledge discovery in databases. DPP tasks are distinguished in data reduction, aiming at decreasing the size of the dataset by means of instance selection and/or feature selection, and data projection, altering the representation of the data, e.g. mapping continuous variables
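As an illustration of data reduction by instance selection, the following is a minimal sketch of random undersampling of the majority class. It assumes binary labels coded 0/1 and a balanced 1:1 target ratio; the function name, the seed parameter and the ratio are hypothetical choices for this example, not the sampling schemes specified in the paper.

```python
import random
from typing import List, Sequence, Tuple


def undersample_majority(
    rows: Sequence[object], labels: Sequence[int], seed: int = 0
) -> Tuple[List[object], List[int]]:
    """Randomly drop majority-class instances until both classes are equal in size."""
    rng = random.Random(seed)
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    minority, majority = (pos, neg) if len(pos) <= len(neg) else (neg, pos)
    # Keep every minority instance and an equally sized random majority subset.
    kept = sorted(minority + rng.sample(majority, len(minority)))
    return [rows[i] for i in kept], [labels[i] for i in kept]
```

In response modelling the responders are typically the minority class, so this reduces the dataset mainly by discarding non-responders.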

Experimental setup

We analyse the impact of individual DPP choices on classification performance in a structured experiment, based upon the characteristics of an empirical dataset from a previous direct mailing campaign conducted in the publishing industry. The objective is to evaluate customers for cross-selling, identifying those most likely to buy an additional magazine subscription from all customers already subscribed to at least one periodical. The original campaign contacted 300,000 customers, of which

Impact of data preprocessing across classification methods

We calculate the lift index of SVM, NN and DT across 32 experimental designs of different DPP variants and across three datasets of training, validation and test data, visualised in Fig. 3.
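The excerpt does not reproduce the formula for the lift index. A minimal sketch, assuming the weighted-decile definition that is common in the direct marketing literature (customers are ranked by predicted score, split into ten equal groups, and responders in the top group are weighted 1.0, in the next 0.9, down to 0.1), could look as follows. The function name and the `n_bins` parameter are hypothetical.

```python
from typing import Sequence


def lift_index(
    scores: Sequence[float], labels: Sequence[int], n_bins: int = 10
) -> float:
    """Weighted share of responders captured per score-ranked bin (assumed definition)."""
    # Rank customers from highest to lowest predicted score.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    ranked = [labels[i] for i in order]
    total_pos = sum(ranked)
    if total_pos == 0:
        raise ValueError("lift index is undefined without positive cases")
    n = len(ranked)
    weighted = 0.0
    for b in range(n_bins):
        start = b * n // n_bins
        end = (b + 1) * n // n_bins
        weight = (n_bins - b) / n_bins  # 1.0, 0.9, ..., 0.1 for deciles
        weighted += weight * sum(ranked[start:end])
    return weighted / total_pos
```

Under this definition a ranking that places all responders in the top decile scores 1.0, while responders spread evenly over all deciles yield 0.55, the mean of the ten weights.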

To quantify the impact and significance of each DPP candidate on the classification performance of different methods, we conduct a multifactorial analysis of variance with extended multi comparison tests of estimated marginal means across all methods and for each of the three methods separately. The

Conclusions

We investigate the impact of different DPP techniques of attribute scaling, sampling, coding of categorical and continuous attributes on classifier performance of NN, SVM and DT in a case-based evaluation of a direct marketing mailing campaign. Supported by a multifactorial analysis of variance, we provide empirical evidence that DPP has a significant impact on predictive accuracy. While certain DPP schemes of undersampling prove consistently inferior across classification methods and

References (62)

  • R. Kohavi et al., Wrappers for feature subset selection, Artificial Intelligence (1997)
  • E.L. Nash, The Direct Marketing Handbook (1992)
  • S. Viaene et al., Wrapped input selection using multilayer perceptrons for repeat-purchase modeling in direct marketing, International Journal of Intelligent Systems in Accounting, Finance and Management (2001)
  • D. Haughton et al., Direct marketing modeling with CART and CHAID, Journal of Direct Marketing (1999)
  • J. Zahavi et al., Issues and problems in applying neural computing to target marketing, Journal of Direct Marketing (1999)
  • J. Zahavi et al., Applying neural computing to target marketing, Journal of Direct Marketing (1999)
  • S. Viaene et al., Knowledge discovery in a direct marketing case using least squares support vector machines, International Journal of Intelligent Systems (2001)
  • D. Pyle, Data Preparation for Data Mining (1999)
  • T.-S. Lim et al., A comparison of prediction accuracy, complexity, and training time of thirty-three old and new classification algorithms, Machine Learning (2000)
  • B. Baesens et al., Benchmarking state-of-the-art classification algorithms for credit scoring, Journal of the Operational Research Society (2003)
  • S. Viaene et al., A comparison of state-of-the-art classification techniques for expert automobile insurance claim fraud detection, Journal of Risk and Insurance (2002)
  • Y.S. Kim et al., Customer targeting: A neural network approach guided by genetic algorithms, Management Science (2005)
  • J. Yang, S. Olafsson, Optimization-based feature selection with adaptive instance sampling, Computers and Operations...
  • N.V. Chawla et al., SMOTE: Synthetic minority over-sampling technique, Journal of Artificial Intelligence Research (2002)
  • M. Kubat, S. Matwin, Addressing the curse of imbalanced training sets: One-sided selection, in: Proceedings of the 14th...
  • P. Berka et al., Empirical comparison of various discretization procedures, International Journal of Pattern Recognition and Artificial Intelligence (1998)
  • U.M. Fayyad et al., On the handling of continuous-valued attributes in decision tree generation, Machine Learning (1992)
  • W.S. Sarle, Neural Network FAQ, 2004, Downloadable from website...
  • S. Zhang et al., Data preparation for data mining, Applied Artificial Intelligence (2003)
  • C.M. Bishop, Neural Networks for Pattern Recognition (1995)
  • J.A.K. Suykens et al., Nonlinear Modeling: Advanced Black-box Techniques (1998)