A mixed-integer programming approach to the clustering problem with an application in customer segmentation

https://doi.org/10.1016/j.ejor.2005.04.048

Abstract

This paper presents a mathematical programming based clustering approach that is applied to a digital platform company’s customer segmentation problem involving demographic and transactional attributes of the customers. The clustering problem is formulated as a mixed-integer programming problem with the objective of minimizing the maximum cluster diameter among all clusters. To overcome the computational complexity of the problem, we developed a heuristic approach that improves computational times dramatically without compromising optimality in most of the cases that we tested. The performance of this approach is tested on a real problem. The analysis of our results indicates that our approach is computationally efficient and creates a meaningful segmentation of the data.

Introduction

In recent years, companies have concentrated on understanding the needs and expectations of their customers and grouping the existing and potential customers into classes with the purpose of improving the efficiency of their marketing strategies and increasing their market share. The abundance of large data collections and the need to extract hidden knowledge within them has triggered the development of algorithms to detect unknown patterns in data sets. Clustering analysis is a data mining technique developed for the purpose of identifying groups of entities that are similar to each other with respect to certain similarity measures.

There are many approaches to the clustering problem, including optimization based methods that involve mathematical programming models for developing efficient and meaningful clustering schemes. Exact and heuristic algorithms for these models have been proposed. However, most of these algorithms become inefficient as the size and the dimension of the data set increase.

In a clustering problem, given a data set with n data items (instances) in m dimensions (attributes), the aim is to find an exact partitioning of the data items into k clusters. One of the key points is the definition of the term “similarity”, which helps define the form of the objective function of the optimization model [21]. The objective function of the clustering problem can be defined in several ways, such as minimization of the sum of within-cluster distances and minimization of the maximum within-cluster distance [13]. Fisher [20] modeled the first objective function studied by Rao [13] for the single-dimension case and proposed a least-squares algorithm without a stopping criterion. In later years, the K-Means algorithm was implemented by applying this criterion as an error function. In the widely known K-Means approach, the iterative objective is to minimize the sum of 2-norm distances between each data point and the center of the cluster to which it belongs [15]. Another consideration in clustering is that the resulting clusters are expected to be homogeneous and compact with respect to certain characteristics. In addition, one should decide how the clusters are constructed. The clusters could be ‘exclusive’, ‘overlapping’ or ‘probabilistic’ [10]. In the last case, a data point belongs to a particular cluster with a certain probability, hence fuzzy clustering is achieved.
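As a rough illustration of the two objective forms mentioned above, the following sketch evaluates both on a fixed partition. The toy points and the function names are assumptions made for illustration and are not taken from the paper; the first criterion is computed with plain 2-norm distances to the centroid, as described in the text, although the squared form is also common in practice.

```python
import math

def centroid(points):
    """Component-wise mean of a non-empty list of points."""
    dim = len(points[0])
    return tuple(sum(p[d] for p in points) / len(points) for d in range(dim))

def sum_to_centroid(clusters):
    """Sum of 2-norm distances from each point to its own cluster centroid."""
    total = 0.0
    for pts in clusters:
        c = centroid(pts)
        total += sum(math.dist(p, c) for p in pts)
    return total

def max_diameter(clusters):
    """Largest pairwise distance found inside any single cluster."""
    return max(
        (math.dist(p, q)
         for pts in clusters
         for i, p in enumerate(pts)
         for q in pts[i + 1:]),
        default=0.0,
    )

# Hypothetical toy partition of four 2-D points into k = 2 clusters.
clusters = [[(0.0, 0.0), (0.0, 2.0)], [(5.0, 0.0), (9.0, 0.0)]]

print(sum_to_centroid(clusters))  # sum-type criterion
print(max_diameter(clusters))     # bottleneck (max-type) criterion
```

The sum-type criterion rewards clusters that are tight on average, whereas the max-type criterion is driven entirely by the single widest cluster.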

In this paper, we propose a mixed-integer programming model to partition the data set into exclusive clusters, where we assume that the number of desired clusters k is known a priori. The objective of the model is to minimize the maximum diameter of the generated clusters, with the goal of obtaining evenly compact clusters. The original formulation turns out to be computationally demanding. Moreover, there exist alternative optimal solutions, since the objective function of the model is insensitive to assignments except for the ones that occur in the “largest” cluster. Hence, we develop a heuristic approach that is based on solving the model with initial seeds in order to improve the solution time, followed by a reassignment heuristic aimed at improving the cluster quality of the model by incorporating the sum of within-cluster distance averages as a measure.
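The existence of alternative optima can be seen on a very small instance. The sketch below enumerates every partition of four hypothetical one-dimensional points into k = 2 non-empty clusters, keeps those with the smallest maximum diameter, and then breaks the tie with a secondary measure. The data, the function names, and the exact definition of the secondary measure (average pairwise distance per cluster, summed over clusters) are assumptions for illustration only; the paper's own reassignment heuristic is not reproduced here.

```python
from itertools import combinations, product

def partitions(points, k):
    """Yield each partition of distinct points into k non-empty clusters once."""
    seen = set()
    for labels in product(range(k), repeat=len(points)):
        if len(set(labels)) < k:
            continue  # skip labelings that leave a cluster empty
        part = frozenset(
            frozenset(p for p, lab in zip(points, labels) if lab == j)
            for j in range(k)
        )
        if part not in seen:  # ignore relabelings of the same partition
            seen.add(part)
            yield part

def max_diameter(part):
    """Largest within-cluster pairwise distance (1-D points)."""
    return max(
        max((abs(a - b) for a, b in combinations(c, 2)), default=0.0)
        for c in part
    )

def sum_of_average_distances(part):
    """Assumed tie-breaker: average pairwise distance per cluster, summed."""
    total = 0.0
    for c in part:
        pairs = [abs(a - b) for a, b in combinations(c, 2)]
        if pairs:
            total += sum(pairs) / len(pairs)
    return total

points = [0.0, 3.0, 5.0, 10.0]  # hypothetical data
all_parts = list(partitions(points, 2))
best = min(max_diameter(p) for p in all_parts)
optimal = [p for p in all_parts if max_diameter(p) == best]
chosen = min(optimal, key=sum_of_average_distances)

print(best, len(optimal))
print(sorted(sorted(c) for c in chosen))
```

Here two different partitions attain the same minimum maximum diameter, so the min-max objective alone cannot distinguish between them; a secondary compactness measure is one way to choose. Exhaustive enumeration is only feasible for tiny instances, which is consistent with the computational difficulty noted above.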

We used real data from a satellite broadcasting company, Digiturk, in our computational experiments. The company, founded in 1999, is a private digital platform operating in Turkey. The firm has around 800,000 customers and provides five product packages, three pay-per-view services, and various channels, interactive channels and events to its customers. Digiturk is eager to explore opportunities in customer relationship marketing, such as one-to-one marketing. The company would like to segment its customers based on transactional factors, such as their package subscriptions, pay-per-view purchases, and interactive event interests. We give an interpretation of the customer segments obtained by our clustering algorithm applied to Digiturk data in Section 6.

This paper is organized as follows. We survey related work on clustering and optimization based clustering methods in Section 2. We present our proposed model in Section 3. The proposed heuristic clustering approach is given in Section 4. In Section 5, we explain the solution of the algorithm on an illustrative example and compare the results with the results of the K-Means algorithm. Then, in Section 6, we apply the proposed algorithm to the Digiturk problem to analyze its performance and efficiency, and we compare our findings with the solution of the K-Means algorithm. Finally, the paper is concluded by presenting the results of the study and discussing ideas for future work in the domain of optimization and clustering.

Section snippets

Literature review

Data mining (DM) has been an integral part of customer relationship management (CRM) studies, with the premise that companies can achieve successful customer relations if they understand their customers’ characteristics and desires, as also pointed out by Nemati and Barko [9]. Rygielski et al. [4] gave an overview of data mining, its applications in industry, and the techniques used under this topic, and explained the relation and interaction between DM and CRM applications from various aspects.

A mathematical programming model for clustering

We now present a mathematical formulation for the clustering problem with the objective of minimizing the maximum cluster diameter. Similar formulations have been studied by Rao [13] and Brusco [12].
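For orientation, one standard way to write a min-max diameter partitioning model is sketched below. It is not necessarily the exact model of this paper, whose full formulation is not reproduced in this excerpt. The notation is assumed for illustration: d_{ii'} is the distance between data items i and i', the binary variable x_{ij} equals 1 if item i is assigned to cluster j, and D_max is the maximum cluster diameter.

```latex
\begin{align}
\min \quad & D_{\max} \\
\text{s.t.} \quad & \sum_{j=1}^{k} x_{ij} = 1, && i = 1,\dots,n, \\
& D_{\max} \ge d_{ii'}\,\bigl(x_{ij} + x_{i'j} - 1\bigr), && 1 \le i < i' \le n,\; j = 1,\dots,k, \\
& x_{ij} \in \{0,1\}, \quad D_{\max} \ge 0 .
\end{align}
```

The second constraint is active only when items i and i' share cluster j, in which case it forces D_max to be at least their distance; otherwise its right-hand side is non-positive and the constraint is redundant.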

Given a data set of n data items in m dimensions, i.e. a set of n points in R^m, the goal of the proposed mathematical model is to find the optimal partitioning of the data set into k exclusive clusters assuming that the number of desired clusters is known a priori. In the model, the objective

An improved algorithm

The idea of fixing the assignment of some instances to certain clusters has been used in clustering algorithms before with the goal of improving computational efficiency. These fixed assignments typically improve the computational performance of the algorithm; however, they raise a new question: how best to determine the instances to be fixed, the seeds. In particular, we wish to select an initial seed for each cluster in such a way as to ensure that the seeds are well separated from each other.
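One simple way to obtain well-separated seeds is a farthest-first traversal, sketched below. This is a generic technique swapped in for illustration and should not be read as the seed-selection rule used in the paper; the function name, the choice of the first seed, and the toy data are all assumptions.

```python
import math

def farthest_first_seeds(points, k, first=0):
    """Return indices of k seeds chosen by farthest-first traversal.

    Starting from the point at index `first`, each new seed is the point
    whose distance to its nearest already-chosen seed is largest.
    """
    if not 1 <= k <= len(points):
        raise ValueError("k must be between 1 and the number of points")
    seeds = [first]
    # nearest[i] holds the distance from point i to its closest seed so far.
    nearest = [math.dist(p, points[first]) for p in points]
    while len(seeds) < k:
        nxt = max(range(len(points)), key=nearest.__getitem__)
        seeds.append(nxt)
        nearest = [
            min(d, math.dist(p, points[nxt])) for d, p in zip(nearest, points)
        ]
    return seeds

# Hypothetical 2-D data with three visually separate groups.
points = [(0, 0), (1, 0), (10, 0), (11, 0), (0, 8), (0, 9)]
print(farthest_first_seeds(points, 3))
```

Each selected index could then be fixed to its own cluster before the model is solved, which reduces the number of free assignment decisions and removes symmetric relabelings of the clusters.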

Illustrative example

In this part of the study, we applied the proposed algorithm to a set of 81 data points shown in Fig. 1. For the purpose of illustration, the data points are represented in a 2-dimensional space and the distance between any two points is calculated by the Euclidean distance measure.

We performed experiments on this synthetic data set, which consists of four distinct clusters, to show how the MIP-Diameter model may fail to reach an acceptable solution, to evaluate the accuracy and the performance of the

Evaluation of proposed models on a real data set

In this part of the study, the performance and accuracy of the proposed mathematical programming model and the proposed clustering algorithm are examined on a real data set. Some interpretations derived from the data set are given and the solution of the proposed algorithm is compared with the solution of the K-Means algorithm with respect to discussed clustering indicators and interpretability of the formed clusters. In later parts of this section, we report results of various computational

Conclusions

In this paper, we presented a mathematical programming based segmentation model and a heuristic clustering algorithm that are applied to a digital platform company’s customer database. The MIP-Diameter model forms clusters by minimizing the maximum diameter of the generated clusters. The model is nonhierarchical in the sense that the number of clusters is assumed to be known a priori. The accuracy of the proposed algorithm is compared with the results of the K-Means algorithm on an

Acknowledgements

We wish to thank Salih Eren, Devrim Melek Tunç and Kemal Özden at the Information Technologies Department of Digiturk for providing the data for the experiments. We also thank the anonymous referees for their many helpful suggestions.

References (21)

  • A. Likas et al., The global K-Means clustering algorithm, Pattern Recognition (2003)
  • C. Rygielski et al., Data mining techniques for customer relationship management, Technology in Society (2002)
  • B.S. Everitt et al., Cluster Analysis (2001)
  • B. Padmanabhan et al., On the use of optimization for data mining: Theoretical interactions and eCRM opportunities, Management Science (2003)
  • D.S. Hochbaum et al., A unified approach to approximation algorithms for bottleneck problems, Journal of the Association for Computing Machinery (1986)
  • G. Diehr, Evaluation of a branch and bound algorithm for clustering, SIAM Journal on Scientific and Statistical Computing (1985)
  • G.P. Babu et al., A near-optimal initial seed value selection in K-means algorithm using a genetic algorithm, Pattern Recognition Letters (1993)
  • H.D. Vinod, Integer programming and the theory of grouping, Journal of the American Statistical Association (1969)
  • H.R. Nemati et al., Enhancing enterprise decisions through organizational data mining, Journal of Computer Information Systems, Summer (2002)
  • I.H. Witten et al., Data Mining: Practical Machine Learning Tools and Techniques with Java Implementation (2000)
There are more references available in the full text version of this article.
