Churn prediction in telecom using Random Forest and PSO based data balancing in combination with various feature selection strategies

https://doi.org/10.1016/j.compeleceng.2012.09.001

Abstract

The telecommunication industry faces fierce competition to retain customers and therefore requires an efficient churn prediction model to monitor customer churn. The enormous size, high dimensionality and imbalanced nature of telecommunication datasets are the main hurdles in attaining the desired churn prediction performance. In this study, we investigate the significance of a Particle Swarm Optimization (PSO) based undersampling method for handling the imbalanced data distribution, in combination with different feature reduction techniques such as Principal Component Analysis (PCA), Fisher’s ratio, F-score and Minimum Redundancy and Maximum Relevance (mRMR). Random Forest (RF) and K Nearest Neighbour (KNN) classifiers are employed to evaluate performance on the optimally sampled, feature-reduced dataset. Prediction performance is evaluated using sensitivity, specificity and Area Under the Curve (AUC) based measures. Finally, simulations show that our proposed approach based on PSO, mRMR and RF, termed Chr-PmRF, performs quite well in predicting churners and can therefore be beneficial for the highly competitive telecommunication industry.

Highlights

► The telecom industry faces fierce competition to retain customers.
► Enormous size, imbalanced datasets and high dimensionality make churn prediction in telecom a challenging problem.
► Our proposed approach, named Chr-PmRF, employs PSO based balancing, mRMR feature reduction and Random Forest as a classifier.
► Chr-PmRF efficiently predicts churners and might be beneficial for the highly competitive telecommunication industry.

Introduction

Telecommunication is one of the industries where the customer base plays a significant role in maintaining stable revenues, and serious attention is therefore devoted to retaining customers. Customers’ propensity to switch to another viable network varies for different reasons, such as call quality, more attractive pricing plans from competitors, billing problems, etc. The telecommunication industry constantly faces the threat of financial loss from potential churners. An efficient churn prediction model therefore not only secures revenues but also gives management hints for targeting potential churners by reducing market-relevant shortcomings. Hence, customer relationship management in a telecommunication company needs an efficient churn prediction model for predicting potential churners.

The efficiency of a churn prediction model based on a classification system relies on the learning acquired from the available dataset. An appropriately preprocessed dataset helps the classifier attain the required level of training, which ultimately translates into the desired performance. Telecommunication companies archive a large amount of information about their customers. Unfortunately, such data has high dimensionality and an imbalanced class distribution. Generally, information regarding demographics, contract nature, billing and payments, call details, service logs, etc. is maintained, which eventually leads to high dimensionality. Similarly, the number of churners in the telecommunication industry is usually far smaller than the number of non-churners, which results in an imbalanced dataset. This imbalanced distribution might cause weak learning by a classifier. Therefore, the preprocessing phase requires a proper sampling and feature reduction strategy for the classifier to learn well.

Principal Component Analysis (PCA) and Independent Component Analysis (ICA) [1] are widely used feature selection strategies, which operate linearly to select the useful and discriminating features present in a dataset. PCA is based on data covariance, while ICA uses higher order statistics to achieve data independence, along with reducing the dimensionality of the data. Similarly, some well-known sampling techniques are Random Oversampling (ROS) and Random Undersampling (RUS) [2], where instances of the minority class are duplicated and instances of the majority class are discarded, respectively. Because of the random selection involved in duplicating and discarding instances, these approaches lack consistency and show varying performance. In addition, RUS can discard some useful instances, and ROS can lead to overfitting owing to replication. One Sided Selection (OSS) removes noisy and borderline majority class instances, but it is slow on large datasets because it relies on Tomek Links [3], which are costly to compute. Cluster based oversampling identifies rare cases in the dataset and resamples the instances, but it is considered effective [4], [5] only for small training datasets. Synthetic Minority Oversampling Technique is an intelligent oversampling method in which new minority class samples are added synthetically, but it involves high computational cost [6] and is thus not suitable for large datasets. DataBoost-IM [7] is another sampling approach, in which the predictive occurrences of both minority and majority classes are increased using synthetic data generation; this approach also involves high computational cost and is therefore not appropriate for large datasets. Most sampling techniques either use random selection for undersampling, which introduces bias, or synthetically generate minority class samples, which is costly. Therefore, an optimized sampling technique can be employed to sample the dataset and effectively mitigate the imbalance in the data distribution.
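To make the two random baselines concrete, the following minimal Python sketch illustrates RUS and ROS on a toy dataset. The function names, the `seed` parameter and the toy data (90 non-churners, 10 churners) are assumptions made for illustration only and are not taken from the paper's experiments.

```python
import random
from collections import Counter


def split_indices(labels, majority):
    """Return (majority indices, minority indices) for a binary label list."""
    maj = [i for i, y in enumerate(labels) if y == majority]
    mino = [i for i, y in enumerate(labels) if y != majority]
    return maj, mino


def random_undersample(rows, labels, majority, seed):
    """RUS: keep every minority row and a random majority subset of equal size."""
    rng = random.Random(seed)
    maj, mino = split_indices(labels, majority)
    kept = sorted(mino + rng.sample(maj, len(mino)))
    return [rows[i] for i in kept], [labels[i] for i in kept]


def random_oversample(rows, labels, majority, seed):
    """ROS: keep every row and duplicate random minority rows until balanced."""
    rng = random.Random(seed)
    maj, mino = split_indices(labels, majority)
    extra = [rng.choice(mino) for _ in range(len(maj) - len(mino))]
    idx = list(range(len(labels))) + extra
    return [rows[i] for i in idx], [labels[i] for i in idx]


# Hypothetical toy data: label 0 = non-churner (majority), 1 = churner.
rows = [[float(i)] for i in range(100)]
labels = [0] * 90 + [1] * 10

rus_rows, rus_labels = random_undersample(rows, labels, majority=0, seed=1)
ros_rows, ros_labels = random_oversample(rows, labels, majority=0, seed=1)

print(Counter(rus_labels))  # class counts after undersampling
print(Counter(ros_labels))  # class counts after oversampling
```

Changing `seed` changes which majority instances survive RUS, which is the source of the run-to-run inconsistency noted above, while ROS adds no new information because every extra row is an exact copy of an existing minority row.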

Besides the appropriate feature selection and sampling techniques required to handle the imbalanced telecommunication dataset, the classification models are the actual tools that perform customer churn prediction. Researchers have used Decision Trees [8], [9], [10], Logistic Regression [10], [11], Genetic Programming [12], [26], Neural Networks [13], [14], [15], [16], Random Forest [17], AdaBoost [19] and Naive Bayes algorithms [11] for various classification problems, including churn prediction. Some techniques have also used nonlinear kernel methods in Support Vector Machines for churn prediction, but they suffer from the high dimensionality of the dataset [8]. Other classification models such as SVM [20], [27] and KNN [11] also show deteriorated performance in telecommunication churn prediction because of the imbalanced nature of the dataset [11]. Although some approaches based on an ensemble of KNN and logistic regression [18], additive groves with multiple counts feature evaluation [19] and hybrid two-phase feature selection [20] have been suggested, these classification models could not achieve the needed performance. Such ensemble approaches primarily curtail the data dimensionality by selecting features and introduce data balancing in due course, but the classification performance suffers from the loss of information caused by improper sampling and feature reduction methods.

Realizing the challenges posed to customer churn prediction by the large size, high dimensionality and imbalanced nature of the telecommunication dataset, we initially analyzed RUS and PSO based [23] undersampling methods separately. The PSO based undersampling method first subsamples the dataset and then evaluates each subsample with KNN and Random Forest on the basis of AUC. Once an optimal subsample is selected, PCA, F-score, Fisher’s ratio and mRMR are applied separately and analyzed with the RF and KNN classifiers. We finally observe that our proposed approach based on PSO, mRMR and RF, termed Chr-PmRF, provides the best results among the combinations of sampling, feature reduction and classification techniques considered.
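The subsample search described above can be sketched as a binary PSO over an inclusion mask for the majority-class instances. This is an illustrative sketch under stated assumptions, not the authors' implementation: the swarm settings (`w`, `c1`, `c2`, swarm size, iteration count), the sigmoid bit-update rule, the 1-nearest-neighbour scorer used as fitness, and the one-dimensional toy data are all hypothetical, whereas the paper evaluates subsamples with KNN and Random Forest. The mask here is also free to keep any number of majority instances; a target subsample size would have to be enforced separately.

```python
import math
import random


def auc(scores, labels):
    """Pairwise AUC: chance that a churner (1) outscores a non-churner (0)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def knn_fitness(mask, maj, mino, val_x, val_y):
    """Assumed fitness: validation AUC of a 1-NN style score computed from
    the minority rows plus the majority rows switched on by `mask`."""
    chosen = [x for x, bit in zip(maj, mask) if bit]
    if not chosen:
        return 0.0
    scores = [min(abs(x - m) for m in chosen) - min(abs(x - m) for m in mino)
              for x in val_x]
    return auc(scores, val_y)


def pso_undersample(n_bits, fitness, n_particles=12, n_iter=30, seed=0,
                    w=0.7, c1=1.5, c2=1.5):
    """Binary PSO: each particle is a 0/1 mask over majority instances."""
    rng = random.Random(seed)
    pos = [[rng.randint(0, 1) for _ in range(n_bits)]
           for _ in range(n_particles)]
    vel = [[0.0] * n_bits for _ in range(n_particles)]
    pbest = [p[:] for p in pos]
    pbest_fit = [fitness(p) for p in pos]
    g = max(range(n_particles), key=pbest_fit.__getitem__)
    gbest, gbest_fit = pbest[g][:], pbest_fit[g]
    for _ in range(n_iter):
        for i in range(n_particles):
            for d in range(n_bits):
                v = (w * vel[i][d]
                     + c1 * rng.random() * (pbest[i][d] - pos[i][d])
                     + c2 * rng.random() * (gbest[d] - pos[i][d]))
                vel[i][d] = max(-4.0, min(4.0, v))  # clamp the velocity
                # Sigmoid of the velocity gives the probability of bit = 1.
                prob = 1.0 / (1.0 + math.exp(-vel[i][d]))
                pos[i][d] = 1 if rng.random() < prob else 0
            f = fitness(pos[i])
            if f > pbest_fit[i]:
                pbest[i], pbest_fit[i] = pos[i][:], f
                if f > gbest_fit:
                    gbest, gbest_fit = pos[i][:], f
    return gbest, gbest_fit


# Hypothetical 1-D toy data: non-churners around 0, churners around 3.
data_rng = random.Random(42)
maj = [data_rng.gauss(0.0, 1.0) for _ in range(40)]
mino = [data_rng.gauss(3.0, 1.0) for _ in range(8)]
val_x = ([data_rng.gauss(0.0, 1.0) for _ in range(30)]
         + [data_rng.gauss(3.0, 1.0) for _ in range(10)])
val_y = [0] * 30 + [1] * 10


def fit(mask):
    return knn_fitness(mask, maj, mino, val_x, val_y)


mask, best_auc = pso_undersample(len(maj), fit)
print(sum(mask), "majority instances kept; validation AUC:", best_auc)
```

Because the global best is only replaced when a particle scores strictly higher, the returned AUC never falls below the best mask in the initial random swarm, which is the sense in which the search optimizes the subsample rather than drawing it blindly as RUS does.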

The rest of the manuscript first presents the proposed churn prediction approach in Section 2. Next, Section 3 analyzes the simulated results and gives corresponding discussions. Finally, the conclusions are drawn in Section 4.

Section snippets

Material and methods

Telecommunication datasets generally suffer from skewed data distribution and high dimensionality. This causes classification algorithms to perform poorly in customer churn prediction. Therefore, in the Chr-PmRF approach, we concentrate on handling these problems. The basic block diagram shown in Fig. 1 highlights the various steps involved in Chr-PmRF.

We initially preprocess the dataset in order to handle the problems of missing values and nominal values present in the dataset. RUS

Proposed Chr-PmRF approach

Among the various combinations of sampling, feature selection and classification methodologies employed, we have observed that PSO based undersampling in combination with mRMR based feature selection and the RF classifier yields the best churn prediction results. Therefore, in what follows, we focus on this particular combination, denoted Chr-PmRF. Our proposed Chr-PmRF efficiently utilizes a PSO based undersampling method, which not only undersamples the dataset but also optimizes chosen
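As an illustration of the mRMR step, the sketch below performs greedy mRMR selection on discrete features using the common difference criterion: relevance to the class minus mean redundancy with the features already chosen, both measured by mutual information. The exact mRMR variant and any discretization used in the paper are not given in this excerpt, so the criterion, the feature names and the toy data are assumptions.

```python
import math
from collections import Counter


def mutual_info(a, b):
    """Mutual information (in nats) between two discrete sequences."""
    n = len(a)
    pa, pb, pab = Counter(a), Counter(b), Counter(zip(a, b))
    return sum((c / n) * math.log(c * n / (pa[x] * pb[y]))
               for (x, y), c in pab.items())


def mrmr(columns, target, k):
    """Greedy mRMR: repeatedly add the feature maximizing
    I(feature; target) - mean I(feature; already selected feature)."""
    relevance = {name: mutual_info(col, target)
                 for name, col in columns.items()}
    selected = []
    while len(selected) < min(k, len(columns)):
        def score(name):
            if not selected:
                return relevance[name]
            redundancy = sum(mutual_info(columns[name], columns[s])
                             for s in selected)
            return relevance[name] - redundancy / len(selected)

        candidates = [name for name in columns if name not in selected]
        selected.append(max(candidates, key=score))
    return selected


# Hypothetical toy data: churn happens only when both flags are set.
intl_plan = [0, 0, 0, 0, 1, 1, 1, 1]
voicemail = [0, 0, 1, 1, 0, 0, 1, 1]
churn = [p & v for p, v in zip(intl_plan, voicemail)]

columns = {
    "intl_plan": intl_plan,
    "intl_plan_dup": list(intl_plan),  # exact duplicate: fully redundant
    "voicemail": voicemail,
}

selected = mrmr(columns, churn, k=2)
print(selected)
```

All three features are equally relevant to `churn` in this toy setup, so a ranking by relevance alone could return the duplicated pair; the redundancy term is what steers the second pick toward `voicemail`.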

Results and discussion

The proposed Chr-PmRF approach is validated through comprehensive experimentation employing various combinations of sampling, feature selection and classification methodologies. 10-fold cross-validation is adopted, and the performance attained during the experimentation is analyzed using AUC, sensitivity and specificity based measures.
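For reference, the three measures and a 10-fold split can be sketched in a few lines of Python. The helper names, the 0.5 decision threshold, the toy scores and the choice of a stratified, unshuffled split are assumptions for illustration; the excerpt does not state how the folds were formed.

```python
def sensitivity_specificity(y_true, y_pred):
    """Sensitivity = TP / (TP + FN); specificity = TN / (TN + FP).
    Label 1 marks a churner, label 0 a non-churner."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    return tp / (tp + fn), tn / (tn + fp)


def auc(scores, labels):
    """Pairwise AUC: fraction of (churner, non-churner) pairs in which the
    churner receives the higher score; ties count as half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def stratified_folds(labels, k=10):
    """Deal the indices of each class round-robin into k folds so that every
    fold keeps roughly the overall churner ratio (no shuffling here)."""
    folds = [[] for _ in range(k)]
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        for j, i in enumerate(idx):
            folds[j % k].append(i)
    return folds


# Hypothetical classifier scores for 3 churners and 5 non-churners.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.3, 0.7, 0.4, 0.2, 0.1, 0.05]
y_pred = [1 if s >= 0.5 else 0 for s in scores]

sens, spec = sensitivity_specificity(y_true, y_pred)
area = auc(scores, y_true)
print(sens, spec, area)  # 2/3, 4/5 and 13/15 for this toy example

# 10 folds over an imbalanced label vector: 90 non-churners, 10 churners.
cv_labels = [0] * 90 + [1] * 10
folds = stratified_folds(cv_labels, k=10)
```

Sensitivity and specificity depend on the chosen threshold, whereas AUC summarizes the ranking over all thresholds, which is one reason to report the three together on imbalanced data.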

Conclusions

This work supports the claim that appropriate preprocessing and a proper data distribution are vital for classification. The PSO based optimal sampling approach not only undersamples the data but also optimizes the sample selection on the basis of the AUC measure, to attain better classification performance. The discriminating power of the optimally selected samples is further explored by employing appropriate feature selection strategies. Where mRMR returns a

Acknowledgement

This work is supported by the Higher Education Commission of Pakistan (HEC) as per Award No. 17-5-6 (Ps6-002)/HEC/Sch/2010

References (29)

  • H. Guo et al. Learning from imbalanced data sets with boosting and data generation: the DataBoost-IM approach. ACM SIGKDD Explorat Newslett (2004)
  • Guyon I, Lemaire V, Boulle M, Dror G, Vogel D. Analysis of kddcup2009: Fast scoring on a large orange customer...
  • Huang B Q, Kechadi M-T, Buckley B. Customer churn prediction for broadband internet services. In: Proceedings of the...
  • J. Hadden et al. Computer assisted customer churn management: state-of-the-art and future trends. Comput Oper Res (2007)

Adnan Idris received his M.S. degree in Computer System Engineering from GIK Institute of Engineering Sciences and Technology, Topi, Pakistan, in 2006. Prior to that, he earned his master’s degree in software engineering from COMSATS Institute of I.T, Islamabad, in 2002. He has 7 years of research and teaching experience at university level. He is currently pursuing a PhD at Pakistan Institute of Eng. & Applied Sciences, Islamabad. His research areas include Customer Churn Prediction, Machine Learning and Evolutionary algorithms.

Muhammad Rizwan completed his B.S. (CIS) degree at Pakistan Institute of Engineering and Applied Sciences, Islamabad. His research interests include computer programming, Machine Learning and Pattern Recognition.

Asifullah Khan received his M.S. and Ph.D. degrees in Computer Systems Engineering from GIK Institute of Engineering Sciences and Technology, Topi, Pakistan, in 2003 and 2006, respectively. He spent two years as a Post-Doc Research Fellow at the Department of Mechatronics, GIST, South Korea. He is currently working as an Associate Professor in the Department of Computer and Information Sciences at PIEAS. His research areas include Digital Watermarking, Pattern Recognition, Image Processing, Evolutionary Algorithms, Bioinformatics, Machine Learning, and Computational Materials Science.

Reviews processed and approved for publication by Editor-in-Chief Dr. Manu Malek.
