A Comparison of Two Oversampling Techniques (SMOTE vs MTDF) for Handling Class Imbalance Problem: A Case Study of Customer Churn Prediction

Amin, Adnan; Rahim, Faisal; Ali, Imtiaz; Khan, Changez; Anwar, Sajid

doi:10.1007/978-3-319-16486-1_22

Part of the book series: Advances in Intelligent Systems and Computing (AISC, volume 353)

Abstract

Predicting customer behavior is of great importance to a project manager. Data-driven industries such as telecommunications take advantage of various data mining techniques to extract meaningful information about customers' future behavior. However, the prediction accuracy of these techniques is significantly affected when real-world data is highly imbalanced. In this study, we investigate and compare the predictive performance of two well-known oversampling techniques, Synthetic Minority Oversampling Technique (SMOTE) and Megatrend Diffusion Function (MTDF), and four rule generation algorithms (Exhaustive, Genetic, Covering, and LEM2) based on rough set classification, using publicly available data sets. Useful feature extraction can play a vital role not only in improving classification performance but also in reducing computational cost and complexity by eliminating unnecessary features from the dataset. The Minimum Redundancy Maximum Relevance (mRMR) technique is therefore used in this study for feature extraction; it not only selects the best feature subset but also reduces the feature space. The results demonstrate the predictive performance of both oversampling techniques and of the rule generation algorithms, which will help decision makers and researchers select the most suitable one.
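
To make the oversampling idea concrete, the following is a minimal sketch of SMOTE-style interpolation, not the authors' implementation. It is a simplification of the published algorithm: the base sample is drawn at random rather than generating a fixed number of synthetic points per minority sample. The function name `smote_sketch`, its parameters, and the toy `churners` data are assumptions made for illustration only.

```python
import math
import random


def smote_sketch(minority, n_synthetic, k=5, seed=0):
    """Create synthetic minority samples by interpolating between neighbours.

    minority: list of equal-length numeric feature lists (minority class only).
    Hypothetical helper for illustration; not taken from the chapter.
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority samples to interpolate")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_synthetic):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Rank the other minority samples by Euclidean distance to the base.
        others = [p for j, p in enumerate(minority) if j != i]
        others.sort(key=lambda p: math.dist(base, p))
        neighbour = rng.choice(others[:k])
        # Place the new point at a random position on the segment
        # between the base sample and the chosen neighbour.
        gap = rng.random()
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


# Toy minority-class rows (made-up values, two features each).
churners = [[1.0, 2.0], [1.5, 1.8], [1.2, 2.4], [0.9, 2.1]]
new_rows = smote_sketch(churners, n_synthetic=6, k=3)
print(len(new_rows))
```

Because every synthetic row lies on a line segment between two existing minority rows, the new points stay inside the region already occupied by the minority class rather than duplicating existing rows exactly.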

Author information

Authors and Affiliations

Institute of Management Sciences, Hayatabad Phase 7, Peshawar, Pakistan, Zip Code: 25000
Adnan Amin, Faisal Rahim, Imtiaz Ali, Changez Khan & Sajid Anwar

Corresponding author

Correspondence to Adnan Amin.

Editor information

Editors and Affiliations

DEI/FCT, Universidade de Coimbra, Coimbra, Portugal
Alvaro Rocha
Campus de Campolide, ISEGI, Lisboa, Portugal
Ana Maria Correia
DEIS, Università della Calabria, Arcavacata di Rende, Italy
Sandra Costanzo
DIS, Universidade do Minho, Guimarães, Portugal
Luis Paulo Reis

Cite this paper

Amin, A., Rahim, F., Ali, I., Khan, C., Anwar, S. (2015). A Comparison of Two Oversampling Techniques (SMOTE vs MTDF) for Handling Class Imbalance Problem: A Case Study of Customer Churn Prediction. In: Rocha, A., Correia, A., Costanzo, S., Reis, L. (eds) New Contributions in Information Systems and Technologies. Advances in Intelligent Systems and Computing, vol 353. Springer, Cham. https://doi.org/10.1007/978-3-319-16486-1_22

DOI: https://doi.org/10.1007/978-3-319-16486-1_22
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-16485-4
Online ISBN: 978-3-319-16486-1
eBook Packages: Computer Science, Computer Science (R0)
