A Comparison of Two Oversampling Techniques (SMOTE vs MTDF) for Handling Class Imbalance Problem: A Case Study of Customer Churn Prediction

  • Conference paper
New Contributions in Information Systems and Technologies

Part of the book series: Advances in Intelligent Systems and Computing (AISC, volume 353)

Abstract

Predicting customer behavior is of great importance to a project manager. Data-driven industries such as telecommunications take advantage of various data mining techniques to extract meaningful information about customers' future behavior. However, the prediction accuracy of these techniques degrades significantly when the real-world data is highly imbalanced. In this study, we investigate and compare the predictive performance of two well-known oversampling techniques, the Synthetic Minority Oversampling Technique (SMOTE) and the Megatrend Diffusion Function (MTDF), and of four rule-generation algorithms (Exhaustive, Genetic, Covering, and LEM2) based on rough set classification, using publicly available data sets. Useful feature selection can play a vital role not only in improving classification performance but also in reducing computational cost and complexity by eliminating unnecessary features from the dataset. The Minimum Redundancy Maximum Relevance (mRMR) technique is therefore used in the proposed study; it selects the best feature subset and thereby reduces the feature space. The results show the predictive performance of both oversampling techniques and of the rule-generation algorithms, which can help decision makers and researchers select the most suitable one.
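
To illustrate the kind of oversampling the abstract refers to, the following is a minimal sketch of the core SMOTE idea: each synthetic minority sample is placed at a random position on the line segment between a minority sample and one of its k nearest minority-class neighbours. It is not the authors' implementation. The function name `smote`, its parameters, and the toy data are assumptions made for illustration, and the base sample is drawn at random rather than by looping over every minority sample as the original algorithm does.

```python
import math
import random


def smote(minority, n_synthetic, k=5, seed=0):
    """Return n_synthetic points interpolated between minority samples
    and their k nearest minority-class neighbours (Euclidean distance).

    Hypothetical helper for illustration; simplified from the original SMOTE.
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)  # cannot have more neighbours than other samples
    synthetic = []
    for _ in range(n_synthetic):
        i = rng.randrange(len(minority))
        base = minority[i]
        # k nearest neighbours of the base sample among the other minority samples
        others = [p for j, p in enumerate(minority) if j != i]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


# Toy minority class with two numeric features (made-up values).
minority = [[1.0, 2.0], [1.5, 1.8], [2.0, 2.2], [1.2, 2.5]]
new_points = smote(minority, n_synthetic=6, k=2)
```

Because every synthetic point is a convex combination of two existing minority samples, it always stays inside the range already spanned by the minority class. This sketch handles numeric features only.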

Author information

Corresponding author

Correspondence to Adnan Amin.


Copyright information

© 2015 Springer International Publishing Switzerland

About this paper

Cite this paper

Amin, A., Rahim, F., Ali, I., Khan, C., Anwar, S. (2015). A Comparison of Two Oversampling Techniques (SMOTE vs MTDF) for Handling Class Imbalance Problem: A Case Study of Customer Churn Prediction. In: Rocha, A., Correia, A., Costanzo, S., Reis, L. (eds) New Contributions in Information Systems and Technologies. Advances in Intelligent Systems and Computing, vol 353. Springer, Cham. https://doi.org/10.1007/978-3-319-16486-1_22

  • DOI: https://doi.org/10.1007/978-3-319-16486-1_22

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-16485-4

  • Online ISBN: 978-3-319-16486-1

  • eBook Packages: Computer Science, Computer Science (R0)
