
Imbalanced classification in sparse and large behaviour datasets

Published in: Data Mining and Knowledge Discovery

Abstract

Recent years have witnessed a growing number of publications dealing with the imbalanced learning problem. While a plethora of techniques have been investigated on traditional low-dimensional data, little is known about their effect on behaviour data. This kind of data reflects the fine-grained behaviours of individuals or organisations and is characterized by sparseness and very large dimensions. In this article, we investigate the effects of several over- and undersampling, cost-sensitive learning and boosting techniques on the problem of learning from imbalanced behaviour data. Oversampling techniques show good overall performance and do not seem to suffer from the overfitting that traditional studies report. A variety of undersampling approaches are investigated as well and reveal the performance-degrading effect of instances showing odd behaviour. Furthermore, the boosting process indicates that the regularization parameter in the SVM formulation acts as a weakness indicator and that a combination of weak learners can often achieve better generalization than a single strong learner. Finally, the EasyEnsemble technique is presented as the method outperforming all others. By randomly sampling several balanced subsets, feeding them to a boosting process and subsequently combining their hypotheses, a classifier is obtained that achieves noise/outlier reduction effects and simultaneously explores the majority class space efficiently. Furthermore, the method is very fast, since it is parallelizable and each subset is only twice as large as the minority class size.



Notes

  1. In Sect. 5 we will investigate the effect of several base learners.

  2. The last requirement is not strictly necessary when we talk about behaviour data in our study (the sparseness and high-dimensionality properties are sufficient).

  3. In this sparse setting, instance removal or feature selection are in a certain sense equivalent to one another.

  4. Also called support values.

  5. In Sect. 5 we will consider different types of base learners.

  6. We provide traditional measures (sensitivity, specificity, G-means, F-measure) in our online repository:  http://www.applieddatamining.com/cms/?q=software.

  7. For each majority class instance, we no longer need to sort the similarities with all minority instances in determining the K largest values.
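The shortcut in this footnote (finding the K largest similarity values without sorting the whole row) can be illustrated with a selection-based partition; the helper name is ours:

```python
import numpy as np

def top_k_mean(sim_row, k):
    """Mean of the k largest similarities without a full sort.

    np.partition runs in O(n) expected time, versus O(n log n)
    for sorting the entire row of similarities.
    """
    if k >= sim_row.size:
        return sim_row.mean()
    part = np.partition(sim_row, sim_row.size - k)  # k largest end up last
    return part[-k:].mean()
```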

  8. This subject is more commonly known as community detection in bipartite graphs.

  9. Modularity-based approaches attempt to optimize a quality function known as modularity for finding community structures in networks and rely on the use of heuristics due to the complexity of the problem.

  10. Note that they also made use of the flow-based algorithm Infomap (Rosvall and Bergstrom 2008) that shows excellent results on the LFR-benchmark.

  11. The distinction between weak/strong learners is loosely ‘defined’ in Schapire (1999). A weak learner corresponds to a hypothesis that performs just slightly better than random guessing; a strong learner is able to generate a hypothesis with an arbitrarily low error rate, given enough data. We adopt these definitions but base the weak/strong distinction on training set error. In an SVM context, it is quite typical that error levels on training data drop with increasing C-values (Suykens et al. 2002). A learner that is ‘too strong’ means that, even though its performance on training data is very high, it fails to generalize well and the test set error increases due to overfitting.
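The C-dependent weak/strong behaviour described in this footnote can be imitated in a small sketch. We substitute L2-regularized logistic regression trained by gradient descent for the SVM (purely to keep the example self-contained); larger C means weaker regularization and hence a ‘stronger’ learner on training data:

```python
import numpy as np

def _sigmoid(z):
    return 1.0 / (1.0 + np.exp(-np.clip(z, -30, 30)))

def train_logreg(X, y, C=1.0, lr=0.1, steps=2000):
    """L2-regularised logistic regression by gradient descent; y in {-1, +1}.

    Larger C -> weaker regularisation -> a 'stronger' learner in the sense
    of footnote 11: lower training error, higher risk of overfitting.
    """
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(steps):
        margins = y * (X @ w)
        grad = -(X * (y * _sigmoid(-margins))[:, None]).mean(axis=0) + w / (C * n)
        w -= lr * grad
    return w

def train_error(w, X, y):
    return float(np.mean(np.sign(X @ w) != y))
```

On separable data, a large C drives the training error to zero while a small C keeps the weight vector shrunk, which is the ‘weak’ regime the footnote describes.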

  12. Flickr and KDD will be excluded from the comparative study of Sect. 4.6, because some methods are too computationally intensive (especially in combination with the large number of possible parameter combinations) to be applied to these very large data sources. Furthermore, our statistical evidence is already sufficiently strong to conclude significance without these datasets. Having said this, these data sources will be included in the analysis of Sect. 5.

  13. MovieLens 1M dataset from http://grouplens.org/datasets/movielens/.

  14. MovieLens 10M dataset from http://grouplens.org/datasets/movielens/.

  15. https://webscope.sandbox.yahoo.com/.

  16. http://www.bigdatalab.ac.cn/benchmark/bm/dd?data=Ta-Feng.

  17. Active features represent features that are present for at least one instance in the dataset. A non-active feature corresponds with a column of zeros in the matrix representation and would not contribute to the model.
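A minimal illustration of the active-feature notion, assuming a dense numpy matrix for brevity (the behaviour datasets in the paper would be stored sparsely):

```python
import numpy as np

def active_feature_mask(X):
    """Boolean mask of 'active' features: columns with at least one non-zero
    entry. An inactive column contributes nothing to a linear model."""
    return (X != 0).any(axis=0)

X = np.array([[1, 0, 0, 2],
              [0, 0, 3, 0]])
mask = active_feature_mask(X)   # column 1 is all-zero, hence inactive
X_active = X[:, mask]           # drops the inactive column
```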

  18. This means that if we start from a balanced set, only the training data will show artificial imbalance according to the imbalance ratio p. The validation and test data would remain balanced. Since AUC (and some other metrics) is independent of class skew, it would be unwise to make these sets imbalanced as well because that would lead to discarding minority class instances that are relevant for performance assessment.
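The splitting scheme in this footnote can be sketched as follows; we assume p denotes the majority-to-minority ratio in the training split, and the helper name is ours:

```python
import numpy as np

def make_train_imbalanced(min_idx, maj_idx, p, rng=None):
    """Subsample the minority class of the *training* split only, so that
    majority:minority is roughly p. Validation/test splits stay balanced:
    AUC is insensitive to class skew, and discarding minority instances
    there would only weaken the performance estimate."""
    rng = np.random.default_rng(rng)
    keep = max(1, len(maj_idx) // p)
    return rng.choice(min_idx, size=keep, replace=False), np.asarray(maj_idx)
```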

  19. We sometimes make use of T as a symbol to indicate the boosting iteration number instead of t.

  20. The SVM regularization parameter C controls for overfitting. Additionally, the influence of majority class noise/outliers (see Sect. 3.2) on the learned hyperplane decreases as we oversample the minority class. This hyperplane is more sensitive towards minority instances and a change in its direction/orientation towards these majority class noise/outliers will be more heavily penalized.

  21. If \(\alpha _i=0\), then \(y_i(w^T x_i+b) \ge 1 \). For noise/outliers, the term \(y_i(w^T x_i+b)\) is negative, hence \(\alpha _i \ne 0\).

  22. A tie occurs when the absolute difference in AUC is smaller than or equal to 0.5.

  23. The BL technique trains single SVMs on the imbalanced training data.

  24. The larger the number of comparisons, the higher the proportion of null hypotheses that are wrongly rejected due to random chance.

  25. \(\alpha ^{comp}\) adjusts the value of \(\alpha \) to compensate for multiple comparisons.

  26. This is not entirely true, since Holm’s method requires Far_Knn to pass the test first before proceeding, which only occurs at the 0.088662 significance level.
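Holm's step-down procedure referred to in footnotes 24–26 can be sketched as follows (plain Python; the function name is ours):

```python
def holm_reject(p_values, alpha=0.05):
    """Holm's step-down procedure: test p-values from smallest to largest
    against alpha / (m - i); stop at the first failure, after which all
    remaining null hypotheses are retained."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, i in enumerate(order):
        if p_values[i] <= alpha / (m - rank):
            reject[i] = True
        else:
            break  # a failed test blocks all larger p-values
    return reject
```

This sequential blocking is exactly why, in footnote 26, a later hypothesis can only be tested once Far_Knn passes its own (stricter) threshold first.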

  27. For an accessible introduction, see the chapter on “Generative and Discriminative Classifiers: Naive Bayes and Logistic Regression” provided at the following link: http://www.cs.cmu.edu/~tom/NewChapters.html.

  28. This statement only relates to the EE-version. In the baseline case, the timings of the NB-version are roughly of the same order of magnitude as those of SVM and LR (though this heavily depends on the chosen regularization parameter C). With EE, each subset has a size limited to twice the minority class training size (a relatively low number compared to the size of the majority class). This works in favour of SVM/LR (e.g. SVM has a computational complexity of O(\(m^2\)), with m the number of training instances). Also note that the NB-version is indeed optimized for sparse data. Internally, the NB-implementation writes a fairly large model file to a specified directory (sizes of 1 GB can be encountered there). This writing operation slows down the training stage (especially in the case of EE, where \(S \times T = 15 \times 20 = 300\) model files are constructed).

References

  • Akbani R, Kwek S, Japkowicz N (2004) Applying support vector machines to imbalanced datasets. In: Machine learning: ECML 2004: 15th European conference on machine learning, Pisa, Italy, September 20–24, 2004. Proceedings, Springer, Berlin, pp 39–50. doi:10.1007/978-3-540-30115-8_7

  • Ali A, Shamsuddin SM, Ralescu AL (2015) Classification with class imbalance problem: a review. Int J Adv Soft Comput Appl 7(3):176–204

  • Alzahrani T, Horadam KJ (2016) Community detection in bipartite networks: algorithms and case studies. In: Complex systems and networks: dynamics, controls and applications. Springer, Berlin, pp 25–50. doi:10.1007/978-3-662-47824-0_2

  • Bachner J (2013) Predictive policing: preventing crime with data and analytics. IBM Center for the Business of Government

  • Baesens B, Van Gestel T, Viaene S, Stepanova M, Suykens J, Vanthienen J (2003) Benchmarking state-of-the-art classification algorithms for credit scoring. J Oper Res Soc 54(6):627–635. doi:10.1057/palgrave.jors.2601545

  • Barandela R, Sánchez JS, García V, Rangel E (2003) Strategies for learning in class imbalance problems. Pattern Recognit 36(3):849–851. doi:10.1016/S0031-3203(02)00257-1

  • Barber MJ (2007) Modularity and community detection in bipartite networks. Phys Rev E 76:066102. doi:10.1103/PhysRevE.76.066102

  • Barua S, Islam MM, Yao X, Murase K (2014) MWMOTE-majority weighted minority oversampling technique for imbalanced data set learning. IEEE Trans Knowl Data Eng 26(2):405–425. doi:10.1109/TKDE.2012.232

  • Batista GEAPA, Prati RC, Monard MC (2004) A study of the behavior of several methods for balancing machine learning training data. SIGKDD Explor Newsl 6(1):20–29. doi:10.1145/1007730.1007735

  • Beckett SJ (2016) Improved community detection in weighted bipartite networks. R Soc Open Sci 3(1). doi:10.1098/rsos.140536

  • Bekkar M, Djemaa HK, Alitouche TA (2013) Evaluation measures for models assessment over imbalanced data sets. J Inf Eng Appl 3(10):27–38

  • Bhattacharyya S, Jha S, Tharakunnel K, Westland JC (2011) Data mining for credit card fraud: a comparative study. Decis Support Syst 50(3):602–613. doi:10.1016/j.dss.2010.08.008

  • Blondel VD, Guillaume JL, Lambiotte R, Lefebvre E (2008) Fast unfolding of communities in large networks. J Stat Mech Theory Exp 10:P10008

  • Breiman L, Friedman J, Stone CJ, Olshen RA (1984) Classification and regression trees. Taylor & Francis, London

  • Brozovsky L, Petricek V (2007) Recommender system for online dating service. In: Proceedings of Znalosti 2007 Conference, VSB, Ostrava

  • Cha M, Mislove A, Gummadi KP (2009) A measurement-driven analysis of information propagation in the Flickr social network. In: Proceedings of the 18th international conference on world wide web, ACM, New York. WWW ’09, pp 721–730. doi:10.1145/1526709.1526806

  • Chawla NV (2005) Data mining for imbalanced datasets: an overview. In: Maimon O, Rokach L (eds) Data mining and knowledge discovery handbook. Springer, Boston, pp 853–867

  • Chawla NV, Bowyer KW, Hall LO, Kegelmeyer WP (2002) SMOTE: synthetic minority over-sampling technique. J Artif Intell Res 16:321–357

  • Chawla NV, Lazarevic A, Hall LO, Bowyer KW (2003) Smoteboost: improving prediction of the minority class in boosting. In: Lavrač N, Gamberger D, Todorovski L, Blockeel H (eds) Knowledge discovery in databases: PKDD. Springer, Berlin, pp 107–119

  • Chawla NV, Japkowicz N, Kotcz A (2004) Editorial: special issue on learning from imbalanced data sets. SIGKDD Explor Newsl 6(1):1–6. doi:10.1145/1007730.1007733

  • Chen M, Mao S, Liu Y (2014) Big data: a survey. Mob Netw Appl 19(2):171–209. doi:10.1007/s11036-013-0489-0

  • Chyi YM (2003) Classification analysis techniques for skewed class distribution problems. Master thesis, Department of Information management, National Sun Yat-Sen University

  • Demšar J (2006) Statistical comparisons of classifiers over multiple data sets. J Mach Learn Res 7(Jan):1–30

  • Dhillon IS (2001) Co-clustering documents and words using bipartite spectral graph partitioning. In: Proceedings of the Seventh ACM SIGKDD international conference on knowledge discovery and data mining. ACM, New York, KDD ’01, pp 269–274. doi:10.1145/502512.502550

  • Drummond C, Holte RC (2003) C4.5, class imbalance, and cost sensitivity: why under-sampling beats over-sampling. In: Proceedings of the ICML ’03 Workshop on learning from imbalanced datasets

  • Fan RE, Chang KW, Hsieh CJ, Wang XR, Lin CJ (2008) LIBLINEAR: a library for large linear classification. J Mach Learn Res 9:1871–1874

  • Fan W, Stolfo SJ, Zhang J, Chan PK (1999) AdaCost: Misclassification cost-sensitive boosting. In: Proceedings of the sixteenth international conference on machine learning, Morgan Kaufmann Publishers Inc., San Francisco, ICML ’99, pp 97–105

  • Fawcett T (2006) An introduction to ROC analysis. Pattern Recognit Lett 27(8):861–874. doi:10.1016/j.patrec.2005.10.010

  • Finch H (2005) Comparison of distance measures in cluster analysis with dichotomous data. J Data Sci 3(1):85–100

  • Fortunato S (2010) Community detection in graphs. Phys Rep 486(3–5):75–174. doi:10.1016/j.physrep.2009.11.002

  • Frasca M, Bertoni A, Re M, Valentini G (2013) A neural network algorithm for semi-supervised node label learning from unbalanced data. Neural Netw 43:84–98. doi:10.1016/j.neunet.2013.01.021

  • Friedman M (1937) The use of ranks to avoid the assumption of normality implicit in the analysis of variance. J Am Stat Assoc 32(200):675–701

  • García E, Lozano F (2007) Boosting support vector machines. In: Machine learning and data mining in pattern recognition, 5th international conference, MLDM 2007, Leipzig, Germany, July 18–20, Post Proceedings, IBaI Publishing, pp 153–167

  • Goldstein M, Uchida S (2016) A comparative evaluation of unsupervised anomaly detection algorithms for multivariate data. PLoS ONE 11(4):1–31. doi:10.1371/journal.pone.0152173

  • González PC, Velásquez JD (2013) Characterization and detection of taxpayers with false invoices using data mining techniques. Expert Syst Appl 40(5):1427–1436. doi:10.1016/j.eswa.2012.08.051

  • Guimerà R, Sales-Pardo M, Amaral LAN (2007) Module identification in bipartite and directed networks. Phys Rev E 76:036102. doi:10.1103/PhysRevE.76.036102

  • Guo H, Viktor HL (2004) Learning from imbalanced data sets with boosting and data generation: the DataBoost-IM approach. SIGKDD Explor Newsl 6(1):30–39. doi:10.1145/1007730.1007736

  • Guo X, Yin Y, Dong C, Yang G, Zhou G (2008) On the class imbalance problem. In: 2008 fourth international conference on natural computation, IEEE, vol 4, pp 192–201. doi:10.1109/ICNC.2008.871

  • Han H, Wang WY, Mao BH (2005) Borderline-SMOTE: a new over-sampling method in imbalanced data sets learning. In: Huang D, Zhang X-P, Huang G-B (eds) Advances in intelligent computing. Springer, Berlin, pp 878–887

  • He H, Garcia EA (2009) Learning from imbalanced data. IEEE Trans Knowl Data Eng 21(9):1263–1284. doi:10.1109/TKDE.2008.239

  • He H, Bai Y, Garcia EA, Li S (2008) ADASYN: Adaptive synthetic sampling approach for imbalanced learning. In: 2008 IEEE international joint conference on neural networks (IEEE world congress on computational intelligence), IEEE, pp 1322–1328. doi:10.1109/IJCNN.2008.4633969

  • Holm S (1979) A simple sequentially rejective multiple test procedure. Scand J Stat 6(2):65–70

  • Hsu CW, Lin CJ (2002) A comparison of methods for multiclass support vector machines. IEEE Trans Neural Netw 13(2):415–425. doi:10.1109/72.991427

  • Huang A (2008) Similarity measures for text document clustering. In: Proceedings of the sixth New Zealand computer science research student conference (NZCSRSC2008). Christchurch, New Zealand, pp 49–56

  • Iman RL, Davenport JM (1980) Approximations of the critical region of the Friedman statistic. Commun Stat Theory Methods 9(6):571–595

  • Jo T, Japkowicz N (2004) Class imbalances versus small disjuncts. ACM SIGKDD Explor Newsl 6(1):40–49. doi:10.1145/1007730.1007737

  • Junqué de Fortuny E, Martens D, Provost F (2014a) Predictive modeling with big data: is bigger really better? Big Data 1(4):215–226. doi:10.1089/big.2013.0037

  • Junqué de Fortuny E, Stankova M, Moeyersoms J, Minnaert B, Provost F, Martens D (2014b) Corporate residence fraud detection. In: Proceedings of the 20th ACM SIGKDD international conference on knowledge discovery and data mining, ACM, New York, KDD ’14, pp 1650–1659. doi:10.1145/2623330.2623333

  • Jutla IS, Jeub LG, Mucha PJ (2011–2016) A generalized Louvain method for community detection implemented in MATLAB. http://netwiki.amath.unc.edu/GenLouvain

  • Kubat M, Matwin S (1997) Addressing the curse of imbalanced training sets: one-sided selection. In: Proceedings of the fourteenth international conference on machine learning. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, pp 179–186

  • Lancichinetti A, Fortunato S (2009) Community detection algorithms: a comparative analysis. Phys Rev E 80:056117. doi:10.1103/PhysRevE.80.056117

  • Larremore DB, Clauset A, Jacobs AZ (2014) Efficiently inferring community structure in bipartite networks. Phys Rev E 90:012805. doi:10.1103/PhysRevE.90.012805

  • Li J, Fine JP (2010) Weighted area under the receiver operating characteristic curve and its application to gene selection. J R Stat Soc Series C (Appl Stat) 59(4):673–692. doi:10.1111/j.1467-9876.2010.00713.x

  • Li X, Wang L, Sung E (2008) AdaBoost with SVM-based component classifiers. Eng Appl Artif Intell 21(5):785–795. doi:10.1016/j.engappai.2007.07.001

  • Lichman M (2013) UCI machine learning repository. http://archive.ics.uci.edu/ml

  • Liu W, Chawla S, Cieslak DA, Chawla NV (2010) A robust decision tree algorithm for imbalanced data sets. In: Proceedings of the tenth SIAM international conference on data mining, SIAM, Philadelphia, pp 766–777

  • Liu XY, Wu J, Zhou ZH (2009) Exploratory undersampling for class-imbalance learning. IEEE Trans Syst Man Cybern B (Cybern) 39(2):539–550. doi:10.1109/TSMCB.2008.2007853

  • Luts J, Ojeda F, Van de Plas R, De Moor B, Van Huffel S, Suykens JA (2010) A tutorial on support vector machine-based methods for classification problems in chemometrics. Anal Chim Acta 665(2):129–145. doi:10.1016/j.aca.2010.03.030

  • Macskassy SA, Provost F (2007) Classification in networked data: a toolkit and a univariate case study. J Mach Learn Res 8(May):935–983

  • Martens D, Provost F (2014) Explaining data-driven document classifications. MIS Q 38(1):73–100

  • Martens D, Provost F, Clark J, Junqué de Fortuny E (2016) Mining massive fine-grained behavior data to improve predictive analytics. MIS Q 40(4):869–888

  • Mazurowski MA, Habas PA, Zurada JM, Lo JY, Baker JA, Tourassi GD (2008) Training neural network classifiers for medical decision making: the effects of imbalanced datasets on classification performance. Neural Netw 21(2–3):427–436. doi:10.1016/j.neunet.2007.12.031

  • Mease D, Wyner AJ, Buja A (2007) Boosted classification trees and class probability/quantile estimation. J Mach Learn Res 8:409–439

  • Nemenyi P (1963) Distribution-free multiple comparisons. Dissertation, Princeton University

  • Newman MEJ, Girvan M (2004) Finding and evaluating community structure in networks. Phys Rev E 69:026113. doi:10.1103/PhysRevE.69.026113

  • Ng AY (2004) Feature selection, L1 vs. L2 regularization, and rotational invariance. In: Proceedings of the twenty-first international conference on machine learning, ACM, New York, NY, USA, ICML ’04, p 78. doi:10.1145/1015330.1015435

  • Ng AY, Jordan MI (2002) On discriminative vs. generative classifiers: a comparison of logistic regression and naive Bayes. In: Dietterich TG, Becker S, Ghahramani Z (eds) Advances in neural information processing systems 14. MIT Press, pp 841–848

  • Ngai E, Hu Y, Wong Y, Chen Y, Sun X (2011) The application of data mining techniques in financial fraud detection: a classification framework and an academic review of literature. Decis Support Syst 50(3):559–569. doi:10.1016/j.dss.2010.08.006

  • Platt JC (1999) Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods. In: Smola AJ, Bartlett P, Schoelkopf B, Schuurmans D (eds) Advances in large-margin classifiers. MIT Press, pp 61–74

  • Porter MA, Onnela JP, Mucha PJ (2009) Communities in networks. Not Am Math Soc 56(9):1082–1097

  • Provost F, Fawcett T (2013) Data science for business: what you need to know about data mining and data-analytic thinking. O’Reilly Media, Inc

  • Provost F, Dalessandro B, Hook R, Zhang X, Murray A (2009) Audience selection for on-line brand advertising: privacy-friendly social network targeting. In: Proceedings of the 15th ACM SIGKDD international conference on knowledge discovery and data mining, ACM, New York, NY, USA, KDD ’09, pp 707–716. doi:10.1145/1557019.1557098

  • Raskutti B, Kowalczyk A (2004) Extreme re-balancing for SVMs: a case study. SIGKDD Explor Newsl 6(1):60–69. doi:10.1145/1007730.1007739

  • Rosvall M, Bergstrom CT (2008) Maps of random walks on complex networks reveal community structure. Proc Natl Acad Sci 105(4):1118–1123. doi:10.1073/pnas.0706851105

  • Schapire RE (1999) A brief introduction to boosting. In: Proceedings of the 16th international joint conference on artificial intelligence—volume 2. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, IJCAI’99, pp 1401–1406

  • Schapire RE, Singer Y (1999) Improved boosting algorithms using confidence-rated predictions. Mach Learn 37(3):297–336. doi:10.1023/A:1007614523901

  • Shmueli G (2017) Analyzing behavioral big data: methodological, practical, ethical, and moral issues. Qual Eng 29(1):57–74. doi:10.1080/08982112.2016.1210979

  • Sobhani P, Viktor H, Matwin S (2015) Learning from imbalanced data using ensemble methods and cluster-based undersampling. In: New frontiers in mining complex patterns: third international workshop, NFMCP 2014, Held in Conjunction with ECML-PKDD 2014, Nancy, France, September 19, 2014, Revised Selected Papers, Springer International Publishing, Cham, pp 69–83. doi:10.1007/978-3-319-17876-9_5

  • Stankova M (2016) Classification within network data with a bipartite structure. Dissertation, University of Antwerp

  • Stankova M, Martens D, Provost F (2015) Classification over bipartite graphs through projection. Working papers 2015001, University of Antwerp, Faculty of Applied Economics

  • Sun Y, Kamel MS, Wong AK, Wang Y (2007) Cost-sensitive boosting for classification of imbalanced data. Pattern Recognit 40(12):3358–3378. doi:10.1016/j.patcog.2007.04.009

  • Suykens JA, Van Gestel T, De Brabanter J, De Moor B, Vandewalle J (2002) Least squares support vector machines. World Scientific, Singapore

  • Tang B, He H (2015) ENN: extended nearest neighbor method for pattern recognition [research frontier]. IEEE Comput Intell Mag 10(3):52–60. doi:10.1109/MCI.2015.2437512

  • Tang Y, Zhang YQ, Chawla NV, Krasser S (2009) SVMs modeling for highly imbalanced classification. IEEE Trans Syst Man Cybern B (Cybern) 39(1):281–288. doi:10.1109/TSMCB.2008.2002909

  • Tobback E, Moeyersoms J, Stankova M, Martens D (2016) Bankruptcy prediction for SMEs using relational data. Working paper 2016004, University of Antwerp, Faculty of Applied Economics

  • Verbeke W, Dejaeger K, Martens D, Hur J, Baesens B (2012) New insights into churn prediction in the telecommunication sector: a profit driven data mining approach. Eur J Oper Res 218(1):211–229. doi:10.1016/j.ejor.2011.09.031

  • Veropoulos K, Campbell I, Cristianini N (1999) Controlling the sensitivity of support vector machines. In: Proceedings of the international joint conference on artificial intelligence, Stockholm, Sweden (IJCAI99), pp 55–60

  • Whitrow C, Hand DJ, Juszczak P, Weston D, Adams NM (2009) Transaction aggregation as a strategy for credit card fraud detection. Data Min Knowl Discov 18(1):30–55. doi:10.1007/s10618-008-0116-z

  • Wickramaratna J, Holden SB, Buxton BF (2001) Performance degradation in boosting. In: Proceedings of the second international workshop on multiple classifier systems, Springer, London, UK, MCS ’01, pp 11–21

  • Yen SJ, Lee YS (2009) Cluster-based under-sampling approaches for imbalanced data distributions. Expert Syst Appl 36(3, Part 1):5718–5727. doi:10.1016/j.eswa.2008.06.108

  • Yu HF, Lo HY, Hsieh HP, Lou JK, McKenzie TG, Chou JW, Chung PH, Ho CH, Chang CF, Wei YH, et al (2010) Feature engineering and classifier ensemble for KDD Cup 2010. In: Proceedings of the KDD Cup 2010 workshop, pp 1–16

  • Zha H, He X, Ding C, Simon H, Gu M (2001) Bipartite graph partitioning and data clustering. In: Proceedings of the tenth international conference on information and knowledge management, ACM, New York, NY, USA, CIKM ’01, pp 25–32. doi:10.1145/502585.502591

  • Zhang J, Mani I (2003) kNN approach to unbalanced data distributions: a case study involving information extraction. In: Proceedings of the ICML’2003 workshop on learning from imbalanced datasets, Washington DC

  • Ziegler CN, McNee SM, Konstan JA, Lausen G (2005) Improving recommendation lists through topic diversification. In: Proceedings of the 14th international conference on world wide web, ACM, New York, NY, USA, WWW ’05, pp 22–32. doi:10.1145/1060745.1060754

Author information

Correspondence to Jellis Vanhoeyveld.

Additional information

Responsible editor: Hendrik Blockeel.

Appendices

Appendix A: Oversampling experiments

See Table 11.

Table 11 Oversampling experiments

Appendix B: Undersampling experiments

See Table 12.

Table 12 Undersampling experiments

Appendix C: Boosting experiments

This section presents the results of each of the boosting experiments from Sect. 4.5 on the remaining data sources. Each of the figures below depicts the average tenfold AUC-performance on test data (with \(\mu ~=~100\)) with respect to the number of boosting iterations for (left) AdaBoost (AB), AdaCost (AC) and EasyEnsemble (EE) with C chosen according to highest validation set AUC-performance (over all possible boosting rounds) and (right) AB and EE (with \(S=15\)) with varying C-levels (Figs. 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17).

Fig. 6 Mov_G(\(p=1\)) dataset
Fig. 7 Mov_Th(\(p=[]\)) dataset
Fig. 8 Yahoo_A(\(p=1\)) dataset
Fig. 9 Yahoo_A(\(p=25\)) dataset
Fig. 10 Yahoo_G(\(p=1\)) dataset
Fig. 11 Yahoo_G(\(p=25\)) dataset
Fig. 12 TaFeng(\(p=1\)) dataset
Fig. 13 Book(\(p=1\)) dataset
Fig. 14 LST(\(p=1\)) dataset
Fig. 15 Adver(\(p=[]\)) dataset
Fig. 16 Adver(\(p=1\)) dataset
Fig. 17 CRF(\(p=[]\)) dataset

Appendix D: Final comparison

See Fig. 18.

Fig. 18
figure 18

Average rank AUC versus average rank Time (see Table 9) across the large datasets from Table 2 (CRF and Bank). Regarding the former, an extra row EE_par is added having the same AUC-value as EE(\(S=15\)). Ranks for each dataset are subsequently obtained via the procedure outlined in Sect. 4.6.1. Points occurring in the upper-right region are preferred


Cite this article

Vanhoeyveld, J., Martens, D. Imbalanced classification in sparse and large behaviour datasets. Data Min Knowl Disc 32, 25–82 (2018). https://doi.org/10.1007/s10618-017-0517-y
