
Imbalanced classification in sparse and large behaviour datasets

Published in: Data Mining and Knowledge Discovery

Abstract

Recent years have witnessed a growing number of publications dealing with the imbalanced learning problem. While a plethora of techniques have been investigated on traditional low-dimensional data, little is known about their effect on behaviour data. This kind of data reflects the fine-grained behaviours of individuals or organisations and is characterized by sparseness and very large dimensions. In this article, we investigate the effects of several over- and undersampling, cost-sensitive learning and boosting techniques on the problem of learning from imbalanced behaviour data. Oversampling techniques show good overall performance and do not seem to suffer from the overfitting that traditional studies report. A variety of undersampling approaches are investigated as well and reveal the performance-degrading effect of instances showing odd behaviour. Furthermore, the boosting process indicates that the regularization parameter in the SVM formulation acts as a weakness indicator and that a combination of weak learners can often achieve better generalization than a single strong learner. Finally, the EasyEnsemble technique is presented as the method outperforming all others. By randomly sampling several balanced subsets, feeding them to a boosting process and subsequently combining their hypotheses, a classifier is obtained that achieves noise/outlier reduction effects and simultaneously explores the majority class space efficiently. Furthermore, the method is very fast, since it is parallelizable and each subset is only twice as large as the minority class size.



Notes

  1. In Sect. 5 we will investigate the effect of several base learners.

  2. The last requirement is not strictly necessary when we talk about behaviour data in our study (the sparseness and high-dimensionality properties are sufficient).

  3. In this sparse setting, instance removal or feature selection are in a certain sense equivalent to one another.

  4. Also called support values.

  5. In Sect. 5 we will consider different types of base learners.

  6. We provide traditional measures (sensitivity, specificity, G-means, F-measure) in our online repository:  http://www.applieddatamining.com/cms/?q=software.

  7. For each majority class instance, we no longer need to sort the similarities with all minority instances in determining the K largest values.
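The shortcut in this footnote (finding the K largest similarity values without sorting the whole row) can be illustrated with a selection-based partition; the helper name is ours:

```python
import numpy as np

def top_k_mean(sim_row, k):
    """Mean of the k largest similarities without a full sort.

    np.partition runs in O(n) expected time, versus O(n log n)
    for sorting the entire row of similarities.
    """
    if k >= sim_row.size:
        return sim_row.mean()
    part = np.partition(sim_row, sim_row.size - k)  # k largest end up last
    return part[-k:].mean()
```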

  8. This subject is more commonly known as community detection in bipartite graphs.

  9. Modularity-based approaches attempt to optimize a quality function known as modularity for finding community structures in networks and rely on the use of heuristics due to the complexity of the problem.

  10. Note that they also made use of the flow-based algorithm Infomap (Rosvall and Bergstrom 2008) that shows excellent results on the LFR-benchmark.

  11. The distinction between weak/strong learners is loosely ‘defined’ in Schapire (1999). A weak learner corresponds to a hypothesis that performs just slightly better than random guessing; a strong learner is able to generate a hypothesis with an arbitrarily low error rate, given enough data. We adopt these definitions but base the weak/strong distinction on training set error. In an SVM context, it is quite typical that error levels on training data drop with increasing C-values (Suykens et al. 2002). A learner that is ‘too strong’ means that, even though its performance on training data is very high, it fails to generalize well and the test set error increases due to overfitting.
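The C-dependent weak/strong behaviour described in this footnote can be imitated in a small sketch. We substitute L2-regularized logistic regression trained by gradient descent for the SVM (purely to keep the example self-contained); larger C means weaker regularization and hence a ‘stronger’ learner on training data:

```python
import numpy as np

def _sigmoid(z):
    return 1.0 / (1.0 + np.exp(-np.clip(z, -30, 30)))

def train_logreg(X, y, C=1.0, lr=0.1, steps=2000):
    """L2-regularised logistic regression by gradient descent; y in {-1, +1}.

    Larger C -> weaker regularisation -> a 'stronger' learner in the sense
    of footnote 11: lower training error, higher risk of overfitting.
    """
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(steps):
        margins = y * (X @ w)
        grad = -(X * (y * _sigmoid(-margins))[:, None]).mean(axis=0) + w / (C * n)
        w -= lr * grad
    return w

def train_error(w, X, y):
    return float(np.mean(np.sign(X @ w) != y))
```

On separable data, a large C drives the training error to zero while a small C keeps the weight vector shrunk, which is the ‘weak’ regime the footnote describes.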

  12. Flickr and KDD will be excluded from the comparative study of Sect. 4.6, because some methods are too computationally intensive (especially in combination with the large number of possible parameter combinations) to be applied to these very large data sources. Furthermore, our statistical evidence is already sufficiently strong to conclude significance without these datasets. Having said this, these data sources will be included in the analysis of Sect. 5.

  13. MovieLens 1M dataset from http://grouplens.org/datasets/movielens/.

  14. MovieLens 10M dataset from http://grouplens.org/datasets/movielens/.

  15. https://webscope.sandbox.yahoo.com/.

  16. http://www.bigdatalab.ac.cn/benchmark/bm/dd?data=Ta-Feng.

  17. Active features represent features that are present for at least one instance in the dataset. A non-active feature corresponds with a column of zeros in the matrix representation and would not contribute to the model.
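A minimal illustration of the active-feature notion, assuming a dense numpy matrix for brevity (the behaviour datasets in the paper would be stored sparsely):

```python
import numpy as np

def active_feature_mask(X):
    """Boolean mask of 'active' features: columns with at least one non-zero
    entry. An inactive column contributes nothing to a linear model."""
    return (X != 0).any(axis=0)

X = np.array([[1, 0, 0, 2],
              [0, 0, 3, 0]])
mask = active_feature_mask(X)   # column 1 is all-zero, hence inactive
X_active = X[:, mask]           # drops the inactive column
```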

  18. This means that if we start from a balanced set, only the training data will show artificial imbalance according to the imbalance ratio p. The validation and test data would remain balanced. Since AUC (and some other metrics) is independent of class skew, it would be unwise to make these sets imbalanced as well because that would lead to discarding minority class instances that are relevant for performance assessment.
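The splitting scheme in this footnote can be sketched as follows; we assume p denotes the majority-to-minority ratio in the training split, and the helper name is ours:

```python
import numpy as np

def make_train_imbalanced(min_idx, maj_idx, p, rng=None):
    """Subsample the minority class of the *training* split only, so that
    majority:minority is roughly p. Validation/test splits stay balanced:
    AUC is insensitive to class skew, and discarding minority instances
    there would only weaken the performance estimate."""
    rng = np.random.default_rng(rng)
    keep = max(1, len(maj_idx) // p)
    return rng.choice(min_idx, size=keep, replace=False), np.asarray(maj_idx)
```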

  19. We sometimes make use of T as a symbol to indicate the boosting iteration number instead of t.

  20. The SVM regularization parameter C controls for overfitting. Additionally, the influence of majority class noise/outliers (see Sect. 3.2) on the learned hyperplane decreases as we oversample the minority class. This hyperplane is more sensitive towards minority instances and a change in its direction/orientation towards these majority class noise/outliers will be more heavily penalized.

  21. If \(\alpha _i=0\), then \(y_i(w^T x_i+b) \ge 1 \). For noise/outliers, the term \(y_i(w^T x_i+b)\) is negative, hence \(\alpha _i \ne 0\).

  22. A tie occurs when the absolute difference in AUC is smaller than or equal to 0.5.

  23. The BL technique trains single SVMs on the imbalanced training data.

  24. The larger the number of comparisons, the higher the proportion of null hypotheses that are wrongly rejected due to random chance.

  25. \(\alpha ^{comp}\) adjusts the value of \(\alpha \) to compensate for multiple comparisons.

  26. This is not entirely true, since Holm’s method requires Far_Knn to pass the test first before proceeding, which only occurs at the 0.088662 significance level.
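Holm's step-down procedure referred to in footnotes 24–26 can be sketched as follows (plain Python; the function name is ours):

```python
def holm_reject(p_values, alpha=0.05):
    """Holm's step-down procedure: test p-values from smallest to largest
    against alpha / (m - i); stop at the first failure, after which all
    remaining null hypotheses are retained."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, i in enumerate(order):
        if p_values[i] <= alpha / (m - rank):
            reject[i] = True
        else:
            break  # a failed test blocks all larger p-values
    return reject
```

This sequential blocking is exactly why, in footnote 26, a later hypothesis can only be tested once Far_Knn passes its own (stricter) threshold first.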

  27. For an accessible introduction, see the chapter on “Generative and Discriminative Classifiers: Naive Bayes and Logistic Regression” provided at the following link: http://www.cs.cmu.edu/~tom/NewChapters.html.

  28. This statement only relates to the EE-version. In the baseline case, the timings of the NB-version are roughly of the same order of magnitude as those of SVM and LR (though this heavily depends on the chosen regularization parameter C). With EE, each subset has a size limited to twice the minority class training size (a relatively low number compared to the size of the majority class). This works in favour of SVM/LR (e.g. SVM has a computational complexity of O(\(m^2\)), with m the number of training instances). Also note that the NB-version is indeed optimized for sparse data. Internally, the NB-implementation writes a fairly large model file to a specified directory (sizes of 1 GB can be encountered there). This writing operation slows down the training stage (especially in the case of EE, where \(S \times T = 15 \times 20 = 300\) model files are constructed).

References

  • Akbani R, Kwek S, Japkowicz N (2004) Applying support vector machines to imbalanced datasets. In: Machine learning: ECML 2004: 15th European conference on machine learning, Pisa, Italy, September 20–24, 2004. Proceedings, Springer, Berlin, pp 39–50. doi:10.1007/978-3-540-30115-8_7

  • Ali A, Shamsuddin SM, Ralescu AL (2015) Classification with class imbalance problem: a review. Int J Adv Soft Comput Appl 7(3):176–204

  • Alzahrani T, Horadam KJ (2016) Community detection in bipartite networks: algorithms and case studies. In: Complex systems and networks: dynamics, controls and applications. Springer, Berlin, pp 25–50. doi:10.1007/978-3-662-47824-0_2

  • Bachner J (2013) Predictive policing: preventing crime with data and analytics. IBM Center for the Business of Government

  • Baesens B, Van Gestel T, Viaene S, Stepanova M, Suykens J, Vanthienen J (2003) Benchmarking state-of-the-art classification algorithms for credit scoring. J Oper Res Soc 54(6):627–635. doi:10.1057/palgrave.jors.2601545

  • Barandela R, Sánchez JS, García V, Rangel E (2003) Strategies for learning in class imbalance problems. Pattern Recognit 36(3):849–851. doi:10.1016/S0031-3203(02)00257-1

  • Barber MJ (2007) Modularity and community detection in bipartite networks. Phys Rev E 76:066102. doi:10.1103/PhysRevE.76.066102

  • Barua S, Islam MM, Yao X, Murase K (2014) MWMOTE-majority weighted minority oversampling technique for imbalanced data set learning. IEEE Trans Knowl Data Eng 26(2):405–425. doi:10.1109/TKDE.2012.232

  • Batista GEAPA, Prati RC, Monard MC (2004) A study of the behavior of several methods for balancing machine learning training data. SIGKDD Explor Newsl 6(1):20–29. doi:10.1145/1007730.1007735

  • Beckett SJ (2016) Improved community detection in weighted bipartite networks. R Soc Open Sci 3(1). doi:10.1098/rsos.140536

  • Bekkar M, Djemaa HK, Alitouche TA (2013) Evaluation measures for models assessment over imbalanced data sets. J Inf Eng Appl 3(10):27–38

  • Bhattacharyya S, Jha S, Tharakunnel K, Westland JC (2011) Data mining for credit card fraud: a comparative study. Decis Support Syst 50(3):602–613. doi:10.1016/j.dss.2010.08.008

  • Blondel VD, Guillaume JL, Lambiotte R, Lefebvre E (2008) Fast unfolding of communities in large networks. J Stat Mech Theory Exp 10:P10008

  • Breiman L, Friedman J, Stone CJ, Olshen RA (1984) Classification and regression trees. Taylor & Francis, London

  • Brozovsky L, Petricek V (2007) Recommender system for online dating service. In: Proceedings of Znalosti 2007 Conference, VSB, Ostrava

  • Cha M, Mislove A, Gummadi KP (2009) A measurement-driven analysis of information propagation in the Flickr social network. In: Proceedings of the 18th international conference on world wide web, ACM, New York. WWW ’09, pp 721–730. doi:10.1145/1526709.1526806

  • Chawla NV (2005) Data mining for imbalanced datasets: an overview. In: Maimon O, Rokach L (eds) Data mining and knowledge discovery handbook. Springer, Boston, pp 853–867

  • Chawla NV, Bowyer KW, Hall LO, Kegelmeyer WP (2002) SMOTE: synthetic minority over-sampling technique. J Artif Intell Res 16:321–357

  • Chawla NV, Lazarevic A, Hall LO, Bowyer KW (2003) Smoteboost: improving prediction of the minority class in boosting. In: Lavrač N, Gamberger D, Todorovski L, Blockeel H (eds) Knowledge discovery in databases: PKDD. Springer, Berlin, pp 107–119

  • Chawla NV, Japkowicz N, Kotcz A (2004) Editorial: special issue on learning from imbalanced data sets. SIGKDD Explor Newsl 6(1):1–6. doi:10.1145/1007730.1007733

  • Chen M, Mao S, Liu Y (2014) Big data: a survey. Mob Netw Appl 19(2):171–209. doi:10.1007/s11036-013-0489-0

  • Chyi YM (2003) Classification analysis techniques for skewed class distribution problems. Master thesis, Department of Information management, National Sun Yat-Sen University

  • Demšar J (2006) Statistical comparisons of classifiers over multiple data sets. J Mach Learn Res 7(Jan):1–30

  • Dhillon IS (2001) Co-clustering documents and words using bipartite spectral graph partitioning. In: Proceedings of the Seventh ACM SIGKDD international conference on knowledge discovery and data mining. ACM, New York, KDD ’01, pp 269–274. doi:10.1145/502512.502550

  • Drummond C, Holte RC (2003) C4.5, class imbalance, and cost sensitivity: why under-sampling beats over-sampling. In: Proceedings of the ICML ’03 Workshop on learning from imbalanced datasets

  • Fan RE, Chang KW, Hsieh CJ, Wang XR, Lin CJ (2008) LIBLINEAR: a library for large linear classification. J Mach Learn Res 9:1871–1874

  • Fan W, Stolfo SJ, Zhang J, Chan PK (1999) AdaCost: Misclassification cost-sensitive boosting. In: Proceedings of the sixteenth international conference on machine learning, Morgan Kaufmann Publishers Inc., San Francisco, ICML ’99, pp 97–105

  • Fawcett T (2006) An introduction to ROC analysis. Pattern Recognit Lett 27(8):861–874. doi:10.1016/j.patrec.2005.10.010

  • Finch H (2005) Comparison of distance measures in cluster analysis with dichotomous data. J Data Sci 3(1):85–100

  • Fortunato S (2010) Community detection in graphs. Phys Rep 486(3–5):75–174. doi:10.1016/j.physrep.2009.11.002

  • Frasca M, Bertoni A, Re M, Valentini G (2013) A neural network algorithm for semi-supervised node label learning from unbalanced data. Neural Netw 43:84–98. doi:10.1016/j.neunet.2013.01.021

  • Friedman M (1937) The use of ranks to avoid the assumption of normality implicit in the analysis of variance. J Am Stat Assoc 32(200):675–701

  • García E, Lozano F (2007) Boosting support vector machines. In: Machine learning and data mining in pattern recognition, 5th international conference, MLDM 2007, Leipzig, Germany, July 18–20, Post Proceedings, IBaI Publishing, pp 153–167

  • Goldstein M, Uchida S (2016) A comparative evaluation of unsupervised anomaly detection algorithms for multivariate data. PLoS ONE 11(4):1–31. doi:10.1371/journal.pone.0152173

  • González PC, Velásquez JD (2013) Characterization and detection of taxpayers with false invoices using data mining techniques. Expert Syst Appl 40(5):1427–1436. doi:10.1016/j.eswa.2012.08.051

  • Guimerà R, Sales-Pardo M, Amaral LAN (2007) Module identification in bipartite and directed networks. Phys Rev E 76:036102. doi:10.1103/PhysRevE.76.036102

  • Guo H, Viktor HL (2004) Learning from imbalanced data sets with boosting and data generation: the DataBoost-IM approach. SIGKDD Explor Newsl 6(1):30–39. doi:10.1145/1007730.1007736

  • Guo X, Yin Y, Dong C, Yang G, Zhou G (2008) On the class imbalance problem. In: 2008 fourth international conference on natural computation, IEEE, vol 4, pp 192–201. doi:10.1109/ICNC.2008.871

  • Han H, Wang WY, Mao BH (2005) Borderline-SMOTE: a new over-sampling method in imbalanced data sets learning. In: Huang D, Zhang X-P, Huang G-B (eds) Advances in intelligent computing. Springer, Berlin, pp 878–887

  • He H, Garcia EA (2009) Learning from imbalanced data. IEEE Trans Knowl Data Eng 21(9):1263–1284. doi:10.1109/TKDE.2008.239

  • He H, Bai Y, Garcia EA, Li S (2008) ADASYN: Adaptive synthetic sampling approach for imbalanced learning. In: 2008 IEEE international joint conference on neural networks (IEEE world congress on computational intelligence), IEEE, pp 1322–1328. doi:10.1109/IJCNN.2008.4633969

  • Holm S (1979) A simple sequentially rejective multiple test procedure. Scand J Stat 6(2):65–70

  • Hsu CW, Lin CJ (2002) A comparison of methods for multiclass support vector machines. IEEE Trans Neural Netw 13(2):415–425. doi:10.1109/72.991427

  • Huang A (2008) Similarity measures for text document clustering. In: Proceedings of the sixth New Zealand computer science research student conference (NZCSRSC2008). Christchurch, New Zealand, pp 49–56

  • Iman RL, Davenport JM (1980) Approximations of the critical region of the Friedman statistic. Commun Stat Theory Methods 9(6):571–595

  • Jo T, Japkowicz N (2004) Class imbalances versus small disjuncts. ACM SIGKDD Explor Newsl 6(1):40–49. doi:10.1145/1007730.1007737

  • Junqué de Fortuny E, Martens D, Provost F (2014a) Predictive modeling with big data: is bigger really better? Big Data 1(4):215–226. doi:10.1089/big.2013.0037

  • Junqué de Fortuny E, Stankova M, Moeyersoms J, Minnaert B, Provost F, Martens D (2014b) Corporate residence fraud detection. In: Proceedings of the 20th ACM SIGKDD international conference on knowledge discovery and data mining, ACM, New York, KDD ’14, pp 1650–1659. doi:10.1145/2623330.2623333

  • Jutla IS, Jeub LG, Mucha PJ (2011–2016) A generalized Louvain method for community detection implemented in MATLAB. http://netwiki.amath.unc.edu/GenLouvain

  • Kubat M, Matwin S (1997) Addressing the curse of imbalanced training sets: one-sided selection. In: Proceedings of the fourteenth international conference on machine learning. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, pp 179–186

  • Lancichinetti A, Fortunato S (2009) Community detection algorithms: a comparative analysis. Phys Rev E 80:056117. doi:10.1103/PhysRevE.80.056117

  • Larremore DB, Clauset A, Jacobs AZ (2014) Efficiently inferring community structure in bipartite networks. Phys Rev E 90:012805. doi:10.1103/PhysRevE.90.012805

  • Li J, Fine JP (2010) Weighted area under the receiver operating characteristic curve and its application to gene selection. J R Stat Soc Series C (Appl Stat) 59(4):673–692. doi:10.1111/j.1467-9876.2010.00713.x

  • Li X, Wang L, Sung E (2008) AdaBoost with SVM-based component classifiers. Eng Appl Artif Intell 21(5):785–795. doi:10.1016/j.engappai.2007.07.001

  • Lichman M (2013) UCI machine learning repository. http://archive.ics.uci.edu/ml

  • Liu W, Chawla S, Cieslak DA, Chawla NV (2010) A robust decision tree algorithm for imbalanced data sets. In: Proceedings of the tenth SIAM international conference on data mining, SIAM, Philadelphia, pp 766–777

  • Liu XY, Wu J, Zhou ZH (2009) Exploratory undersampling for class-imbalance learning. IEEE Trans Syst Man Cybern B (Cybern) 39(2):539–550. doi:10.1109/TSMCB.2008.2007853

  • Luts J, Ojeda F, Van de Plas R, De Moor B, Van Huffel S, Suykens JA (2010) A tutorial on support vector machine-based methods for classification problems in chemometrics. Anal Chim Acta 665(2):129–145. doi:10.1016/j.aca.2010.03.030

  • Macskassy SA, Provost F (2007) Classification in networked data: a toolkit and a univariate case study. J Mach Learn Res 8(May):935–983

  • Martens D, Provost F (2014) Explaining data-driven document classifications. MIS Q 38(1):73–100

  • Martens D, Provost F, Clark J, Junqué de Fortuny E (2016) Mining massive fine-grained behavior data to improve predictive analytics. MIS Q 40(4):869–888

  • Mazurowski MA, Habas PA, Zurada JM, Lo JY, Baker JA, Tourassi GD (2008) Training neural network classifiers for medical decision making: the effects of imbalanced datasets on classification performance. Neural Netw 21(2–3):427–436. doi:10.1016/j.neunet.2007.12.031

  • Mease D, Wyner AJ, Buja A (2007) Boosted classification trees and class probability/quantile estimation. J Mach Learn Res 8:409–439

  • Nemenyi P (1963) Distribution-free multiple comparisons. Dissertation, Princeton University

  • Newman MEJ, Girvan M (2004) Finding and evaluating community structure in networks. Phys Rev E 69:026113. doi:10.1103/PhysRevE.69.026113

  • Ng AY (2004) Feature selection, L1 vs. L2 regularization, and rotational invariance. In: Proceedings of the twenty-first international conference on machine learning, ACM, New York, NY, USA, ICML ’04, p 78. doi:10.1145/1015330.1015435

  • Ng AY, Jordan MI (2002) On discriminative vs. generative classifiers: a comparison of logistic regression and naive Bayes. In: Dietterich TG, Becker S, Ghahramani Z (eds) Advances in neural information processing systems 14. MIT Press, pp 841–848

  • Ngai E, Hu Y, Wong Y, Chen Y, Sun X (2011) The application of data mining techniques in financial fraud detection: a classification framework and an academic review of literature. Decis Support Syst 50(3):559–569. doi:10.1016/j.dss.2010.08.006

  • Platt JC (1999) Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods. In: Smola AJ, Bartlett P, Schoelkopf B, Schuurmans D (eds) Advances in large-margin classifiers. MIT Press, pp 61–74

  • Porter MA, Onnela JP, Mucha PJ (2009) Communities in networks. Not Am Math Soc 56(9):1082–1097

  • Provost F, Fawcett T (2013) Data science for business: what you need to know about data mining and data-analytic thinking. O’Reilly Media, Inc

  • Provost F, Dalessandro B, Hook R, Zhang X, Murray A (2009) Audience selection for on-line brand advertising: privacy-friendly social network targeting. In: Proceedings of the 15th ACM SIGKDD international conference on knowledge discovery and data mining, ACM, New York, NY, USA, KDD ’09, pp 707–716. doi:10.1145/1557019.1557098

  • Raskutti B, Kowalczyk A (2004) Extreme re-balancing for SVMs: a case study. SIGKDD Explor Newsl 6(1):60–69. doi:10.1145/1007730.1007739

  • Rosvall M, Bergstrom CT (2008) Maps of random walks on complex networks reveal community structure. Proc Natl Acad Sci 105(4):1118–1123. doi:10.1073/pnas.0706851105

  • Schapire RE (1999) A brief introduction to boosting. In: Proceedings of the 16th international joint conference on artificial intelligence—volume 2. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, IJCAI’99, pp 1401–1406

  • Schapire RE, Singer Y (1999) Improved boosting algorithms using confidence-rated predictions. Mach Learn 37(3):297–336. doi:10.1023/A:1007614523901

  • Shmueli G (2017) Analyzing behavioral big data: methodological, practical, ethical, and moral issues. Qual Eng 29(1):57–74. doi:10.1080/08982112.2016.1210979

  • Sobhani P, Viktor H, Matwin S (2015) Learning from imbalanced data using ensemble methods and cluster-based undersampling. In: New frontiers in mining complex patterns: third international workshop, NFMCP 2014, Held in Conjunction with ECML-PKDD 2014, Nancy, France, September 19, 2014, Revised Selected Papers, Springer International Publishing, Cham, pp 69–83. doi:10.1007/978-3-319-17876-9_5

  • Stankova M (2016) Classification within network data with a bipartite structure. Dissertation, University of Antwerp

  • Stankova M, Martens D, Provost F (2015) Classification over bipartite graphs through projection. Working papers 2015001, University of Antwerp, Faculty of Applied Economics

  • Sun Y, Kamel MS, Wong AK, Wang Y (2007) Cost-sensitive boosting for classification of imbalanced data. Pattern Recognit 40(12):3358–3378. doi:10.1016/j.patcog.2007.04.009

  • Suykens JA, Van Gestel T, De Brabanter J, De Moor B, Vandewalle J (2002) Least squares support vector machines. World Scientific, Singapore

  • Tang B, He H (2015) ENN: extended nearest neighbor method for pattern recognition [research frontier]. IEEE Comput Intell Mag 10(3):52–60. doi:10.1109/MCI.2015.2437512

  • Tang Y, Zhang YQ, Chawla NV, Krasser S (2009) SVMs modeling for highly imbalanced classification. IEEE Trans Syst Man Cybern B (Cybern) 39(1):281–288. doi:10.1109/TSMCB.2008.2002909

  • Tobback E, Moeyersoms J, Stankova M, Martens D (2016) Bankruptcy prediction for SMEs using relational data. Working paper 2016004, University of Antwerp, Faculty of Applied Economics

  • Verbeke W, Dejaeger K, Martens D, Hur J, Baesens B (2012) New insights into churn prediction in the telecommunication sector: a profit driven data mining approach. Eur J Oper Res 218(1):211–229. doi:10.1016/j.ejor.2011.09.031

  • Veropoulos K, Campbell I, Cristianini N (1999) Controlling the sensitivity of support vector machines. In: Proceedings of the international joint conference on artificial intelligence, Stockholm, Sweden (IJCAI99), pp 55–60

  • Whitrow C, Hand DJ, Juszczak P, Weston D, Adams NM (2009) Transaction aggregation as a strategy for credit card fraud detection. Data Min Knowl Discov 18(1):30–55. doi:10.1007/s10618-008-0116-z

  • Wickramaratna J, Holden SB, Buxton BF (2001) Performance degradation in boosting. In: Proceedings of the second international workshop on multiple classifier systems, Springer, London, UK, MCS ’01, pp 11–21

  • Yen SJ, Lee YS (2009) Cluster-based under-sampling approaches for imbalanced data distributions. Expert Syst Appl 36(3, Part 1):5718–5727. doi:10.1016/j.eswa.2008.06.108

  • Yu HF, Lo HY, Hsieh HP, Lou JK, McKenzie TG, Chou JW, Chung PH, Ho CH, Chang CF, Wei YH, et al (2010) Feature engineering and classifier ensemble for KDD Cup 2010. In: Proceedings of the KDD Cup 2010 workshop, pp 1–16

  • Zha H, He X, Ding C, Simon H, Gu M (2001) Bipartite graph partitioning and data clustering. In: Proceedings of the tenth international conference on information and knowledge management, ACM, New York, NY, USA, CIKM ’01, pp 25–32. doi:10.1145/502585.502591

  • Zhang J, Mani I (2003) kNN approach to unbalanced data distributions: a case study involving information extraction. In: Proceedings of the ICML’2003 workshop on learning from imbalanced datasets, Washington DC

  • Ziegler CN, McNee SM, Konstan JA, Lausen G (2005) Improving recommendation lists through topic diversification. In: Proceedings of the 14th international conference on world wide web, ACM, New York, NY, USA, WWW ’05, pp 22–32. doi:10.1145/1060745.1060754

Author information

Correspondence to Jellis Vanhoeyveld.

Additional information

Responsible editor: Hendrik Blockeel.

Appendices

Appendix A: Oversampling experiments

See Table 11.

Table 11 Oversampling experiments

Appendix B: Undersampling experiments

See Table 12.

Table 12 Undersampling experiments

Appendix C: Boosting experiments

This section presents the results of each of the boosting experiments from Sect. 4.5 on the remaining data sources. Each of the figures below depicts the average tenfold AUC-performance on test data (with \(\mu ~=~100\)) with respect to the number of boosting iterations for (left) AdaBoost (AB), AdaCost (AC) and EasyEnsemble (EE) with C chosen according to highest validation set AUC-performance (over all possible boosting rounds) and (right) AB and EE (with \(S=15\)) with varying C-levels (Figs. 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17).

Fig. 6 Mov_G(\(p=1\)) dataset
Fig. 7 Mov_Th(\(p=[]\)) dataset
Fig. 8 Yahoo_A(\(p=1\)) dataset
Fig. 9 Yahoo_A(\(p=25\)) dataset
Fig. 10 Yahoo_G(\(p=1\)) dataset
Fig. 11 Yahoo_G(\(p=25\)) dataset
Fig. 12 TaFeng(\(p=1\)) dataset
Fig. 13 Book(\(p=1\)) dataset
Fig. 14 LST(\(p=1\)) dataset
Fig. 15 Adver(\(p=[]\)) dataset
Fig. 16 Adver(\(p=1\)) dataset
Fig. 17 CRF(\(p=[]\)) dataset

Appendix D: Final comparison

See Fig. 18.

Fig. 18
figure 18

Average rank AUC versus average rank Time (see Table 9) across the large datasets from Table 2 (CRF and Bank). Regarding the former, an extra row EE_par is added having the same AUC-value as EE(\(S=15\)). Ranks for each dataset are subsequently obtained via the procedure outlined in Sect. 4.6.1. Points occurring in the upper-right region are preferred


Cite this article

Vanhoeyveld, J., Martens, D. Imbalanced classification in sparse and large behaviour datasets. Data Min Knowl Disc 32, 25–82 (2018). https://doi.org/10.1007/s10618-017-0517-y
