Abstract
This paper presents the main results of our ongoing work, one month before the deadline, on the 2009 UC San Diego data mining contest. The contest tasks are to rank the samples in two e-commerce transaction anomaly datasets according to the probability that each sample has a positive label. Performance is evaluated by the lift at 20% of the ranked probabilities on the two datasets. A main difficulty of the tasks is that the data are highly imbalanced: only about 2% of the samples in each dataset are labeled as positive. We first preprocess the categorical features and normalize all features. We then present our initial results for several popular classifiers, including Support Vector Machines, Neural Networks, AdaBoost, and Logistic Regression. The objective is to obtain benchmark results for these classifiers without much modification, which helps us select a classifier for further tuning. Based on these results, we observe that the area under the ROC curve (AUC) is a good indicator for improving the lift score; we therefore propose an ensemble method that combines the above classifiers to optimize the AUC score, and it obtains significantly better results. We also discuss some treatments for the imbalanced data in the experiments.
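The lift-at-20% metric used for evaluation can be sketched as follows: rank all samples by predicted score, take the top 20%, and divide the positive rate in that slice by the overall positive rate. This is an illustrative implementation only, not the contest's official scorer; the helper name `lift_at_k` and the toy data (with roughly 2% positives, mirroring the contest's imbalance) are our own assumptions:

```python
import numpy as np

def lift_at_k(y_true, scores, k=0.20):
    """Lift at the top-k fraction: the positive rate among the
    top-k-ranked samples divided by the overall positive rate."""
    y_true = np.asarray(y_true)
    n_top = int(np.ceil(k * len(y_true)))
    order = np.argsort(scores)[::-1]          # rank by descending score
    precision_top = np.mean(y_true[order[:n_top]])
    base_rate = np.mean(y_true)
    return precision_top / base_rate

# Toy imbalanced data: ~2% positives, as in the contest datasets.
rng = np.random.default_rng(0)
y = (rng.random(1000) < 0.02).astype(int)

# A perfect ranker places all positives ahead of all negatives, so the
# top 20% contains every positive and the lift is exactly 1/0.2 = 5.
perfect_scores = y + 0.1 * rng.random(1000)
print(lift_at_k(y, perfect_scores))  # prints 5.0
```

Note that 5.0 is the ceiling here only because the positives (about 2% of samples) all fit inside the top 20%; in general the maximum lift at fraction k is min(1/k, 1/base_rate).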
© 2009 Springer-Verlag Berlin Heidelberg
Cite this paper
Yang, H., King, I. (2009). Ensemble Learning for Imbalanced E-commerce Transaction Anomaly Classification. In: Leung, C.S., Lee, M., Chan, J.H. (eds) Neural Information Processing. ICONIP 2009. Lecture Notes in Computer Science, vol 5863. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-10677-4_98
Print ISBN: 978-3-642-10676-7
Online ISBN: 978-3-642-10677-4