ABSTRACT
Label propagation is a well-explored family of methods for training a semi-supervised classifier in which the input data points (both labeled and unlabeled) are connected in the form of a weighted graph. For binary classification, the performance of these methods degrades considerably whenever the input dataset exhibits the following characteristics: (i) one of the class labels is rare, i.e., the class imbalance (CI) is very high, and (ii) the degree of supervision (DoS), defined as the fraction of labeled points, is very low. These characteristics are common in many real-world datasets relating to network fraud detection. Moreover, in such applications, the amount of class imbalance is not known a priori. In this paper, we propose and justify the use of an alternative formulation for graph label propagation under such extreme behavior of the datasets. In our formulation, the objective function is the difference of two convex quadratic functions and the constraints are box constraints. We solve this program using the Concave-Convex Procedure (CCCP). Whenever the problem size becomes too large, we suggest working with a k-NN subgraph of the given graph, which can be sampled using the Locality Sensitive Hashing (LSH) technique. We also discuss various issues that one typically faces while sampling such a k-NN subgraph in practice. Further, we propose a novel label flipping method on top of the CCCP solution, which improves the CCCP result further whenever class imbalance information is available a priori. Our method can easily be adapted to a MapReduce platform such as Hadoop. We have conducted experiments on 11 datasets, comprising graph sizes of up to 20K nodes, CI as high as 99.6%, and DoS as low as 0.5%. Our method has yielded up to 19.5-times improvement in F-measure and up to 17.5-times improvement in AUC-PR against baseline methods.
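The CCCP iteration for such a difference-of-convex program with box constraints can be sketched as follows. This is a minimal illustration assuming generic quadratics g(x) = ½xᵀAx + bᵀx and h(x) = ½xᵀCx + dᵀx with A, C symmetric PSD; the function name and the projected-gradient inner solver are our assumptions, not the paper's exact implementation:

```python
import numpy as np

def cccp_box(A, b, C, d, lo, hi, n_outer=50, n_inner=200):
    """Minimize g(x) - h(x) over the box lo <= x <= hi, where
    g(x) = 0.5 x^T A x + b^T x and h(x) = 0.5 x^T C x + d^T x
    (A, C symmetric PSD). Each CCCP step linearizes the concave
    part -h at the current iterate, yielding a convex QP that is
    solved here by projected gradient descent."""
    n = len(b)
    x = np.clip(np.zeros(n), lo, hi)
    lr = 1.0 / (np.linalg.norm(A, 2) + 1e-12)  # 1/L step for the convex part
    for _ in range(n_outer):
        grad_h = C @ x + d  # gradient of h at x_k, held fixed in the inner loop
        for _ in range(n_inner):
            grad = A @ x + b - grad_h  # gradient of the convex surrogate
            x = np.clip(x - lr * grad, lo, hi)  # projection onto the box
    return x
```

Because the surrogate upper-bounds the objective and is tight at the linearization point, each outer iteration is guaranteed not to increase the objective, which is what makes CCCP attractive for this non-convex program.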
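For the LSH-based k-NN subgraph sampling, a minimal sketch using random-hyperplane (SimHash) bucketing could look like the following. The bucketing scheme, names, and parameters here are illustrative assumptions only; the paper's construction may use multiple hash tables and addresses practical issues that this sketch ignores:

```python
import numpy as np
from collections import defaultdict

def lsh_knn_graph(X, k=3, n_bits=8, seed=0):
    """Approximate k-NN graph via random-hyperplane (SimHash) LSH.
    Points whose signs against n_bits random hyperplanes agree fall
    into the same bucket; only same-bucket points are candidate
    neighbours, and each point keeps edges to its k closest ones."""
    rng = np.random.default_rng(seed)
    H = rng.standard_normal((X.shape[1], n_bits))  # random hyperplane normals
    codes = (X @ H > 0).astype(int)                # one n_bits-bit code per point
    buckets = defaultdict(list)
    for i, c in enumerate(codes):
        buckets[tuple(c)].append(i)
    edges = {}
    for idxs in buckets.values():
        for i in idxs:
            cands = [j for j in idxs if j != i]
            cands.sort(key=lambda j: np.linalg.norm(X[i] - X[j]))
            edges[i] = cands[:k]                   # k nearest same-bucket points
    return edges
```

Note the failure modes this sketch makes visible: a point alone in its bucket gets no neighbours, and near neighbours split across buckets are missed, which is why practical systems repeat the hashing with several independent tables and merge the candidate sets.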
Index Terms: Learning to Propagate Rare Labels