ABSTRACT
Recent years have witnessed a surge in telephone fraud alongside the development of modern communication technology. Telephone fraud identification poses two main challenges: (1) telephone fraud records are typically imbalanced because of their heterogeneous spatial-temporal distribution, which biases classifiers towards predicting the majority class; (2) traditional evaluation metrics in imbalanced learning rely mainly on accuracy or precision and neglect the completeness (recall) of telephone fraud identification in real-world deployments.
To address these limitations of traditional methods, we propose the Stacked-SVM framework, which is based on heterogeneous ensemble learning and support vector machines (SVMs). First, we employ both edited nearest neighbors (ENN) and adaptive synthetic sampling (ADASYN) to alleviate the curse of dimensionality when resampling imbalanced data. Second, we propose an optimal linear combination strategy for the iterations of Stacked-SVM and demonstrate its validity using the Kullback-Leibler divergence. Finally, we construct the Stacked-SVM framework subject to the constraints of the SVM loss function. We further compare its performance under different evaluation metrics (i.e., accuracy, precision, recall, F1-score, and AUC) with four other traditional telephone fraud identification methods, namely Logistic Regression, Isolation Forest, SVM with random parameter settings, and optimized SVM.
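As a rough illustration of the ENN cleaning step mentioned above, the following pure-Python sketch removes majority-class samples whose k nearest neighbors mostly belong to another class. It is not the paper's implementation: the function name, the choice to edit only the majority class, the value k=3, and the toy data are all assumptions made for illustration, and the ADASYN oversampling step is not shown.

```python
import math
from collections import Counter


def edited_nearest_neighbors(X, y, k=3, majority_label=0):
    """Hypothetical ENN sketch: drop majority samples that disagree
    with the most common label among their k nearest neighbors."""
    keep = []
    for i, (xi, yi) in enumerate(zip(X, y)):
        if yi != majority_label:
            # Assumption: minority (fraud) samples are never removed.
            keep.append(i)
            continue
        # Indices of the k nearest other samples by Euclidean distance.
        neighbors = sorted(
            (j for j in range(len(X)) if j != i),
            key=lambda j: math.dist(xi, X[j]),
        )[:k]
        votes = Counter(y[j] for j in neighbors)
        if votes.most_common(1)[0][0] == yi:
            keep.append(i)
    return [X[i] for i in keep], [y[i] for i in keep]


# Toy data (made up): label 0 = normal call, label 1 = fraud call.
X = [(0, 0), (0, 1), (1, 0), (1, 1),      # normal cluster
     (5, 5), (5, 6), (6, 5),              # fraud cluster
     (5.5, 5.5)]                          # noisy normal sample inside the fraud cluster
y = [0, 0, 0, 0, 1, 1, 1, 0]

X_res, y_res = edited_nearest_neighbors(X, y, k=3)
```

In this toy case the noisy normal sample at (5.5, 5.5) is surrounded by fraud samples and is removed, while both clusters stay intact. An oversampler such as ADASYN would then be applied to the cleaned data.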
We evaluate Stacked-SVM in a series of experiments on real telephone fraud data sets in the form of call detail records (CDRs) from a Chinese domestic telecom operator. The experimental results show that the proposed Stacked-SVM achieves a recall of 93.83% and an accuracy of 82.96% in telephone fraud identification, and is more precise and robust than the other models.
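To make the distinction between accuracy and recall concrete, the sketch below computes the standard confusion-matrix metrics on a made-up imbalanced label set. The data, function names, and the two hypothetical classifiers are assumptions for illustration only and do not reproduce the paper's experiments.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count true/false positives and negatives for the positive class."""
    tp = fp = fn = tn = 0
    for t, p in zip(y_true, y_pred):
        if p == positive and t == positive:
            tp += 1
        elif p == positive:
            fp += 1
        elif t == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def metrics(y_true, y_pred, positive=1):
    """Accuracy, precision, recall, and F1, guarding against division by zero."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred, positive)
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Made-up labels: 95 normal calls (0) followed by 5 fraud calls (1).
y_true = [0] * 95 + [1] * 5

# Hypothetical classifier A: always predicts "normal".
pred_a = [0] * 100
# Hypothetical classifier B: 15 false alarms, catches 4 of the 5 fraud calls.
pred_b = [1] * 15 + [0] * 80 + [1] * 4 + [0] * 1

m_a = metrics(y_true, pred_a)
m_b = metrics(y_true, pred_b)
```

Classifier A reaches 95% accuracy while detecting no fraud at all (recall 0), whereas classifier B has lower accuracy (84%) but recovers 80% of the fraud calls. This is the kind of trade-off that motivates reporting recall alongside accuracy.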
Index Terms
- Stacked-SVM: A Dynamic SVM Framework for Telephone Fraud Identification from Imbalanced CDRs