Abstract
Online advertising models are vulnerable to click fraud, which occurs when an individual or group repeatedly clicks on an online advertisement to generate illegitimate clicks and earn revenue at the advertiser's expense. In machine learning-based click fraud detection, model performance can be degraded by collinear, redundant, and insignificant features in the dataset, which lead to overfitting: the model becomes too complex and fails to generalize to new data. Therefore, a Manifold Criterion Variable Elimination method is proposed in this work, which exploits six filter-based feature selection techniques to select significant features for discriminating fraudulent from genuine publishers. Experiments are conducted on an online advertisement user-click dataset in two modes: first with all extracted features and second with only the selected features. For each class instance, labelled OK, Fraud, or Observation, 103 statistical features are extracted from the user-click dataset, from which the Manifold Criterion Variable Elimination method selects the top 15 most relevant. Individual and ensemble learning models are trained with the selected feature set and tuned parameter values, and their performance is evaluated using standard measures. The results demonstrate that the performance of all learners generally improves with the selected feature set. In particular, the Gradient Tree Boosting (GTB) ensemble performs best, as it iteratively merges weak learners into a strong one while minimizing the model's loss.
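The multi-filter selection described above can be illustrated with a minimal sketch: several filter criteria each rank every feature, the per-criterion ranks are aggregated, and the top-k features by aggregate rank are kept. This is an assumption-laden simplification, not the paper's method: the function name `rank_aggregate_select` is hypothetical, and the six filter techniques of the proposed Manifold Criterion Variable Elimination are replaced here by three simple stand-ins (absolute Pearson correlation, an ANOVA-style F score, and feature variance).

```python
import numpy as np

def rank_aggregate_select(X, y, k):
    """Aggregate per-feature ranks from several filter criteria and keep
    the top-k features (a sketch of multi-filter rank fusion; hypothetical
    stand-ins for the paper's six filter-based techniques)."""
    # Criterion 1: absolute Pearson correlation of each feature with the label
    corr = np.abs([np.corrcoef(X[:, j], y)[0, 1] for j in range(X.shape[1])])
    # Criterion 2: ANOVA-style F score (between-class over within-class variance)
    classes = np.unique(y)
    means = np.array([X[y == c].mean(axis=0) for c in classes])
    between = ((means - X.mean(axis=0)) ** 2).sum(axis=0)
    within = np.array([X[y == c].var(axis=0) for c in classes]).sum(axis=0) + 1e-12
    fscore = between / within
    # Criterion 3: raw feature variance
    var = X.var(axis=0)
    # Rank features under each criterion (rank 0 = best) and sum the ranks
    ranks = sum(np.argsort(np.argsort(-s)) for s in (corr, fscore, var))
    # Indices of the k features with the best (smallest) aggregate rank
    return np.argsort(ranks)[:k]
```

On the real dataset, the selected indices would then feed a gradient boosting classifier (e.g. scikit-learn's `GradientBoostingClassifier`) trained only on `X[:, selected]`, mirroring the two-mode comparison described in the abstract.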
© 2023 IFIP International Federation for Information Processing
Cite this paper
Singh, L., Sisodia, D., Taranath, N.L. (2023). Gradient Boosting-Based Predictive Click Fraud Detection Using Manifold Criterion Variable Elimination. In: Chandran K R, S., N, S., A, B., Hamead H, S. (eds) Computational Intelligence in Data Science. ICCIDS 2023. IFIP Advances in Information and Communication Technology, vol 673. Springer, Cham. https://doi.org/10.1007/978-3-031-38296-3_22
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-38295-6
Online ISBN: 978-3-031-38296-3