Abstract
Web spam is a serious problem for search engines. It degrades the quality of search results by deliberately promoting undesirable web pages to users, and it forces the search engine to waste a significant amount of computational and storage resources on processing useless information. In this paper, we present a novel ensemble classifier for web spam detection that combines the clonal selection algorithm for feature selection with under-sampling for data balancing. We call this web spam detection system USCS. The USCS ensemble classifier performs the sampling and selects its sub-classifiers automatically. First, the system converts the imbalanced training dataset into several balanced datasets using under-sampling. Second, it automatically selects optimal feature subsets for the sub-classifiers using a customized clonal selection algorithm. Third, it builds several C4.5 decision tree sub-classifiers from these balanced datasets, each restricted to its selected features. Finally, these sub-classifiers are combined into an ensemble decision tree classifier, which is applied to classify the examples in the test data. Experiments on the WEBSPAM-UK2006 dataset show that the proposed USCS ensemble web spam classifier yields a significant improvement in classification performance over several baseline systems and state-of-the-art approaches.
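The four steps in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation, and it makes several loudly simplifying assumptions: a one-feature decision stump stands in for the C4.5 tree, the clonal selection loop is a bare select/clone/hypermutate/reselect cycle over feature bit masks, affinity is plain training accuracy on the balanced subset, and labels are binary (0 = normal, 1 = spam). All function names, parameters, and the toy data are hypothetical.

```python
import random
from collections import Counter

def balanced_subsets(X, y, n_subsets, rng):
    """Step 1: pair all minority examples with an equal-sized random majority sample."""
    counts = Counter(y)
    minority = min(counts, key=counts.get)
    min_idx = [i for i, label in enumerate(y) if label == minority]
    maj_idx = [i for i, label in enumerate(y) if label != minority]
    subsets = []
    for _ in range(n_subsets):
        idx = min_idx + rng.sample(maj_idx, len(min_idx))
        subsets.append(([X[i] for i in idx], [y[i] for i in idx]))
    return subsets

def fit_stump(X, y, features):
    """Stand-in base learner (NOT C4.5): best single-feature threshold rule."""
    best = None
    for f in features:
        for t in sorted({row[f] for row in X}):
            for hi in (0, 1):  # label assigned when row[f] > t
                hits = sum((hi if row[f] > t else 1 - hi) == label
                           for row, label in zip(X, y))
                if best is None or hits / len(y) > best[0]:
                    best = (hits / len(y), f, t, hi)
    return best  # (training accuracy, feature, threshold, label above threshold)

def predict_stump(stump, row):
    _, f, t, hi = stump
    return hi if row[f] > t else 1 - hi

def select_features(X, y, rng, pop=6, generations=5, n_clones=3):
    """Step 2: simplified clonal selection over feature bit masks (an assumption)."""
    n = len(X[0])

    def features_of(mask):
        # An empty mask falls back to all features so a stump can always be fitted.
        return [i for i, bit in enumerate(mask) if bit] or list(range(n))

    def affinity(mask):
        # Training accuracy; a real system would score on held-out data.
        return fit_stump(X, y, features_of(mask))[0]

    # The full feature set is always one of the initial antibodies.
    population = [[1] * n] + [[rng.randint(0, 1) for _ in range(n)]
                              for _ in range(pop - 1)]
    for _ in range(generations):
        ranked = sorted(population, key=affinity, reverse=True)
        candidates = list(ranked)
        for rank, mask in enumerate(ranked):
            rate = (rank + 1) / (pop + 1)  # worse-ranked masks mutate more
            for _clone in range(n_clones):
                candidates.append([bit ^ (rng.random() < rate) for bit in mask])
        population = sorted(candidates, key=affinity, reverse=True)[:pop]
    return features_of(population[0])

def train_ensemble(X, y, n_members=5, seed=0):
    """Steps 1-3: one sub-classifier per balanced subset, each on its own features."""
    rng = random.Random(seed)
    members = []
    for Xb, yb in balanced_subsets(X, y, n_members, rng):
        members.append(fit_stump(Xb, yb, select_features(Xb, yb, rng)))
    return members

def predict(members, row):
    """Step 4: majority vote over the sub-classifiers."""
    votes = Counter(predict_stump(m, row) for m in members)
    return votes.most_common(1)[0][0]

# Synthetic toy data: 90 normal hosts, 10 spam hosts; only feature 0 is informative.
X = [[0.1 + (i % 4) * 0.05, i % 2, (i % 5) / 5] for i in range(90)]
X += [[0.8 + (j % 2) * 0.1, j % 2, (j % 5) / 5] for j in range(10)]
y = [0] * 90 + [1] * 10
members = train_ensemble(X, y)
print(predict(members, [0.9, 0, 0.0]), predict(members, [0.0, 1, 0.8]))
```

Because every member is trained on a balanced subset, the minority (spam) class is not swamped by the majority class, while the ensemble as a whole still sees many different majority examples. Swapping the stump for a real decision tree learner and scoring affinity on held-out data would bring the sketch closer to the system the abstract describes.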
Notes
The participants and their experimental results can be found on the Web Spam Challenge website (http://webspam.lip6.fr/wiki/pmwiki.php?n=Main.PhaseIResults).
The participants and their experimental results can be found on the Web Spam Challenge website (http://webspam.lip6.fr/wiki/pmwiki.php?n=Main.PhaseIIIResults).
Cite this article
Lu, XY., Chen, MS., Wu, JL. et al. A novel ensemble decision tree based on under-sampling and clonal selection for web spam detection. Pattern Anal Applic 21, 741–754 (2018). https://doi.org/10.1007/s10044-017-0602-2
DOI: https://doi.org/10.1007/s10044-017-0602-2