Skip to main content
Log in

A novel ensemble decision tree based on under-sampling and clonal selection for web spam detection

  • Theoretical Advances
  • Published:
Pattern Analysis and Applications Aims and scope Submit manuscript

Abstract

Currently, web spamming is a serious problem for search engines. It not only degrades the quality of search results by intentionally boosting undesirable web pages to users, but also causes the search engine to waste a significant amount of computational and storage resources in manipulating useless information. In this paper, we present a novel ensemble classifier for web spam detection which combines the clonal selection algorithm for feature selection and under-sampling for data balancing. This web spam detection system is called USCS. The USCS ensemble classifiers can automatically sample and select sub-classifiers. First, the system will convert the imbalanced training dataset into several balanced datasets using the under-sampling method. Second, the system will automatically select several optimal feature subsets for each sub-classifier using a customized clonal selection algorithm. Third, the system will build several C4.5 decision tree sub-classifiers from these balanced datasets based on its specified features. Finally, these sub-classifiers will be used to construct an ensemble decision tree classifier which will be applied to classify the examples in the testing data. Experiments on WEBSPAM-UK2006 dataset on the web spam problem show that our proposed approach, the USCS ensemble web spam classifier, contributes significant classification performance compared to several baseline systems and state-of-the-art approaches.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Fig. 1
Fig. 2

Similar content being viewed by others

Notes

  1. The participants and their experimental results can be found in the web spam challenge website (http://webspam.lip6.fr/wiki/pmwiki.php?n=Main.PhaseIResults).

  2. The participants and their experimental results can be found in the web spam challenge website (http://webspam.lip6.fr/wiki/pmwiki.php?n=Main.PhaseIIIResults).

References

  1. Gyongyi Z, Garcia-Molina H (2005) Web spam taxonomy. In: Proceedings of first international workshop on adversarial information retrieval on the web. pp 1–11

  2. Silverstein C, Marais H, Henzinger M, Moricz M (1999) Analysis of a very large web search engine query log. In: Proceedings of the 22nd annual international ACM SIGIR conference on research and development on information retrieval. pp 6–12

  3. Joachims T, Granka L, Pan B, Hembrooke H, Gay G (2005) Accurately interpreting click through data as implicit feedback. In: Proceedings of the 28th annual international ACM SIGIR conference on research and development in information retrieval. pp 154–161

  4. Spirin N, Han J (2012) Survey on web spam detection: principles and algorithms. ACM SIGKDD Explor Newsl 13(2):50–64

    Article  Google Scholar 

  5. Chandra A, Suaib M (2014) A survey on web spam and spam 2.0. Int J Adv Comput Res 4(2):634–644

    Google Scholar 

  6. Tahir MA, Bouridane A, Kurugollu F (2007) Simultaneous feature selection and feature weighting using Hybrid Tabu Search/K-nearest neighbor classifier. Pattern Recognit Lett 28(4):438–446

    Article  Google Scholar 

  7. Bonev B, Escolano F, Cazorla M (2008) Feature selection, mutual information, and the classification of high-dimensional patterns. Pattern Anal Appl 11(3–4):309–319

    Article  Google Scholar 

  8. Kohavi R, Sommerfield D (1995) Feature subset selection using the wrapper method: overfitting and dynamic search space topology. In: Proceedings of the first international conference on knowledge discovery and data mining. AAAI press. pp 192–197

  9. Das S (2001) Filters, wrappers and a boosting-based hybrid for feature selection. In: Proceedings of the eighteenth international conference on machine learning. pp 74–81

  10. Blum AL, Rivest RL (1992) Training a 3-node neural network is NP-complete. Neural Netw 5(1):117–127

    Article  Google Scholar 

  11. Lin S, Lee Z, Chen S, Tseng T (2008) Parameter determination of support vector machine and feature selection using simulated annealing approach. Appl Soft Comput 8(4):1505–1512

    Article  Google Scholar 

  12. Ahmed A (2005) Feature subset selection using ant colony optimization. Int J Comput Intell Appl 2(1):53–58

    MathSciNet  Google Scholar 

  13. Ahmad F, Isa NAM, Hussain Z, Osman MK, Sulaiman SN (2014) A GA-based feature selection and parameter optimization of an ANN in diagnosing breast cancer. Pattern Anal Appl 5(5):1–10

    Google Scholar 

  14. Marinaki M, Marinakis Y (2015) A hybridization of clonal selection algorithm with iterated local search and variable neighborhood search for the feature selection problem. Memet Comput 1(1):1–21

    Google Scholar 

  15. Samadzadegan F, Namin SR, Rajabi MA (2012) Evaluating the potential of clonal selection optimization algorithm to hyperspectral image feature selection. Key Eng Mater 500(1):799–805

    Article  Google Scholar 

  16. Yen S, Lee Y (2009) Cluster-based under-sampling approaches for imbalanced data distributions. Exp Syst Appl 36(3):5718–5727

    Article  MathSciNet  Google Scholar 

  17. Sun Y, Kamel MS, Wong AK, Wang Y (2007) Cost-sensitive boosting for classification of imbalanced data. Pattern Recognit 40(12):3358–3378

    Article  MATH  Google Scholar 

  18. Hong X, Chen S, Harris CJ (2007) A kernel-based two-class classifier for imbalanced data sets. IEEE Trans Neural Netw 18(1):28–41

    Article  Google Scholar 

  19. Quinlan JR (2014) C4. 5: programs for machine learning. Elsevier, Amsterdam

    Google Scholar 

  20. Ntoulas A, Najork M, Manasse M, Fetterly D (2006) Detecting spam web pages through content analysis. In: Proceedings of the 15th international conference on World Wide Web. pp 89–92

  21. Castillo C, Donato D, Gionis A, Murdock V, Silvestri F (2007) Know your neighbors: Web spam detection using the web topology. In: Proceedings of the 30th annual international ACM SIGIR conference on research and development in information retrieval. pp 423–430

  22. Liu Y, Gao B, Liu T, Zhang Y, Ma Z et al (2008) BrowseRank: letting web users vote for page importance. In: Proceedings of the 31st annual international ACM SIGIR conference on research and development in information retrieval. pp 451–458

  23. Craswell N, Zoeter O, Taylor M, Ramsey B (2008) An experimental comparison of click position-bias models. In: Proceedings of the 2008 international conference on web search and data mining. pp 87–94

  24. Scarselli F, Tsoi AC, Hagenbuchner M, Di Noi L (2013) Solving graph data issues using a layered architecture approach with applications to web spam detection. Neural Netw 48:78–90

    Article  Google Scholar 

  25. Jegadeesh JS, Jacob PL (2013) Web spam detection using fuzzy clustering. Int J Recent Innov Trends Comput Commun 1(12):928–938

    Google Scholar 

  26. Wei W, Xiao-Dong L, An-Lei H, Guang-Gang G (2013) Co-training based semi-supervised Web spam detection. In: Proceedings of 10th international conference on fuzzy systems and knowledge discovery. pp 789–793

  27. Yang Q, Wu X (2006) 10 challenging problems in data mining research. Int J Inf Technol Decis Mak 5(04):597–604

    Article  Google Scholar 

  28. He H, Ma Y (2013) Imbalanced learning: foundations, algorithms, and applications. Wiley, Hoboken

    Book  MATH  Google Scholar 

  29. Galar M, Fernandez A, Barrenechea E, Bustince H, Herrera F (2012) A review on ensembles for the class imbalance problem: bagging-, boosting-, and hybrid-based approaches. IEEE Trans Syst Man Cybern C Appl Rev 42(4):463–484

    Article  Google Scholar 

  30. Fan W, Stolfo SJ, Zhang J, Chan PK (1999) Adacost: misclassification cost-sensitive boosting. In: Proceedings of sixth international conference on machine learning (ICML-99), Bled, Slovenia. pp 97–105

  31. Seiffert C, Khoshgoftaar TM, Van Hulse J, Napolitano A (2010) RUSBoost: a hybrid approach to alleviating class imbalance. IEEE Trans Syst Man Cybern A Syst Hum 40(1):185–197

    Article  Google Scholar 

  32. Chawla NV, Lazarevic A, Hall LO, Bowyer KW (2003) SMOTEBoost: improving prediction of the minority class in boosting. Knowledge discovery in databases: PKDD 2003. Springer, New York, pp 107–119

    Chapter  Google Scholar 

  33. Blaszczynski J, Stefanowski J (2015) Neighbourhood sampling in bagging for imbalanced data. Neurocomputing 150:529–542

    Article  Google Scholar 

  34. Hido S, Kashima H, Takahashi Y (2009) Roughly balanced bagging for imbalanced data. Stat Anal Data Min 2(5–6):412–426

    Article  MathSciNet  Google Scholar 

  35. Liu X, Wu J, Zhou Z (2009) Exploratory undersampling for class-imbalance learning. IEEE Trans Syst Man Cybern B Cybern 39(2):539–550

    Article  Google Scholar 

  36. Geng GG, Wang CH, Li QD, Xu L, Jin XB (2007) Boosting the performance of web spam detection with ensemble under-sampling classification. In: Proceedings of the IEEE fourth international conference on fuzzy systems and knowledge discovery. pp 583–587

  37. Kira K, Rendell LA (1992) A practical approach to feature selection. In: Proceedings of the ninth international workshop on machine learning. pp 249–256

  38. De Castro LN, Von Zuben FJ (2002) Learning and optimization using the clonal selection principle. IEEE Trans Evolut Comput 6(3):239–251

    Article  Google Scholar 

  39. De Castro LN, Von Zuben FJ (2002) The clonal selection algorithm with engineering applications. In: Proceedings of the 17th genetic and evolutionary computation conference. pp 36–37

  40. Dudek G (2012) An artificial immune system for classification with local feature selection. IEEE Trans Evolut Comput 16(6):847–860

    Article  Google Scholar 

  41. Castillo C, Donato D, Becchetti L, Boldi P, Leonardi S et al (2006) A reference collection for web spam. ACM Sigir Forum 40(2):11–24

    Article  Google Scholar 

  42. Manning CD, Raghavan P, Schutze H (2008) Introduction to information retrieval. Cambridge University Press, Cambridge

    Book  MATH  Google Scholar 

  43. Fawcett T (2006) An introduction to ROC analysis. Pattern Recognit Lett 27(8):861–874

    Article  MathSciNet  Google Scholar 

  44. Hanley JA, McNeil BJ (1982) The meaning and use of the area under a receiver operating characteristic (ROC) curve. Radiology 143(1):29–36

    Article  Google Scholar 

  45. Mason SJ, Graham NE (2002) Areas beneath the relative operating characteristics (ROC) and relative operating levels (ROL) curves: statistical significance and interpretation. Q J R Meteorol Soc 128(584):2145–2166

    Article  Google Scholar 

Download references

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Pei-Chan Chang.

Rights and permissions

Reprints and permissions

About this article

Check for updates. Verify currency and authenticity via CrossMark

Cite this article

Lu, XY., Chen, MS., Wu, JL. et al. A novel ensemble decision tree based on under-sampling and clonal selection for web spam detection. Pattern Anal Applic 21, 741–754 (2018). https://doi.org/10.1007/s10044-017-0602-2

Download citation

  • Received:

  • Accepted:

  • Published:

  • Issue Date:

  • DOI: https://doi.org/10.1007/s10044-017-0602-2

Keywords

Navigation