A novel ensemble decision tree based on under-sampling and clonal selection for web spam detection

Lu, Xiao-Yong; Chen, Mu-Sheng; Wu, Jheng-Long; Chang, Pei-Chan; Chen, Meng-Hui

doi:10.1007/s10044-017-0602-2

Theoretical Advances
Published: 09 February 2017

Volume 21, pages 741–754, (2018)
Pattern Analysis and Applications

Xiao-Yong Lu^1,
Mu-Sheng Chen^2,3,
Jheng-Long Wu^4,
Pei-Chan Chang^1,5 (ORCID: orcid.org/0000-0002-9900-3513) &
…
Meng-Hui Chen^1

Abstract

Web spamming is a serious problem for search engines. It degrades the quality of search results by intentionally promoting undesirable web pages to users, and it causes search engines to waste a significant amount of computational and storage resources on processing useless information. In this paper, we present a novel ensemble classifier for web spam detection that combines a clonal selection algorithm for feature selection with under-sampling for data balancing. We call this web spam detection system USCS. The USCS ensemble classifier can automatically sample and select sub-classifiers. First, the system converts the imbalanced training dataset into several balanced datasets using under-sampling. Second, it automatically selects optimal feature subsets for the sub-classifiers using a customized clonal selection algorithm. Third, it builds several C4.5 decision tree sub-classifiers from these balanced datasets, each using its specified features. Finally, these sub-classifiers are combined into an ensemble decision tree classifier, which is applied to classify the examples in the testing data. Experiments on the WEBSPAM-UK2006 dataset show that the proposed USCS ensemble classifier achieves significantly better classification performance than several baseline systems and state-of-the-art approaches.
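The four steps described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation, and several parts are assumptions made only for illustration: a one-level decision stump stands in for C4.5, the clonal selection loop is heavily simplified (fixed clone count, rank-based mutation rate, an assumed penalty of 0.01 per selected feature), majority voting is assumed as the combination rule, labels are assumed to be 0 (normal) and 1 (spam), and all function names and the toy data are hypothetical.

```python
import random
from collections import Counter

def balanced_subsets(X, y, n_subsets, rng):
    """Step 1: under-sample the majority class into several balanced datasets."""
    counts = Counter(y)
    minority = min(counts, key=counts.get)
    min_idx = [i for i, lab in enumerate(y) if lab == minority]
    maj_idx = [i for i, lab in enumerate(y) if lab != minority]
    out = []
    for _ in range(n_subsets):
        idx = min_idx + rng.sample(maj_idx, len(min_idx))
        out.append(([X[i] for i in idx], [y[i] for i in idx]))
    return out

def fit_stump(X, y, features):
    """One-level tree over the allowed features (assumed stand-in for C4.5)."""
    best = None
    for f in features:
        for t in sorted({row[f] for row in X}):
            for above in (0, 1):  # label predicted when row[f] > t
                hits = sum((above if row[f] > t else 1 - above) == lab
                           for row, lab in zip(X, y))
                if best is None or hits > best[0]:
                    best = (hits, f, t, above)
    return best[1:]  # (feature, threshold, label above threshold)

def predict_stump(stump, row):
    f, t, above = stump
    return above if row[f] > t else 1 - above

def affinity(mask, X, y):
    """Training accuracy on the masked features minus an assumed size penalty."""
    feats = [f for f, bit in enumerate(mask) if bit]
    if not feats:
        return -1.0
    stump = fit_stump(X, y, feats)
    acc = sum(predict_stump(stump, r) == lab for r, lab in zip(X, y)) / len(y)
    return acc - 0.01 * len(feats)

def clonal_select(X, y, n_features, rng, pop=6, gens=5, clones=3):
    """Step 2: simplified clonal selection over binary feature masks."""
    population = [[rng.randint(0, 1) for _ in range(n_features)]
                  for _ in range(pop)]
    for _ in range(gens):
        ranked = sorted(population, key=lambda m: affinity(m, X, y), reverse=True)
        next_pop = []
        for rank, parent in enumerate(ranked):
            rate = (rank + 1) / (pop * 2)  # lower-ranked antibodies mutate more
            family = [parent] + [[bit ^ (rng.random() < rate) for bit in parent]
                                 for _ in range(clones)]
            next_pop.append(max(family, key=lambda m: affinity(m, X, y)))
        population = next_pop
    return max(population, key=lambda m: affinity(m, X, y))

def train_uscs(X, y, n_subsets=5, seed=0):
    """Steps 1-3: one sub-classifier per balanced dataset and feature mask."""
    rng = random.Random(seed)
    n_features = len(X[0])
    ensemble = []
    for Xb, yb in balanced_subsets(X, y, n_subsets, rng):
        mask = clonal_select(Xb, yb, n_features, rng)
        # Fall back to all features if the search returns an empty mask.
        feats = [f for f, bit in enumerate(mask) if bit] or list(range(n_features))
        ensemble.append(fit_stump(Xb, yb, feats))
    return ensemble

def predict_uscs(ensemble, row):
    """Step 4: majority vote of the sub-classifiers (assumed combination rule)."""
    votes = Counter(predict_stump(s, row) for s in ensemble)
    return votes.most_common(1)[0][0]

# Toy data (hypothetical): feature 0 separates the classes, feature 1 is noise.
X = ([[0.10 + 0.01 * i, (i * 7 % 10) / 10] for i in range(30)]
     + [[0.70 + 0.02 * i, (i * 3 % 10) / 10] for i in range(6)])
y = [0] * 30 + [1] * 6  # imbalanced 5:1
ensemble = train_uscs(X, y)
print(predict_uscs(ensemble, [0.85, 0.5]))
```

An odd number of sub-classifiers avoids tied votes in the two-class case. In a fuller version, affinity would more plausibly be measured on held-out data rather than on the training subset, and the stump would be replaced by a real decision tree learner.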

Notes

The participants and their experimental results can be found on the Web Spam Challenge website (http://webspam.lip6.fr/wiki/pmwiki.php?n=Main.PhaseIResults).
The participants and their experimental results can be found on the Web Spam Challenge website (http://webspam.lip6.fr/wiki/pmwiki.php?n=Main.PhaseIIIResults).

Author information

Authors and Affiliations

School of Software Engineering, Nanchang University, Nanchang, 330047, China
Xiao-Yong Lu, Pei-Chan Chang & Meng-Hui Chen
School of Information Engineering, Nanchang University, Nanchang, 330031, China
Mu-Sheng Chen
Software School, Nanchang University, Nanchang, 330031, China
Mu-Sheng Chen
Institute of Information Science, Academia Sinica, Taipei, 115, Taiwan
Jheng-Long Wu
Information Management and Innovation Center for Big Data and Digital Convergence, Yuan Ze University, Taoyuan, 32003, Taiwan
Pei-Chan Chang

Corresponding author

Correspondence to Pei-Chan Chang.

About this article

Cite this article

Lu, XY., Chen, MS., Wu, JL. et al. A novel ensemble decision tree based on under-sampling and clonal selection for web spam detection. Pattern Anal Applic 21, 741–754 (2018). https://doi.org/10.1007/s10044-017-0602-2

Received: 09 May 2016
Accepted: 18 January 2017
Published: 09 February 2017
Issue Date: August 2018
DOI: https://doi.org/10.1007/s10044-017-0602-2
