Link-based web spam detection using weight properties

Goh, Kwang Leng; Patchmuthu, Ravi Kumar; Singh, Ashutosh Kumar

doi:10.1007/s10844-014-0310-y


Published: 05 March 2014

Journal of Intelligent Information Systems, volume 43, pages 129–145 (2014)

Abstract

Link spam is created with the intention of boosting a target's rank for business profit. This unethical way of deceiving Web search engines is known as Web spam. Many link spam detection techniques have since been proposed. Web spam detection is a crucial task because Web spam degrades Web search engines and costs billions of dollars globally each year. In this paper, we propose a novel technique that incorporates weight properties to enhance Web spam detection algorithms. Weight properties can be defined as the influence of one Web node on another Web node. We modified existing Web spam detection algorithms with our technique and evaluated their performance on a large public Web spam dataset, WEBSPAM-UK2007. Overall, the modified algorithms outperform the benchmark algorithms by up to 30.5 % at the host level and 6.11 % at the page level.


References

Becchetti, L., Castillo, C., Donato, D., Baeza-Yates, R., Leonardi, S. (2008). Link analysis for Web spam detection. ACM Transactions on the Web, 2(1), 1–42. doi:10.1145/1326561.1326563.
Becchetti, L., Castillo, C., Donato, D., Leonardi, S., Baeza-Yates, R. (2006). Using rank propagation and probabilistic counting for link-based spam detection. In Proceedings of the workshop on web mining and web usage analysis (WebKDD 2006), 20-23 August. Philadelphia: ACM Press.
Borodin, A., Roberts, G.O., Rosenthal, J.S., Tsaparas, P. (2005). Link analysis ranking: algorithms, theory, and experiments. ACM Transactions on Internet Technology, 5(1), 231–297. doi:10.1145/1052934.1052942.
Brinkmeier, M. (2006). PageRank revisited. ACM Transactions on Internet Technology, 6(3), 282–301. doi:10.1145/1151087.1151090.
Castillo, C., Donato, D., Becchetti, L., Boldi, P., Santini, M., Vigna, S. (2006). A reference collection for web spam. SIGIR Forum, 40(2).
Eiron, N., McCurley, K.S., Tomlin, J.A. (2004). Ranking the web frontier. Paper presented at the proceedings of the 13th international conference on world wide web, 19-21 May. New York.
Fetterly, D., Manasse, M., Najork, M. (2004). Spam, damn spam, and statistics: using statistical analysis to locate spam web pages. Paper presented at the proceedings of the 7th international workshop on the web and databases: colocated with ACM SIGMOD/PODS 2004, 17–18 June. Paris: Maison de la Chimie.
Gyöngyi, Z., & Garcia-Molina, H. (2005). Web spam taxonomy. In Proceedings of the 1st international workshop on adversarial information retrieval on the web (AIRWeb), 10–14 May (pp. 39–47). Chiba.
Gyöngyi, Z., Garcia-Molina, H., Pedersen, J. (2004). Combating web spam with TrustRank. In Proceedings of the thirtieth international conference on very large data bases (pp. 576–587) VLDB Endowment. Toronto.
Gyöngyi, Z., Berkhin, P., Garcia-Molina, H., Pedersen, J. (2006). Link spam detection based on mass estimation. In: Proceedings of the 32nd international conference on very large data bases (pp. 439–450). VLDB Endowment. Seoul.
Henzinger, M.R., Motwani, R., Silverstein, C. (2002). Challenges in web search engines. SIGIR Forum, 36(2), 11–22. doi:10.1145/792550.792553.
Kleinberg, J.M. (1999). Authoritative sources in a hyperlinked environment. Journal of the ACM, 46(5), 604–632. doi:10.1145/324133.324140.
Krishnan, V., & Raj, R. (2006). Web spam detection with anti-TrustRank. In Proceedings of the 2nd international workshop on adversarial information retrieval on the web (AIRWeb), 10 August (pp. 37–40). Seattle.
Lempel, R., & Moran, S. (2001). SALSA: the stochastic approach for link-structure analysis. ACM Transactions on Information Systems, 19(2), 131–160. doi:10.1145/382979.383041.
Leng, A.G.K., Patchmuthu, R.K., Singh, A.K. (2012a). Incorporating weight properties in detection of web spam. In The 2nd international conference on uncertainty reasoning and knowledge engineering, 14-15 August (pp 18–21). Jakarta: IEEE.
Leng, A.G.K., Patchmuthu, R.K., Singh, A.K., Mohan, A. (2012b). Link based spam algorithms in adversarial information retrieval. Cybernetics and Systems: An International Journal, 43(6), 459–475. doi:10.1080/01969722.2012.707491.
Li, L., Shang, Y., Zhang, W. (2002). Improvement of HITS-based algorithms on web documents. In Proceedings of the 11th international conference on world wide web (pp. 527–535). ACM.
Liang, C., Ru, L., Zhu, X. (2007). R-SpamRank: a spam detection algorithm based on link analysis. Journal of Computational Information Systems, 3(4), 1705–1712.
Nemirovsky, D., & Avrachenkov, K. (2008). Weighted PageRank: Cluster-related weights.
Nie, L., Wu, B., Davison, B.D. (2007). Winnowing wheat from the chaff: propagating trust to sift spam from the web. In: Proceedings of the 30th annual international ACM SIGIR conference on research and development in information retrieval, 23-27 July (pp. 869–870). New York: ACM. doi:10.1145/1277741.1277950.
Noi, L.D., Hagenbuchner, M., Scarselli, F., Tsoi, A.C. (2010). Web spam detection by probability mapping GraphSOMs and Graph Neural Networks. In: Proceedings of the 20th international conference on artificial neural networks: Part II, Thessaloniki, Greece, 15–18 September (pp. 372–381). Germany: Springer.
Qi, C., Song-Nian, Y., Sisi, C. (2008). Link variable TrustRank for fighting web spam. In: Proceedings of international conference on computer science and software engineering, Wuhan, China, 12–14 Dec (pp. 1004–1007). Wuhan. doi:10.1109/csse.2008.1099.
Scarselli, F., Gori, M., Tsoi, A.C., Hagenbuchner, M., Monfardini, G. (2009a). Computational capabilities of graph neural networks. IEEE Transactions on Neural Networks, 20(1), 81–102. doi:10.1109/tnn.2008.2005141.
Scarselli, F., Gori, M., Tsoi, A.C., Hagenbuchner, M., Monfardini, G. (2009b). The graph neural network model. IEEE Transactions on Neural Networks, 20(1), 61–80. doi:10.1109/tnn.2008.2005605.
Sobek, M. (2002). Pr0 - Google’s PageRank 0 Penalty. http://pr.efactory.de/e-pr0.shtml. Accessed 25 Feb 2012.
Wang, X., Tao, T., Sun, J.-T., Shakery, A., Zhai, C. (2008). DirichletRank: solving the zero-one gap problem of PageRank. ACM Transactions on Information Systems, 26(2), 1–29. doi:10.1145/1344411.1344416.
Wu, B., & Davison, B.D. (2005a). Cloaking and redirection: a preliminary study. In: Proceedings of the 1st international workshop on adversarial information retrieval on the web (AIRWeb), Chiba, Japan, 10–14 May (pp. 39–47). Chiba.
Wu, B., & Davison, B.D. (2005b). Identifying link farm spam pages. In: Proceedings of special interest tracks and posters of the 14th international conference on world wide web (pp. 820–829). New York: ACM. doi:10.1145/1062745.1062762.
Wu, B., Goel, V., Davison, B.D. (2006a). Propagating trust and distrust to demote web spam. Paper presented at the world wide web (WWW2006) Workshop on Models of Trust for the Web (MTW), 22–26 May. Edinburgh.
Wu, B., Goel, V., Davison, B.D. (2006b). Topical TrustRank: using topicality to combat web spam. In Proceedings of the 15th international conference on world wide web, Edinburgh, Scotland, 22–26 May (pp. 63–72). New York: ACM. doi:10.1145/1135777.1135792.
Xing, W., & Ghorbani, A. (2004). Weighted pagerank algorithm. In: Communication Networks and Services Research, 2004. Proceedings. Second Annual Conference on (pp. 305–314). IEEE.
Yahoo! (2007). Web Spam Collections. http://barcelona.research.yahoo.net/webspam/datasets/.
Yang, H., King, I., Lyu, M.R. (2007). DiffusionRank: a possible penicillin for web spamming. In Proceedings of the 30th annual international acm sigir conference on research and development in information retrieval, Amsterdam, The Netherlands, 23-27 July (pp. 431–438). New York: ACM. doi:10.1145/1277741.1277815.
Zhang, Y., Jiang, Q., Zhang, L., Zhu, Y. (2009). Exploiting bidirectional links: making spamming detection easier. In Proceedings of the 18th ACM conference on Information and knowledge management, Hong Kong, China (pp. 1839–1842). ACM. doi:10.1145/1645953.1646244.
Zhang, X., Wang, Y., Mou, N., Liang, W. (2011). Propagating both trust and distrust with target differentiation for combating web spam. In W. Burgard, D. Roth (Eds.), Proceedings of the twenty-fifth conference on artificial intelligence (AAAI-11) (pp. 1292–1297). San Francisco: AAAI Press.

Author information

Authors and Affiliations

Department of Electrical and Computer Engineering, Curtin University, Sarawak Campus, Sarawak, Malaysia
Kwang Leng Goh & Ravi Kumar Patchmuthu
Department of Computer Application, National Institute of Technology, Kurukshetra, Haryana, India
Ashutosh Kumar Singh


Corresponding author

Correspondence to Kwang Leng Goh.

Appendix

Anti-TrustRank algorithms

$$ATR(p)=j\cdot \sum\limits_{(p,q)\in \varepsilon } \frac{ATR(q)}{\tau (q)}+(1-j)\cdot B(p)$$

Wu et al. Distrust algorithm

$$DISTR(p)=j\cdot c\cdot \operatorname{MaxShare}\left[ \sum\limits_{\forall (p:q)\in G} \frac{DISTR(q)}{\log (1+\tau (q))} \right]+(1-j)\cdot B(p)$$

Nie et al. Distrust algorithm

$$Distrust(p)=j\cdot \operatorname{MaxShare}\left[ \sum\limits_{\forall (p:q)\in G} \frac{Distrust(q)}{\tau (q)} \right]+(1-j)\cdot B(p)$$

Here ATR denotes Anti-TrustRank, DISTR denotes the distrust algorithm of Baoning Wu et al., and Distrust denotes the distrust algorithm of Lan Nie et al. The parameter j is a jumping probability, τ(q) is the number of incoming links of host q, and B(p) is the spam vector. MaxShare is a function that takes only the maximum distrust value from the children.
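As an illustration only, the Anti-TrustRank recurrence above can be iterated to a fixed point with a few lines of Python. This is a minimal sketch, not the authors' implementation: the function name `propagate_distrust`, the default `j=0.85`, the fixed iteration count, and the choice of a spam vector B that is uniform over the labelled spam seeds are all assumptions made for the example. Passing `combine=max` gives one possible reading of MaxShare (keep only the largest child share); the exact definition used by Wu et al. and Nie et al. may differ.

```python
from collections import defaultdict


def propagate_distrust(edges, spam_seeds, j=0.85, iterations=50, combine=sum):
    """Iterate score(p) = j * combine(score(q) / tau(q)) + (1 - j) * B(p).

    edges: iterable of (p, q) pairs meaning "p links to q".
    spam_seeds: set of nodes labelled as spam; B is uniform over them (assumption).
    combine: sum for the Anti-TrustRank form, max for a MaxShare-style variant.
    """
    edges = list(edges)
    nodes = sorted({n for edge in edges for n in edge} | set(spam_seeds))

    children = defaultdict(list)  # p -> nodes that p links to
    tau = defaultdict(int)        # q -> number of incoming links of q
    for p, q in edges:
        children[p].append(q)
        tau[q] += 1

    # Spam vector B(p): uniform mass on the seeds, zero elsewhere (assumption).
    b = {n: (1.0 / len(spam_seeds) if n in spam_seeds else 0.0) for n in nodes}

    score = dict(b)
    for _ in range(iterations):
        updated = {}
        for p in nodes:
            # Each child q passes back its score divided among its in-links.
            shares = [score[q] / tau[q] for q in children[p]]
            propagated = combine(shares) if shares else 0.0
            updated[p] = j * propagated + (1 - j) * b[p]
        score = updated
    return score


# Hypothetical toy graph: b -> a -> spam -> c, with "spam" as the only seed.
edges = [("b", "a"), ("a", "spam"), ("spam", "c")]
scores = propagate_distrust(edges, {"spam"})
```

On this toy graph the distrust score decays along the reversed links: the seed scores highest, the host linking to it scores lower, and the host the seed links to receives nothing, which matches the intuition that linking to spam is suspicious while being linked from spam is not.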


About this article

Cite this article

Goh, K.L., Patchmuthu, R.K. & Singh, A.K. Link-based web spam detection using weight properties. J Intell Inf Syst 43, 129–145 (2014). https://doi.org/10.1007/s10844-014-0310-y


Received: 14 April 2013
Revised: 23 December 2013
Accepted: 31 January 2014
Published: 05 March 2014
Issue Date: August 2014
DOI: https://doi.org/10.1007/s10844-014-0310-y
