
Link-based web spam detection using weight properties

Journal of Intelligent Information Systems

Abstract

Link spam is created with the intention of boosting a target’s rank in exchange for business profit. This unethical way of deceiving Web search engines is known as Web spam, and many link spam detection techniques have been proposed to counter it. Web spam detection is a crucial task because of the damage spam does to Web search engines and its global cost of billions of dollars annually. In this paper, we propose a novel technique that incorporates weight properties to enhance Web spam detection algorithms. Weight properties can be defined as the influence of one Web node on another Web node. We modified existing Web spam detection algorithms with our technique and evaluated their performance on a large public Web spam dataset, WEBSPAM-UK2007. Overall, the modified algorithms outperform the benchmark algorithms by up to 30.5 % at host level and 6.11 % at page level.



References

  • Becchetti, L., Castillo, C., Donato, D., Baeza-Yates, R., Leonardi, S. (2008). Link analysis for Web spam detection. ACM Trans Web, 2(1), 1–42. doi:10.1145/1326561.1326563.


  • Becchetti, L., Castillo, C., Donato, D., Leonardi, S., Baeza-Yates, R. (2006). Using rank propagation and probabilistic counting for link-based spam detection. In Proceedings of the workshop on web mining and web usage analysis (WebKDD 2006), 20-23 August. Philadelphia: ACM Press.


  • Borodin, A., Roberts, G.O., Rosenthal, J.S., Tsaparas, P. (2005). Link analysis ranking: algorithms, theory, and experiments. ACM Transactions on Internet Technology, 5(1), 231–297. doi:10.1145/1052934.1052942.


  • Brinkmeier, M. (2006). PageRank revisited. ACM Transactions on Internet Technology, 6(3), 282–301. doi:10.1145/1151087.1151090.


  • Castillo, C., Donato, D., Becchetti, L., Boldi, P., Santini, M., Vigna, S. (2006). A reference collection for web spam. SIGIR Forum, 40(2).

  • Eiron, N., McCurley, K.S., Tomlin, J.A. (2004). Ranking the web frontier. Paper presented at the proceedings of the 13th international conference on world wide web, 19-21 May. New York.

  • Fetterly, D., Manasse, M., Najork, M. (2004). Spam, damn spam, and statistics: using statistical analysis to locate spam web pages. Paper presented at the proceedings of the 7th international workshop on the web and databases: colocated with ACM SIGMOD/PODS 2004, 17–18 June. Paris: Maison de la Chimie.


  • Gyöngyi, Z., & Garcia-Molina, H. (2005). Web spam taxonomy. In Proceedings of the 1st international workshop on adversarial information retrieval on the web (AIRWeb), 10–14 May (pp. 39–47). Chiba.

  • Gyöngyi, Z., Garcia-Molina, H., Pedersen, J. (2004). Combating web spam with TrustRank. In Proceedings of the thirtieth international conference on very large data bases (pp. 576–587) VLDB Endowment. Toronto.

  • Gyöngyi, Z., Berkhin, P., Garcia-Molina, H., Pedersen, J. (2006). Link spam detection based on mass estimation. In: Proceedings of the 32nd international conference on very large data bases (pp. 439–450). VLDB Endowment. Seoul.

  • Henzinger, M.R., Motwani, R., Silverstein, C. (2002). Challenges in web search engines. SIGIR Forum, 36(2), 11–22. doi:10.1145/792550.792553.


  • Kleinberg, J.M. (1999). Authoritative sources in a hyperlinked environment. Journal of the ACM, 46(5), 604–632. doi:10.1145/324133.324140.


  • Krishnan, V., & Raj, R. (2006). Web spam detection with anti-TrustRank. In Proceedings of the 2nd international workshop on adversarial information retrieval on the web (AIRWeb), 10 August (pp. 37–40). Seattle.

  • Lempel, R., & Moran, S. (2001). SALSA: the stochastic approach for link-structure analysis. ACM Transactions on Information Systems, 19(2), 131–160. doi:10.1145/382979.383041.


  • Leng, A.G.K., Patchmuthu, R.K., Singh, A.K. (2012a). Incorporating weight properties in detection of web spam. In The 2nd international conference on uncertainty reasoning and knowledge engineering, 14–15 August (pp. 18–21). Jakarta: IEEE.


  • Leng, A.G.K., Patchmuthu, R.K., Singh, A.K., Mohan, A. (2012b). Link based spam algorithms in adversarial information retrieval. Cybernetics and Systems: An International Journal, 43(6), 459–475. doi:10.1080/01969722.2012.707491.


  • Li, L., Shang, Y., Zhang, W. (2002). Improvement of HITS-based algorithms on web documents. In Proceedings of the 11th international conference on world wide web (pp. 527–535). ACM.

  • Liang, C., Ru, L., Zhu, X. (2007). R-SpamRank: a spam detection algorithm based on link analysis. Journal of Computational Information Systems, 3(4), 1705–1712.


  • Nemirovsky, D., & Avrachenkov, K. (2008). Weighted PageRank: Cluster-related weights.

  • Nie, L., Wu, B., Davison, B.D. (2007). Winnowing wheat from the chaff: propagating trust to sift spam from the web. In: Proceedings of the 30th annual international ACM SIGIR conference on research and development in information retrieval, 23-27 July (pp. 869–870). New York: ACM. doi:10.1145/1277741.1277950.


  • Noi, L.D., Hagenbuchner, M., Scarselli, F., Tsoi, A.C. (2010). Web spam detection by probability mapping GraphSOMs and Graph Neural Networks. In: Proceedings of the 20th international conference on artificial neural networks: Part II, Thessaloniki, Greece, 15–18 September (pp. 372–381). Germany: Springer.


  • Qi, C., Song-Nian, Y., Sisi, C. (2008). Link variable TrustRank for fighting web spam. In: Proceedings of international conference on computer science and software engineering, Wuhan, China, 12–14 Dec (pp. 1004–1007). Wuhan. doi:10.1109/csse.2008.1099.

  • Scarselli, F., Gori, M., Tsoi, A.C., Hagenbuchner, M., Monfardini, G. (2009a). Computational capabilities of graph neural networks. Transactions on Neural Network, 20(1), 81–102. doi:10.1109/tnn.2008.2005141.


  • Scarselli, F., Gori, M., Tsoi, A.C., Hagenbuchner, M., Monfardini, G. (2009b). The graph neural network model. Trans Neur Netw, 20(1), 61–80. doi:10.1109/tnn.2008.2005605.


  • Sobek, M. (2002). Pr0 - Google’s PageRank 0 Penalty. http://pr.efactory.de/e-pr0.shtml. Accessed 25 Feb 2012.

  • Wang, X., Tao, T., Sun, J.-T., Shakery, A., Zhai, C. (2008). DirichletRank: solving the zero-one gap problem of PageRank. ACM Transactions on Information Systems, 26(2), 1–29. doi:10.1145/1344411.1344416.


  • Wu, B., & Davison, B.D. (2005a). Cloaking and redirection: a preliminary study. In: Proceedings of the 1st international workshop on adversarial information retrieval on the web (AIRWeb), Chiba, Japan, 10–14 May (pp. 39–47). Chiba.

  • Wu, B., & Davison, B.D. (2005b). Identifying link farm spam pages. In: Proceedings of special interest tracks and posters of the 14th international conference on world wide web (pp. 820–829). New York: ACM. doi:10.1145/1062745.1062762.


  • Wu, B., Goel, V., Davison, B.D. (2006a). Propagating trust and distrust to demote web spam. Paper presented at the world wide web (WWW2006) Workshop on Models of Trust for the Web (MTW), 22–26 May. Edinburgh.

  • Wu, B., Goel, V., Davison, B.D. (2006b). Topical TrustRank: using topicality to combat web spam. In Proceedings of the 15th international conference on world wide web, Edinburgh, Scotland, 22–26 May (pp. 63–72). New York: ACM. doi:10.1145/1135777.1135792.


  • Xing, W., & Ghorbani, A. (2004). Weighted pagerank algorithm. In: Communication Networks and Services Research, 2004. Proceedings. Second Annual Conference on (pp. 305–314). IEEE.

  • Yahoo! (2007). Web Spam Collections. http://barcelona.research.yahoo.net/webspam/datasets/.

  • Yang, H., King, I., Lyu, M.R. (2007). DiffusionRank: a possible penicillin for web spamming. In Proceedings of the 30th annual international ACM SIGIR conference on research and development in information retrieval, Amsterdam, The Netherlands, 23–27 July (pp. 431–438). New York: ACM. doi:10.1145/1277741.1277815.


  • Zhang, Y., Jiang, Q., Zhang, L., Zhu, Y. (2009). Exploiting bidirectional links: making spamming detection easier. In Proceedings of the 18th ACM conference on Information and knowledge management, Hong Kong, China (pp. 1839–1842). ACM, 1646244. doi:10.1145/1645953.1646244.

  • Zhang, X., Wang, Y., Mou, N., Liang, W. (2011). Propagating both trust and distrust with target differentiation for combating web spam. In W. Burgard, D. Roth (Eds.), Proceedings of the twenty-fifth conference on artificial intelligence (AAAI-11) (pp. 1292–1297). San Francisco: AAAI Press.


Author information


Corresponding author

Correspondence to Kwang Leng Goh.

Appendix


Anti-TrustRank algorithms

$$ATR(p)=j\cdot \sum\limits_{(p,q)\in \varepsilon } {\left( {\frac{ATR(q)}{\tau (q)}} \right)} +(1-j)\cdot B(p) $$

Wu et al. Distrust algorithm

$$DISTR(p)=j\cdot c\cdot MaxShare\left[ {\sum\limits_{\forall (p:q)\in G} {\left( {\frac{DISTR(q)}{\log (1+\tau (q))}} \right)} } \right]+(1-j)\cdot B(p) $$

Nie et al. Distrust algorithm

$$Distrust(p)=j\cdot MaxShare\left[ {\sum\limits_{\forall (p:q)\in G} {\left( {\frac{Distrust(q)}{\tau (q)}} \right)} } \right]+(1-j)\cdot B(p) $$

Here ATR denotes Anti-TrustRank, DISTR denotes the distrust algorithm of Baoning Wu et al., and Distrust denotes the distrust algorithm of Lan Nie et al. j is a jumping probability, τ(q) is the number of incoming links of host q, and B(p) is the spam vector. MaxShare is a function that takes only the maximum distrust value from the children.
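As an illustration only, the Anti-TrustRank recurrence above can be iterated to a fixed point on a small graph. The following is a minimal sketch, not the authors' implementation. The function name `anti_trust_rank`, the toy graph, the uniform spam vector over the seed set, the fixed iteration count, and the default j = 0.85 are all assumptions made for the example. The optional `use_max_share` flag is one possible reading of MaxShare (keep only the largest share from the children instead of summing), and the constant c from the Wu et al. formula is omitted.

```python
from collections import defaultdict


def anti_trust_rank(edges, spam_seeds, j=0.85, iterations=50, use_max_share=False):
    """Iterate ATR(p) = j * sum_{(p,q) in E} ATR(q) / tau(q) + (1 - j) * B(p).

    edges: iterable of (p, q) pairs meaning "p links to q".
    spam_seeds: set of nodes labelled as spam (hypothetical seed set).
    """
    edge_set = set(edges)
    nodes = sorted({n for e in edge_set for n in e} | set(spam_seeds))

    out_links = defaultdict(list)  # p -> children q that p links to
    in_degree = defaultdict(int)   # tau(q): number of incoming links of q
    for p, q in edge_set:
        out_links[p].append(q)
        in_degree[q] += 1

    # B(p): assumed uniform over the spam seeds, zero elsewhere.
    b = {n: (1.0 / len(spam_seeds) if n in spam_seeds else 0.0) for n in nodes}

    score = dict(b)
    for _ in range(iterations):
        new_score = {}
        for p in nodes:
            # Distrust flows backwards: p inherits a share from each child q.
            shares = [score[q] / in_degree[q] for q in out_links[p]]
            if use_max_share:
                propagated = max(shares, default=0.0)  # assumed MaxShare reading
            else:
                propagated = sum(shares)
            new_score[p] = j * propagated + (1 - j) * b[p]
        score = new_score
    return score


# Toy graph: b -> a -> s, where s is a known spam host; c -> g is unrelated.
toy_edges = [("a", "s"), ("b", "a"), ("c", "g")]
scores = anti_trust_rank(toy_edges, spam_seeds={"s"})
max_scores = anti_trust_rank(toy_edges, spam_seeds={"s"}, use_max_share=True)
```

In this toy graph the hosts that link towards the spam seed receive non-zero distrust that decays with distance, while the unrelated hosts c and g stay at zero. Because every host here has at most one outgoing link, the summed and max-share variants give the same scores.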


About this article

Cite this article

Goh, K.L., Patchmuthu, R.K. & Singh, A.K. Link-based web spam detection using weight properties. J Intell Inf Syst 43, 129–145 (2014). https://doi.org/10.1007/s10844-014-0310-y


  • DOI: https://doi.org/10.1007/s10844-014-0310-y
