Abstract
Link spam is created to boost a target's rank in exchange for business profit. This unethical practice of deceiving Web search engines is known as Web spam, and many link spam detection techniques have been proposed to counter it. Web spam detection is a crucial task because of the damage spam does to Web search engines and its global cost of billions of dollars annually. In this paper, we propose a novel technique that incorporates weight properties to enhance Web spam detection algorithms. Weight properties can be defined as the influence of one Web node on another Web node. We modified existing Web spam detection algorithms with our technique and evaluated their performance on a large public Web spam dataset, WEBSPAM-UK2007. Overall, the modified algorithms outperform the benchmark algorithms by up to 30.5 % at the host level and 6.11 % at the page level.
References
Becchetti, L., Castillo, C., Donato, D., Baeza-Yates, R., Leonardi, S. (2008). Link analysis for Web spam detection. ACM Transactions on the Web, 2(1), 1–42. doi:10.1145/1326561.1326563.
Becchetti, L., Castillo, C., Donato, D., Leonardi, S., Baeza-Yates, R. (2006). Using rank propagation and probabilistic counting for link-based spam detection. In Proceedings of the workshop on web mining and web usage analysis (WebKDD 2006), 20-23 August. Philadelphia: ACM Press.
Borodin, A., Roberts, G.O., Rosenthal, J.S., Tsaparas, P. (2005). Link analysis ranking: algorithms, theory, and experiments. ACM Transactions on Internet Technology, 5(1), 231–297. doi:10.1145/1052934.1052942.
Brinkmeier, M. (2006). PageRank revisited. ACM Transactions on Internet Technology, 6(3), 282–301. doi:10.1145/1151087.1151090.
Castillo, C., Donato, D., Becchetti, L., Boldi, P., Santini, M., Vigna, S. (2006). A reference collection for web spam. SIGIR Forum, 40(2).
Eiron, N., McCurley, K.S., Tomlin, J.A. (2004). Ranking the web frontier. Paper presented at the proceedings of the 13th international conference on world wide web, 19-21 May. New York.
Fetterly, D., Manasse, M., Najork, M. (2004). Spam, damn spam, and statistics: using statistical analysis to locate spam web pages. Paper presented at the proceedings of the 7th international workshop on the web and databases: colocated with ACM SIGMOD/PODS 2004, 17–18 June. Paris: Maison de la Chimie.
Gyöngyi, Z., & Garcia-Molina, H. (2005). Web spam taxonomy. In Proceedings of the 1st international workshop on adversarial information retrieval on the web (AIRWeb), 10–14 May (pp. 39–47). Chiba.
Gyöngyi, Z., Garcia-Molina, H., Pedersen, J. (2004). Combating web spam with TrustRank. In Proceedings of the thirtieth international conference on very large data bases (pp. 576–587) VLDB Endowment. Toronto.
Gyöngyi, Z., Berkhin, P., Garcia-Molina, H., Pedersen, J. (2006). Link spam detection based on mass estimation. In Proceedings of the 32nd international conference on very large data bases (pp. 439–450). VLDB Endowment. Seoul.
Henzinger, M.R., Motwani, R., Silverstein, C. (2002). Challenges in web search engines. SIGIR Forum, 36(2), 11–22. doi:10.1145/792550.792553.
Kleinberg, J.M. (1999). Authoritative sources in a hyperlinked environment. Journal of the ACM, 46(5), 604–632. doi:10.1145/324133.324140.
Krishnan, V., & Raj, R. (2006). Web spam detection with anti-TrustRank. In Proceedings of the 2nd international workshop on adversarial information retrieval on the web (AIRWeb), 10 August (pp. 37–40). Seattle.
Lempel, R., & Moran, S. (2001). SALSA: the stochastic approach for link-structure analysis. ACM Transactions on Information Systems, 19(2), 131–160. doi:10.1145/382979.383041.
Leng, A.G.K., Patchmuthu, R.K., Singh, A.K. (2012a). Incorporating weight properties in detection of web spam. In The 2nd international conference on uncertainty reasoning and knowledge engineering, 14-15 August (pp. 18–21). Jakarta: IEEE.
Leng, A.G.K., Patchmuthu, R.K., Singh, A.K., Mohan, A. (2012b). Link based spam algorithms in adversarial information retrieval. Cybernetics and Systems: An International Journal, 43(6), 459–475. doi:10.1080/01969722.2012.707491.
Li, L., Shang, Y., Zhang, W. (2002). Improvement of HITS-based algorithms on web documents. In Proceedings of the 11th international conference on world wide web (pp. 527–535). ACM.
Liang, C., Ru, L., Zhu, X. (2007). R-SpamRank: a spam detection algorithm based on link analysis. Journal of Computational Information Systems, 3(4), 1705–1712.
Nemirovsky, D., & Avrachenkov, K. (2008). Weighted PageRank: Cluster-related weights.
Nie, L., Wu, B., Davison, B.D. (2007). Winnowing wheat from the chaff: propagating trust to sift spam from the web. In Proceedings of the 30th annual international ACM SIGIR conference on research and development in information retrieval, 23-27 July (pp. 869–870). New York: ACM. doi:10.1145/1277741.1277950.
Noi, L.D., Hagenbuchner, M., Scarselli, F., Tsoi, A.C. (2010). Web spam detection by probability mapping GraphSOMs and Graph Neural Networks. In Proceedings of the 20th international conference on artificial neural networks: Part II, Thessaloniki, Greece, 15–18 September (pp. 372–381). Germany: Springer.
Qi, C., Song-Nian, Y., Sisi, C. (2008). Link variable TrustRank for fighting web spam. In Proceedings of international conference on computer science and software engineering, Wuhan, China, 12–14 Dec (pp. 1004–1007). Wuhan. doi:10.1109/csse.2008.1099.
Scarselli, F., Gori, M., Tsoi, A.C., Hagenbuchner, M., Monfardini, G. (2009a). Computational capabilities of graph neural networks. IEEE Transactions on Neural Networks, 20(1), 81–102. doi:10.1109/tnn.2008.2005141.
Scarselli, F., Gori, M., Tsoi, A.C., Hagenbuchner, M., Monfardini, G. (2009b). The graph neural network model. IEEE Transactions on Neural Networks, 20(1), 61–80. doi:10.1109/tnn.2008.2005605.
Sobek, M. (2002). Pr0 - Google’s PageRank 0 Penalty. http://pr.efactory.de/e-pr0.shtml. Accessed 25 Feb 2012.
Wang, X., Tao, T., Sun, J.-T., Shakery, A., Zhai, C. (2008). DirichletRank: solving the zero-one gap problem of PageRank. ACM Transactions on Information Systems, 26(2), 1–29. doi:10.1145/1344411.1344416.
Wu, B., & Davison, B.D. (2005a). Cloaking and redirection: a preliminary study. In Proceedings of the 1st international workshop on adversarial information retrieval on the web (AIRWeb), Chiba, Japan, 10–14 May (pp. 39–47). Chiba.
Wu, B., & Davison, B.D. (2005b). Identifying link farm spam pages. In Proceedings of special interest tracks and posters of the 14th international conference on world wide web (pp. 820–829). New York: ACM. doi:10.1145/1062745.1062762.
Wu, B., Goel, V., Davison, B.D. (2006a). Propagating trust and distrust to demote web spam. Paper presented at the world wide web (WWW2006) Workshop on Models of Trust for the Web (MTW), 22–26 May. Edinburgh.
Wu, B., Goel, V., Davison, B.D. (2006b). Topical TrustRank: using topicality to combat web spam. In Proceedings of the 15th international conference on world wide web, Edinburgh, Scotland, 22–26 May (pp. 63–72). New York: ACM. doi:10.1145/1135777.1135792.
Xing, W., & Ghorbani, A. (2004). Weighted PageRank algorithm. In Communication Networks and Services Research, 2004. Proceedings. Second Annual Conference on (pp. 305–314). IEEE.
Yahoo! (2007). Web Spam Collections. http://barcelona.research.yahoo.net/webspam/datasets/.
Yang, H., King, I., Lyu, M.R. (2007). DiffusionRank: a possible penicillin for web spamming. In Proceedings of the 30th annual international ACM SIGIR conference on research and development in information retrieval, Amsterdam, The Netherlands, 23-27 July (pp. 431–438). New York: ACM. doi:10.1145/1277741.1277815.
Zhang, Y., Jiang, Q., Zhang, L., Zhu, Y. (2009). Exploiting bidirectional links: making spamming detection easier. In Proceedings of the 18th ACM conference on Information and knowledge management, Hong Kong, China (pp. 1839–1842). ACM. doi:10.1145/1645953.1646244.
Zhang, X., Wang, Y., Mou, N., Liang, W. (2011). Propagating both trust and distrust with target differentiation for combating web spam. In W. Burgard, D. Roth (Eds.), Proceedings of the twenty-fifth conference on artificial intelligence (AAAI-11) (pp. 1292–1297). San Francisco: AAAI Press.
Appendix
Anti-TrustRank algorithms
Wu et al. Distrust algorithm
Nie et al. Distrust algorithm
Here ATR denotes Anti-TrustRank, DISTR denotes the distrust algorithm of Baoning Wu et al., and Distrust(p) denotes the distrust algorithm of Lan Nie et al. The parameter j is a jumping probability, τ(q) is the number of incoming links of host q, and B(p) is the spam vector. MaxShare is a function that takes only the maximum distrust value from the children.
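As a rough illustration of how distrust propagates in an Anti-TrustRank-style computation, the following Python sketch iterates a score backward along links from a seed set of known spam hosts. It is a minimal sketch under stated assumptions, not the authors' implementation. The update rule ATR(p) = (1 − j) · Σ ATR(q)/τ(q) + j · B(p), summed over the hosts q that p links to, is assumed here as the commonly used form; the function name `anti_trust_rank`, the uniform seed vector, the default j = 0.15, and the toy graph are all hypothetical.

```python
def anti_trust_rank(out_links, spam_seeds, j=0.15, iterations=50):
    """Propagate distrust backward from spam seeds (illustrative sketch).

    out_links  -- dict mapping each host to the list of hosts it links to
    spam_seeds -- iterable of hosts labelled as spam
    j          -- jumping probability (assumed: probability of jumping to B)
    """
    # Collect every host that appears as a source or a target.
    nodes = set(out_links)
    for targets in out_links.values():
        nodes.update(targets)

    # tau[q]: number of incoming links of host q.
    tau = {n: 0 for n in nodes}
    for targets in out_links.values():
        for q in targets:
            tau[q] += 1

    # B: spam vector, assumed uniform over the seed hosts.
    seeds = {s for s in spam_seeds if s in nodes}
    b = {n: (1.0 / len(seeds) if n in seeds else 0.0) for n in nodes}

    atr = dict(b)
    for _ in range(iterations):
        new_atr = {}
        for p in nodes:
            # p inherits a share of the distrust of each host it links to.
            inherited = sum(atr[q] / tau[q] for q in out_links.get(p, ()))
            new_atr[p] = (1.0 - j) * inherited + j * b[p]
        atr = new_atr
    return atr


# Hypothetical toy graph: blog -> farm -> spam, and news -> portal.
toy_graph = {
    "blog": ["farm"],
    "farm": ["spam"],
    "news": ["portal"],
}
scores = anti_trust_rank(toy_graph, spam_seeds=["spam"])
for host in sorted(scores, key=scores.get, reverse=True):
    print(host, round(scores[host], 4))
```

In this toy graph, hosts that link toward the spam seed receive a score that decays with distance, while hosts with no path to the seed stay at zero. Dividing by τ(q) splits the distrust of q among the hosts that link to it, which matches the role of τ(q) described above.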
Cite this article
Goh, K.L., Patchmuthu, R.K. & Singh, A.K. Link-based web spam detection using weight properties. J Intell Inf Syst 43, 129–145 (2014). https://doi.org/10.1007/s10844-014-0310-y
DOI: https://doi.org/10.1007/s10844-014-0310-y