
Survey on web spam detection: principles and algorithms

Published: 01 May 2012

Abstract

Search engines have become the de facto starting point for information acquisition on the Web. However, because of web spam, search results are not always as good as desired. Moreover, spam evolves, which makes the problem of providing high-quality search even more challenging. Over the last decade, research on adversarial information retrieval has attracted considerable interest from both academia and industry. In this paper we present a systematic review of web spam detection techniques, focusing on algorithms and underlying principles. We group existing algorithms into three categories according to the type of information they use: content-based methods, link-based methods, and methods based on non-traditional data such as user behaviour, clicks, and HTTP sessions. We further subdivide the link-based category into five groups according to the ideas and principles used: label propagation, link pruning and reweighting, label refinement, graph regularization, and feature-based methods. We also define the concept of web spam numerically and provide a brief survey of various spam forms. Finally, we summarize the observations and underlying principles applied in web spam detection.
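To make the first link-based group, label propagation, more concrete, the following is a minimal sketch of one common form of the idea: trust is spread from a small set of hand-labelled good pages along out-links, in the style of a seed-biased PageRank. The function name `propagate_trust`, the toy graph, the damping value, and the iteration count are all illustrative assumptions and are not taken from the paper.

```python
def propagate_trust(out_links, trusted_seeds, damping=0.85, iterations=50):
    """Spread trust from seed pages along out-links (seed-biased PageRank style).

    out_links: dict mapping a page to the list of pages it links to.
    trusted_seeds: non-empty set of pages labelled as trustworthy by hand.
    """
    # Collect every page that appears as a source or as a link target.
    nodes = set(out_links)
    for targets in out_links.values():
        nodes.update(targets)

    # Seed distribution: uniform over trusted seeds, zero elsewhere.
    seed = {n: (1.0 / len(trusted_seeds) if n in trusted_seeds else 0.0)
            for n in nodes}
    score = dict(seed)

    for _ in range(iterations):
        # Each round, a (1 - damping) share of trust is re-injected at the seeds.
        nxt = {n: (1.0 - damping) * seed[n] for n in nodes}
        for src, targets in out_links.items():
            if not targets:
                continue  # simplification: trust held by dangling pages is dropped
            share = damping * score[src] / len(targets)
            for dst in targets:
                nxt[dst] += share
        score = nxt
    return score


# Hypothetical toy graph: two good pages link to each other; two spam pages
# link to each other and to a good page, but no good page links to them.
toy_graph = {
    "good1": ["good2"],
    "good2": ["good1"],
    "spam1": ["spam2", "good1"],
    "spam2": ["spam1"],
}
scores = propagate_trust(toy_graph, {"good1"})
```

In this toy graph the spam pages receive no trust, because trust only flows along links that start from pages already holding some. Real web graphs are far noisier, so a score like this would normally be one signal among several rather than a classifier on its own.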

