Survey on web spam detection: principles and algorithms

Authors:
Nikita Spirin, University of Illinois at Urbana-Champaign, Urbana, IL, USA
Jiawei Han, University of Illinois at Urbana-Champaign, Urbana, IL, USA

ACM SIGKDD Explorations Newsletter, Volume 13, Issue 2, December 2011, pp. 50–64. https://doi.org/10.1145/2207243.2207252

Published: 01 May 2012

Abstract

Search engines have become the de facto starting point for information acquisition on the Web. However, because of web spam, search results are not always as good as desired. Moreover, spam evolves, which makes the problem of providing high-quality search even more challenging. Over the last decade, research on adversarial information retrieval has attracted considerable interest from both academia and industry. In this paper we present a systematic review of web spam detection techniques, focusing on algorithms and their underlying principles. We group existing algorithms into three categories according to the type of information they use: content-based methods, link-based methods, and methods based on non-traditional data such as user behaviour, clicks, and HTTP sessions. We further subdivide the link-based category into five groups according to the ideas and principles used: labels propagation, link pruning and reweighting, labels refinement, graph regularization, and feature-based methods. We also define the concept of web spam numerically and provide a brief survey of various spam forms. Finally, we summarize the observations and underlying principles applied in web spam detection.

Published in

ACM SIGKDD Explorations Newsletter Volume 13, Issue 2
December 2011
101 pages
ISSN: 1931-0145
EISSN: 1931-0153
DOI: 10.1145/2207243

Copyright © 2012 Authors
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
classification
cloaking
clustering
collusion
content spam
graph regularization
labels propagation
link farm
link spam
pagerank
random walk
user behaviour
web search
web spam detection
Qualifiers
- research-article