Identifying Web Spam with the Wisdom of the Crowds

Authors:

Liyun Ru

ACM Transactions on the Web (TWEB), Volume 6, Issue 1

Article No.: 2, Pages 1 - 30

https://doi.org/10.1145/2109205.2109207

Published: 01 March 2012

Abstract

Combating Web spam has become one of the top challenges for Web search engines. State-of-the-art spam-detection techniques are usually designed for specific, known types of Web spam and are incapable of dealing with newly appearing spam types efficiently. With user-behavior analyses from Web access logs, a spam page-detection algorithm is proposed based on a learning scheme. The main contributions are the following. (1) User-visiting patterns of spam pages are studied, and a number of user-behavior features are proposed for separating Web spam pages from ordinary pages. (2) A novel spam-detection framework is proposed that can detect various kinds of Web spam, including newly appearing ones, with the help of the user-behavior analysis. Experiments on large-scale practical Web access log data show the effectiveness of the proposed features and the detection framework.
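
The abstract does not list the user-behavior features themselves. As a rough illustration only of what a feature derived from Web access logs might look like, the sketch below computes, per page, the fraction of visits that arrive from a search engine referrer. The feature choice, the `(page_url, referrer_url)` record format, the `SEARCH_HOSTS` set, and the function name `search_referral_rate` are all assumptions made for this example and are not taken from the article.

```python
from collections import defaultdict
from urllib.parse import urlparse

# Hypothetical set of search-engine hosts; an assumption for illustration only.
SEARCH_HOSTS = {"www.google.com", "www.bing.com", "www.baidu.com"}


def search_referral_rate(log_records):
    """Return, per page URL, the fraction of visits referred by a search engine.

    log_records: iterable of (page_url, referrer_url) pairs.
    An empty referrer string stands for a direct visit.
    """
    visits = defaultdict(int)
    from_search = defaultdict(int)
    for page_url, referrer_url in log_records:
        visits[page_url] += 1
        # urlparse("").hostname is None, which is never in SEARCH_HOSTS.
        if urlparse(referrer_url).hostname in SEARCH_HOSTS:
            from_search[page_url] += 1
    return {page: from_search[page] / visits[page] for page in visits}


# Toy access-log records (made up for the example).
records = [
    ("http://example.com/a", "http://www.google.com/search?q=x"),
    ("http://example.com/a", ""),
    ("http://example.com/b", "http://www.baidu.com/s?wd=y"),
    ("http://example.com/b", "http://www.bing.com/search?q=y"),
]
rates = search_referral_rate(records)
```

A per-page value like this could serve as one input column to a learned classifier. Whether it actually separates spam pages from ordinary pages is an empirical question that this sketch does not answer.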

Information

Published In

ACM Transactions on the Web Volume 6, Issue 1

March 2012

109 pages

ISSN:1559-1131

EISSN:1559-114X

DOI:10.1145/2109205

Copyright © 2012 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 01 March 2012

Accepted: 01 June 2011

Revised: 01 March 2011

Received: 01 November 2009

Published in TWEB Volume 6, Issue 1

Qualifiers

Research-article
Research
Refereed

Bibliometrics

Total Citations: 37
Total Downloads: 861
Downloads (Last 12 months): 10
Downloads (Last 6 weeks): 0

Reflects downloads up to 27 Feb 2025

Cited By

Lu X, Gu D, Zhang H, Song Z, Cai Q, Zhao H, Wu H (2022). Semi-Supervised Sentiment Classification on E-Commerce Reviews Using Tripartite Graph and Clustering. International Journal of Data Warehousing and Mining 18:1 (1-20). Online publication date: 1-Jan-2022.
https://doi.org/10.4018/IJDWM.307904
Chen X, Mao J, Liu Y, Zhang M, Ma S (2022). Investigating human reading behavior during sentiment judgment. International Journal of Machine Learning and Cybernetics 13:8 (2283-2296). Online publication date: 6-Mar-2022.
https://doi.org/10.1007/s13042-022-01523-9
Pes F, Sciarrone F, Temperini M (2022). A Deep Learning System to Help Students Build Concept Maps. Learning Technologies and Systems (321-332). Online publication date: 21-Nov-2022.
https://dl.acm.org/doi/10.1007/978-3-031-33023-0_29
Wang S (2022). The Use of Verbs in International Chinese Language Education. Learning Technologies and Systems (282-289). Online publication date: 21-Nov-2022.
https://dl.acm.org/doi/10.1007/978-3-031-33023-0_25
Wang S (2022). The Syntactic Features of Chinese Verbs of Saluting. Chinese Lexical Semantics (448-463). Online publication date: 14-May-2022.
https://dl.acm.org/doi/10.1007/978-3-031-28953-8_33
Cheng Q, Ren Z, Lin Y, Ren P, Chen Z, Liu X, de Rijke M (2021). Long Short-Term Session Search: Joint Personalized Reranking and Next Query Prediction. Proceedings of the Web Conference 2021 (239-248). Online publication date: 19-Apr-2021.
https://dl.acm.org/doi/10.1145/3442381.3449941
Wang J, Han L, Zhou M, Qian W, An D (2021). Adaptive evaluation model of web spam based on link relation. Transactions on Emerging Telecommunications Technologies 32:5. Online publication date: 7-May-2021.
https://dl.acm.org/doi/10.1002/ett.4047
Liu J, Su Y, Lv S, Huang C (2020). Detecting Web Spam Based on Novel Features from Web Page Source Code. Security and Communication Networks 2020. Online publication date: 17-Dec-2020.
https://dl.acm.org/doi/10.1155/2020/6662166
Teng H, Liu Y, Yin H, Li Y (2020). Adaptive Multi-stage Multi-strategy Word Representation Learning in HowNet. Proceedings of the ACM Turing Celebration Conference - China (151-155). Online publication date: 22-May-2020.
https://dl.acm.org/doi/10.1145/3393527.3393553
Kaur P, Gosain A (2020). GT2FS-SMOTE: An Intelligent Oversampling Approach Based Upon General Type-2 Fuzzy Sets to Detect Web Spam. Arabian Journal for Science and Engineering 46:4 (3033-3050). Online publication date: 15-Oct-2020.
https://doi.org/10.1007/s13369-020-04995-5
