Identifying Web Spam with the Wisdom of the Crowds

Published: 01 March 2012

Abstract

Combating Web spam has become one of the top challenges for Web search engines. State-of-the-art spam-detection techniques are usually designed for specific, known types of Web spam and cannot deal efficiently with newly appearing spam types. Using user-behavior analysis of Web access logs, a learning-based spam-page detection algorithm is proposed. The main contributions are the following. (1) User-visiting patterns of spam pages are studied, and a number of user-behavior features are proposed for separating Web spam pages from ordinary pages. (2) A novel spam-detection framework is proposed that, with the help of user-behavior analysis, can detect various kinds of Web spam, including newly appearing ones. Experiments on large-scale, real-world Web access log data show the effectiveness of the proposed features and the detection framework.
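To make the idea of a user-behavior feature concrete, here is a minimal sketch of how one such signal might be computed from an access log. The abstract does not name the features used, so everything here is an assumption for illustration: the `(url, referrer)` record layout, the `SEARCH_HOSTS` set, and the `search_referral_rate` function are hypothetical and are not taken from the article.

```python
from collections import defaultdict
from urllib.parse import urlparse

# Hypothetical set of search-engine hosts (an assumption, not from the article).
SEARCH_HOSTS = {"www.google.com", "www.bing.com", "www.baidu.com"}


def search_referral_rate(log_records):
    """For each visited URL, return the fraction of visits that arrived
    from a search-engine results page.

    log_records is an iterable of (url, referrer) pairs; this layout is
    an assumed simplification of a real Web access log.
    """
    total = defaultdict(int)
    from_search = defaultdict(int)
    for url, referrer in log_records:
        total[url] += 1
        # An empty referrer (direct visit) never counts as a search referral.
        if referrer and urlparse(referrer).hostname in SEARCH_HOSTS:
            from_search[url] += 1
    return {url: from_search[url] / total[url] for url in total}


# Toy records, invented for illustration only.
records = [
    ("http://a.example/", "https://www.google.com/search?q=x"),
    ("http://a.example/", "https://www.bing.com/search?q=y"),
    ("http://b.example/", "http://c.example/page"),
    ("http://b.example/", "https://www.baidu.com/s?wd=z"),
    ("http://b.example/", ""),
]
rates = search_referral_rate(records)
```

A per-page value like this could serve as one input among several to a learned classifier; on its own it is only a feature and does not decide whether a page is spam.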

Published In

ACM Transactions on the Web, Volume 6, Issue 1
March 2012
109 pages
ISSN:1559-1131
EISSN:1559-114X
DOI:10.1145/2109205
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from ACM.

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 01 March 2012
Accepted: 01 June 2011
Revised: 01 March 2011
Received: 01 November 2009
Published in TWEB Volume 6, Issue 1

Author Tags

  1. Spam detection
  2. Web search engine
  3. User behavior analysis

Qualifiers

  • Research-article
  • Research
  • Refereed
