research-article

Identifying web spam with user behavior analysis

Authors:

Liyun Ru

AIRWeb '08: Proceedings of the 4th international workshop on Adversarial information retrieval on the web

Pages 9 - 16

https://doi.org/10.1145/1451983.1451986

Published: 22 April 2008

Abstract

Combating Web spam has become one of the top challenges for Web search engines. State-of-the-art spam detection techniques are usually designed for specific, known types of Web spam and are ineffective or inefficient against newly appearing spam. By analyzing user behavior in Web access logs, we propose a spam page detection algorithm based on Bayesian learning. The main contributions of our work are: (1) User visiting patterns of spam pages are studied, and three user behavior features are proposed to separate Web spam pages from ordinary pages. (2) A novel spam detection framework is proposed that can detect unknown and newly appearing spam types with the help of user behavior analysis. Preliminary experiments on large-scale Web access log data (containing over 2.74 billion user clicks) show the effectiveness of the proposed features and detection framework.
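To make the idea of Bayesian learning over user behavior features concrete, the following is a minimal sketch of a categorical naive Bayes classifier with Laplace smoothing. It is not the paper's implementation: the abstract does not name the three features, so the feature columns, the binarization into 0 (low) and 1 (high), the toy rows, and the function names `train` and `posterior` are all illustrative assumptions.

```python
import math
from collections import Counter, defaultdict

# Hypothetical toy data: each row holds three binarized behavior features
# (0 = low, 1 = high). The features and values are assumptions for
# illustration only; they are not taken from the paper.
TOY_ROWS = [(1, 1, 1), (1, 1, 0), (1, 0, 1), (0, 0, 0), (0, 1, 0), (0, 0, 1)]
TOY_LABELS = ["spam", "spam", "spam", "ordinary", "ordinary", "ordinary"]


def train(rows, labels, alpha=1.0):
    """Count class and per-feature value frequencies for naive Bayes.

    rows   -- list of tuples of discrete feature values
    labels -- list of class labels, same length as rows
    alpha  -- Laplace smoothing constant
    """
    class_counts = Counter(labels)
    value_counts = defaultdict(Counter)  # (class, feature index) -> value counts
    values_seen = defaultdict(set)       # feature index -> distinct values
    for row, cls in zip(rows, labels):
        for i, value in enumerate(row):
            value_counts[(cls, i)][value] += 1
            values_seen[i].add(value)
    return {
        "class_counts": class_counts,
        "value_counts": value_counts,
        "values_seen": values_seen,
        "alpha": alpha,
        "n": len(labels),
    }


def posterior(model, row):
    """Return P(class | row) for each class, assuming independent features."""
    alpha = model["alpha"]
    log_scores = {}
    for cls, n_cls in model["class_counts"].items():
        score = math.log(n_cls / model["n"])  # log prior P(class)
        for i, value in enumerate(row):
            count = model["value_counts"][(cls, i)][value]
            n_values = len(model["values_seen"][i])
            # Smoothed likelihood P(feature_i = value | class)
            score += math.log((count + alpha) / (n_cls + alpha * n_values))
        log_scores[cls] = score
    # Normalize in log space to avoid underflow with many features.
    top = max(log_scores.values())
    total = sum(math.exp(s - top) for s in log_scores.values())
    return {cls: math.exp(s - top) / total for cls, s in log_scores.items()}
```

Calling `posterior(train(TOY_ROWS, TOY_LABELS), (1, 1, 1))` returns a dictionary of class probabilities that sums to 1. In practice the features would be computed per page from access-log statistics and discretized before training.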

Index Terms

1. Information systems
  1. Information retrieval
  2. Information storage systems

Published In

AIRWeb '08: Proceedings of the 4th international workshop on Adversarial information retrieval on the web

April 2008

81 pages

ISBN: 9781605581590

DOI: 10.1145/1451983

Editors: Carlos Castillo (Yahoo! Research), Kumar Chellapilla (Microsoft Live Labs), Dennis Fetterly (Microsoft Research)

Copyright © 2008 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Conference

AIRWeb '08: Fourth International Workshop on Adversarial Information Retrieval on the Web

April 22, 2008

Beijing, China
