DOI: 10.1145/1451983.1451986
research-article

Identifying web spam with user behavior analysis

Published: 22 April 2008

Abstract

Combating Web spam has become one of the top challenges for Web search engines. State-of-the-art spam detection techniques are usually designed for specific, known types of Web spam and are ineffective and inefficient against newly appeared spam. Drawing on an analysis of user behavior in Web access logs, we propose a spam page detection algorithm based on Bayesian learning. The main contributions of our work are: (1) We study how users visit spam pages and propose three user behavior features that separate Web spam pages from ordinary ones. (2) We propose a novel spam detection framework that uses user behavior analysis to detect unknown spam types and newly appeared spam. Preliminary experiments on large-scale Web access log data (containing over 2.74 billion user clicks) show the effectiveness of the proposed features and detection framework.
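As a rough illustration of what Bayesian classification over a handful of per-page behavior features can look like, the sketch below fits a Gaussian naive Bayes model in plain Python. It is not the paper's algorithm: the abstract does not name the three features or the exact Bayesian formulation, so the three numeric columns, the toy values, and the helper names `fit_gaussian_nb` and `posterior` are all assumptions made for illustration.

```python
import math
from collections import defaultdict


def fit_gaussian_nb(rows, labels):
    """Estimate a class prior and per-feature mean/variance for each label."""
    grouped = defaultdict(list)
    for row, label in zip(rows, labels):
        grouped[label].append(row)
    model = {}
    for label, members in grouped.items():
        stats = []
        for column in zip(*members):
            mean = sum(column) / len(column)
            var = sum((x - mean) ** 2 for x in column) / len(column)
            stats.append((mean, var + 1e-6))  # small floor avoids zero variance
        model[label] = (len(members) / len(rows), stats)
    return model


def posterior(model, row):
    """Return P(label | row), treating features as independent given the label."""
    log_scores = {}
    for label, (prior, stats) in model.items():
        score = math.log(prior)
        for x, (mean, var) in zip(row, stats):
            score += -0.5 * math.log(2 * math.pi * var) - (x - mean) ** 2 / (2 * var)
        log_scores[label] = score
    peak = max(log_scores.values())  # subtract the max for numerical stability
    weights = {k: math.exp(v - peak) for k, v in log_scores.items()}
    norm = sum(weights.values())
    return {k: w / norm for k, w in weights.items()}


# Hypothetical per-page feature vectors (three made-up behavior rates in [0, 1]).
rows = [
    (0.90, 0.10, 0.80),
    (0.85, 0.05, 0.90),
    (0.95, 0.15, 0.85),
    (0.20, 0.60, 0.10),
    (0.30, 0.70, 0.20),
    (0.25, 0.55, 0.15),
]
labels = ["spam", "spam", "spam", "ordinary", "ordinary", "ordinary"]

model = fit_gaussian_nb(rows, labels)
probs = posterior(model, (0.88, 0.08, 0.82))
print(max(probs, key=probs.get))
```

Because such a model scores pages only by how users reach and leave them, it does not depend on any particular spamming technique, which is the property the abstract emphasizes for unknown and newly appeared spam types.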




      Published In

      AIRWeb '08: Proceedings of the 4th international workshop on Adversarial information retrieval on the web
      April 2008
      81 pages
      ISBN:9781605581590
      DOI:10.1145/1451983
      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

      Publisher

      Association for Computing Machinery

      New York, NY, United States


      Author Tags

      1. spam detection
      2. user behavior analysis
      3. web search engine

      Qualifiers

      • Research-article


      Conference

      AIRWeb'08

      Bibliometrics

      Article Metrics

      • Downloads (Last 12 months): 7
      • Downloads (Last 6 weeks): 0
      Reflects downloads up to 27 Feb 2025


      Citations

      Cited By

      • (2025) Signed Latent Factors for Spamming Activity Detection. IEEE Transactions on Information Forensics and Security 20, 651-664. DOI: 10.1109/TIFS.2024.3516573. Online publication date: 2025
      • (2022) SkillFence. Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies 6:1, 1-26. DOI: 10.1145/3517232. Online publication date: 29-Mar-2022
      • (2021) Detection of Anomaly User Behaviors Based on Deep Neural Networks*. 2021 IEEE 20th International Conference on Trust, Security and Privacy in Computing and Communications (TrustCom), 1240-1245. DOI: 10.1109/TrustCom53373.2021.00169. Online publication date: Oct-2021
      • (2020) Classification of Spamming Attacks to Blogging Websites and Their Security Techniques. Encyclopedia of Criminal Activities and the Deep Web, 864-880. DOI: 10.4018/978-1-5225-9715-5.ch058. Online publication date: 2020
      • (2020) Detecting Web Spam Based on Novel Features from Web Page Source Code. Security and Communication Networks 2020. DOI: 10.1155/2020/6662166. Online publication date: 17-Dec-2020
      • (2020) Recommending Inferior Results: A General and Feature-Free Model for Spam Detection. Proceedings of the 29th ACM International Conference on Information & Knowledge Management, 955-974. DOI: 10.1145/3340531.3411900. Online publication date: 19-Oct-2020
      • (2020) Trustworthy Website Detection Based on Social Hyperlink Network Analysis. IEEE Transactions on Network Science and Engineering 7:1, 54-65. DOI: 10.1109/TNSE.2018.2866066. Online publication date: 1-Jan-2020
      • (2020) Predicting Users’ Revisitation Behaviour Based on Web Access Contextual Clusters. 2020 8th International Conference on Information and Communication Technology (ICoICT), 1-6. DOI: 10.1109/ICoICT49345.2020.9166179. Online publication date: Jun-2020
      • (2019) In Search of a Stochastic Model for the E-News Reader. ACM Transactions on Knowledge Discovery from Data 13:6, 1-27. DOI: 10.1145/3362695. Online publication date: 13-Nov-2019
      • (2018) FS2RNN: Feature Selection Scheme for Web Spam Detection Using Recurrent Neural Networks. 2018 IEEE Global Communications Conference (GLOBECOM), 1-6. DOI: 10.1109/GLOCOM.2018.8647294. Online publication date: 9-Dec-2018
