DOI: 10.1145/2348283.2348338
research-article

Fighting against web spam: a novel propagation method based on click-through data

Published: 12 August 2012

ABSTRACT

Combating Web spam is one of the greatest challenges for Web search engines. State-of-the-art anti-spam techniques focus mainly on detecting particular spam strategies, such as content spamming and link-based spamming. Although these approaches have had much success, they struggle against a continuous barrage of new spamming techniques. We attempt to solve the problem from a new perspective, by noticing that queries that are more likely to lead to spam pages/sites have the following characteristics: 1) they are popular or reflect heavy demand from search engine users, and 2) there are usually few key resources or authoritative results for them. Based on these observations, we propose a novel method, built on click-through data analysis, that propagates a spamicity score iteratively between queries and URLs, starting from a few seed pages/sites. Once we obtain the seed pages/sites, we use the link structure of the click-through bipartite graph to discover other pages/sites that are likely to be spam. Experiments show that our algorithm is both efficient and effective in detecting Web spam. Moreover, combining our method with popular anti-spam techniques such as TrustRank improves on each technique taken individually.
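The abstract does not give the update rule, so the following is only a minimal sketch of what iterative score propagation on a click-through bipartite graph could look like. The function name `propagate_spamicity`, the click-weighted averaging, the `damping` factor, the clamping of seed scores to 1.0, and the toy data are all assumptions made for illustration, not the authors' algorithm.

```python
from collections import defaultdict


def propagate_spamicity(clicks, seeds, iterations=20, damping=0.85):
    """Propagate a spamicity score between queries and URLs.

    clicks: dict mapping (query, url) -> positive click count.
    seeds:  set of URLs assumed to be known spam.
    The averaging rule and damping factor are illustrative assumptions.
    """
    by_query = defaultdict(dict)  # query -> {url: clicks}
    by_url = defaultdict(dict)    # url -> {query: clicks}
    for (query, url), count in clicks.items():
        by_query[query][url] = count
        by_url[url][query] = count

    url_score = {u: (1.0 if u in seeds else 0.0) for u in by_url}
    query_score = {q: 0.0 for q in by_query}

    for _ in range(iterations):
        # A query's score is the click-weighted mean of its clicked URLs' scores.
        for q, urls in by_query.items():
            total = sum(urls.values())
            query_score[q] = sum(c * url_score[u] for u, c in urls.items()) / total
        # A URL's score is the damped click-weighted mean of its queries' scores;
        # seed URLs stay clamped at 1.0.
        for u, queries in by_url.items():
            if u in seeds:
                url_score[u] = 1.0
                continue
            total = sum(queries.values())
            url_score[u] = damping * sum(
                c * query_score[q] for q, c in queries.items()
            ) / total
    return url_score, query_score


# Hypothetical toy click log: one seed spam URL shares a query with u1.example.
clicks = {
    ("q1", "s1.example"): 10,
    ("q1", "u1.example"): 10,
    ("q2", "u2.example"): 5,
}
seeds = {"s1.example"}
url_score, query_score = propagate_spamicity(clicks, seeds)
for url in sorted(url_score, key=url_score.get, reverse=True):
    print(url, round(url_score[url], 3))
```

In this toy graph, `u1.example` picks up a nonzero score because it shares a query with the seed, while `u2.example`, reachable only through an unrelated query, stays at zero. A real system would also need a stopping criterion and a threshold for flagging pages, neither of which the abstract specifies.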


Published in

SIGIR '12: Proceedings of the 35th international ACM SIGIR conference on Research and development in information retrieval
August 2012
1236 pages
ISBN: 9781450314725
DOI: 10.1145/2348283

Copyright © 2012 ACM

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery
New York, NY, United States


Acceptance Rates

Overall Acceptance Rate: 792 of 3,983 submissions, 20%
