DOI: 10.1145/2556195.2556214
research-article

Search engine click spam detection based on bipartite graph propagation

Published: 24 February 2014

ABSTRACT

Using search engines to retrieve information has become an important part of people's daily lives. For most search engines, click information is an important factor in document ranking. As a result, some websites cheat to obtain a higher rank by fraudulently increasing clicks to their pages, a practice referred to as "click spam". Based on an analysis of the features of fraudulent clicks, this paper proposes a novel automatic click spam detection approach that consists of three steps: (1) modeling user sessions with a triple sequence, which, to the best of our knowledge, is the first to take into account not only the user action but also the action objective and the time interval between actions; (2) using the user-session bipartite graph propagation algorithm to take advantage of cheating users to find more cheating sessions; and (3) using the pattern-session bipartite graph propagation algorithm to obtain cheating session patterns and thereby achieve higher precision and recall of click spam detection. Experimental results on a Chinese commercial search engine, using real-world log data containing approximately 80 million user clicks per day, show that 2.6% of all clicks were detected as spam with a precision of up to 97%.
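
The abstract does not give the update rules of the propagation algorithms, so the following is only a minimal sketch of the general idea, not the authors' method. It assumes a session is a list of (action, objective, interval) triples and that spam scores spread from a few known cheating sessions across a bipartite graph by damped averaging. The names `Triple` and `propagate`, the `damping` parameter, and the averaging rule are all hypothetical choices made for illustration.

```python
from collections import defaultdict
from typing import NamedTuple


class Triple(NamedTuple):
    """One step of a session (field names are illustrative assumptions)."""
    action: str      # e.g. "query" or "click"
    objective: str   # what the action targeted, e.g. a query string or URL
    interval: float  # seconds since the previous action in the session


def propagate(edges, seed_sessions, iterations=20, damping=0.85):
    """Spread spam scores over a bipartite graph of (left node, session) edges.

    The left node can be a user (user-session graph) or a pattern
    (pattern-session graph). The update rule is an assumption:
    a left node takes the mean score of its sessions, and a non-seed
    session takes the damped mean score of its left nodes.
    """
    left_to_sessions = defaultdict(set)
    session_to_left = defaultdict(set)
    for left, session in edges:
        left_to_sessions[left].add(session)
        session_to_left[session].add(left)

    # Seed sessions start (and stay) at 1.0; all others start at 0.0.
    session_score = {s: 1.0 if s in seed_sessions else 0.0
                     for s in session_to_left}
    left_score = {node: 0.0 for node in left_to_sessions}

    for _ in range(iterations):
        for node, sessions in left_to_sessions.items():
            left_score[node] = sum(session_score[s] for s in sessions) / len(sessions)
        for s, nodes in session_to_left.items():
            if s in seed_sessions:
                continue  # known cheating sessions keep their score
            session_score[s] = damping * sum(left_score[n] for n in nodes) / len(nodes)
    return left_score, session_score


# A toy session written as a triple sequence (values are made up).
example_session = [
    Triple("query", "cheap flights", 0.0),
    Triple("click", "http://example.com/a", 0.4),
    Triple("click", "http://example.com/a", 0.4),
]

# Toy user-session graph: u1 owns two seed sessions and one unlabeled session.
edges = [("u1", "s1"), ("u1", "s2"), ("u1", "s3"), ("u2", "s4")]
user_score, session_score = propagate(edges, seed_sessions={"s1", "s2"})
print(session_score["s3"] > session_score["s4"])
```

In this toy graph the unlabeled session `s3` ends up with a higher score than `s4`, because it shares a user with two seed sessions. The same function could be run on pattern-session edges, with patterns taking the place of users.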


    Published in

      WSDM '14: Proceedings of the 7th ACM international conference on Web search and data mining
      February 2014
      712 pages
      ISBN: 9781450323512
      DOI: 10.1145/2556195

      Copyright © 2014 ACM

      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

      Publisher

      Association for Computing Machinery

      New York, NY, United States


      Acceptance Rates

      WSDM '14 paper acceptance rate: 64 of 355 submissions, 18%
      Overall acceptance rate: 498 of 2,863 submissions, 17%
