ABSTRACT
Using search engines to retrieve information has become an important part of people's daily lives. For most search engines, click information is an important factor in document ranking. As a result, some websites cheat to obtain a higher rank by fraudulently increasing clicks to their pages, a practice referred to as "click spam". Based on an analysis of the features of fraudulent clicks, this paper proposes a novel automatic click spam detection approach that consists of three steps: (1) modeling user sessions as sequences of triples, which, to the best of our knowledge, is the first model to take into account not only the user action but also the action objective and the time interval between actions; (2) using a user-session bipartite graph propagation algorithm to start from known cheating users and find more cheating sessions; and (3) using a pattern-session bipartite graph propagation algorithm to obtain cheating session patterns, which raises both the precision and the recall of click spam detection. Experiments on real-world log data from a Chinese commercial search engine, containing approximately 80 million user clicks per day, show that 2.6% of all clicks were detected as spam with a precision of up to 97%.
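The following is a minimal Python sketch of the two ideas named in the abstract: turning a session into (action, objective, time interval) triples, and propagating spam scores across a bipartite graph. It is an illustration under stated assumptions, not the paper's algorithm. The abstract does not give the update rule, so the averaging rule, the `damping` parameter, the interval bucket boundaries, and all names (`Event`, `to_triples`, `propagate`) are hypothetical.

```python
from collections import defaultdict
from typing import NamedTuple


class Event(NamedTuple):
    action: str       # e.g. "query" or "click"
    objective: str    # e.g. the query text or the clicked URL
    timestamp: float  # seconds since the session started


def to_triples(events, bucket_edges=(1.0, 10.0, 60.0)):
    """Map time-ordered events to (action, objective, interval_bucket) triples.

    bucket_edges are hypothetical boundaries (in seconds) used to
    discretize the gap since the previous action.
    """
    triples = []
    previous_time = None
    for event in events:
        gap = 0.0 if previous_time is None else event.timestamp - previous_time
        # The bucket index is the number of boundaries the gap reaches.
        bucket = sum(gap >= edge for edge in bucket_edges)
        triples.append((event.action, event.objective, bucket))
        previous_time = event.timestamp
    return triples


def propagate(edges, seed_scores, iterations=3, damping=0.5):
    """Spread spam scores over a bipartite graph (assumed update rule).

    edges: iterable of (left, right) pairs, e.g. (user, session) or
           (pattern, session).
    seed_scores: dict mapping known cheating right-side nodes to a
                 score in [0, 1].
    Returns (left_scores, right_scores).
    """
    left_to_right = defaultdict(set)
    right_to_left = defaultdict(set)
    for left, right in edges:
        left_to_right[left].add(right)
        right_to_left[right].add(left)

    right_scores = {r: seed_scores.get(r, 0.0) for r in right_to_left}
    left_scores = {}
    for _ in range(iterations):
        # A left node (user or pattern) takes the mean score of its sessions.
        left_scores = {
            left: sum(right_scores[r] for r in rights) / len(rights)
            for left, rights in left_to_right.items()
        }
        # A session blends its old score with its neighbors' mean score,
        # but never drops below its seed value.
        updated = {}
        for right, lefts in right_to_left.items():
            neighbor_mean = sum(left_scores[l] for l in lefts) / len(lefts)
            blended = damping * neighbor_mean + (1 - damping) * right_scores[right]
            updated[right] = max(seed_scores.get(right, 0.0), blended)
        right_scores = updated
    return left_scores, right_scores
```

In this sketch, a user whose sessions are mostly seeded as cheating pulls the scores of that user's remaining sessions upward, while users with no seeded sessions keep a score of zero. The same `propagate` function can be called with (pattern, session) edges to score patterns. A real system would also need a score threshold and a stopping criterion, neither of which is specified in the abstract.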
Index Terms
- Search engine click spam detection based on bipartite graph propagation