Spam query detection using stream clustering

World Wide Web

Abstract

Search engines are now the main gateway through which users reach information on the Web. Malicious users, however, can exploit them to facilitate attacks by submitting excessive numbers of bot-generated queries, called spam queries. In this paper, we propose a novel semi-supervised method that detects spam queries effectively and in a practical manner. We first train a model to characterize normal and malicious users, using the linguistic properties of queries as well as the behavioral characteristics of users and IP addresses. We then use the trained model to predict the label of arriving requests with a fast and efficient algorithm based on stream clustering. Our evaluation on the real log of a local search engine shows that the proposed algorithm achieves an accuracy of about 94%, while incurring low response-time and memory overhead.
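To make the two-phase idea concrete, the following is a minimal sketch of labeling a request stream with incrementally updated clusters. It is not the algorithm of the paper, whose details are not given in the abstract. It assumes that the offline training phase yields a few labeled seed points, that each request is summarized as a numeric feature vector, and that a fixed distance threshold decides when a cluster absorbs a new point. The names `MicroCluster`, `StreamLabeler` and `radius`, as well as the two example features, are hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass
class MicroCluster:
    """Constant-size summary of a cluster: label, point count, per-feature sum."""
    label: str
    n: int
    linear_sum: list

    def centroid(self):
        return [s / self.n for s in self.linear_sum]

    def absorb(self, x):
        # Incremental update: no past points need to be stored.
        self.n += 1
        self.linear_sum = [s + v for s, v in zip(self.linear_sum, x)]


def distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))


class StreamLabeler:
    def __init__(self, seeds, radius):
        # seeds: (feature_vector, label) pairs, assumed to come from training.
        self.clusters = [MicroCluster(label, 1, list(x)) for x, label in seeds]
        self.radius = radius  # hypothetical absorption threshold

    def predict(self, x):
        nearest = min(self.clusters, key=lambda c: distance(c.centroid(), x))
        # Only nearby points move the centroid; distant points are labeled
        # by the nearest cluster but leave its summary unchanged.
        if distance(nearest.centroid(), x) <= self.radius:
            nearest.absorb(x)
        return nearest.label


# Hypothetical features: [queries per minute, click ratio]
seeds = [([0.2, 1.0], "normal"), ([30.0, 0.1], "spam")]
labeler = StreamLabeler(seeds, radius=5.0)
print(labeler.predict([0.5, 0.9]))
print(labeler.predict([28.0, 0.2]))
```

Each prediction costs one distance computation per cluster, and memory stays proportional to the number of clusters rather than the number of requests seen, which is the property that makes a clustering summary attractive for high-volume query logs.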

Author information

Correspondence to Sajjad Zarifzadeh.

About this article

Cite this article

Shakiba, T., Zarifzadeh, S. & Derhami, V. Spam query detection using stream clustering. World Wide Web 21, 557–572 (2018). https://doi.org/10.1007/s11280-017-0471-z

  • DOI: https://doi.org/10.1007/s11280-017-0471-z
