Abstract
Search engines serve as a gateway through which users reach the information they need on the Web. However, malicious users can also exploit them to facilitate attacks by submitting excessive numbers of bot-generated queries, called spam queries. In this paper, we propose a novel semi-supervised method that detects spam queries effectively and practically. We first train a model to characterize normal and malicious users, using the linguistic properties of queries as well as the behavioral characteristics of users and IP addresses. We then use the trained model to predict the label of arriving requests with a fast and efficient algorithm based on stream clustering. Our evaluation on the real log of a local search engine shows that the proposed algorithm achieves an accuracy of about 94%, while incurring low response-time and memory overhead.
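To make the idea of labeling arriving requests by stream clustering more concrete, the following is a minimal generic sketch and not the algorithm proposed in the paper. It assumes each request has already been turned into a numeric feature vector and that a training phase has produced labeled micro-clusters. The class `MicroCluster`, the function `classify`, the two example features, and the `max_radius` threshold are all hypothetical names and values chosen for illustration.

```python
import math
from dataclasses import dataclass


@dataclass
class MicroCluster:
    """Running summary of absorbed points: a count and a per-feature sum."""
    label: str
    count: int
    linear_sum: list

    def centroid(self):
        # Mean of all points absorbed so far.
        return [s / self.count for s in self.linear_sum]

    def absorb(self, point):
        # Incremental update: no past points need to be stored.
        self.count += 1
        self.linear_sum = [s + x for s, x in zip(self.linear_sum, point)]


def classify(point, clusters, max_radius=1.0):
    """Label `point` by its nearest micro-cluster and update that cluster.

    Returns "unknown" when no centroid lies within `max_radius`
    (an assumed threshold, not a value from the paper).
    """
    best, best_dist = None, math.inf
    for cluster in clusters:
        dist = math.dist(point, cluster.centroid())
        if dist < best_dist:
            best, best_dist = cluster, dist
    if best is None or best_dist > max_radius:
        return "unknown"
    best.absorb(point)
    return best.label


# Hypothetical feature vectors: (query rate, mean query length), both scaled to [0, 1].
clusters = [
    MicroCluster(label="normal", count=1, linear_sum=[0.1, 0.5]),
    MicroCluster(label="spam", count=1, linear_sum=[0.9, 0.2]),
]
```

Because each cluster keeps only a count and a sum, memory use stays constant per cluster and each request costs one pass over the cluster list, which is the general property that makes this family of methods attractive for low-latency labeling.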
Cite this article
Shakiba, T., Zarifzadeh, S. & Derhami, V. Spam query detection using stream clustering. World Wide Web 21, 557–572 (2018). https://doi.org/10.1007/s11280-017-0471-z
DOI: https://doi.org/10.1007/s11280-017-0471-z