Spam query detection using stream clustering

Shakiba, Tahere; Zarifzadeh, Sajjad; Derhami, Vali

doi:10.1007/s11280-017-0471-z

Published: 05 June 2017

World Wide Web, volume 21, pages 557–572 (2018)

Abstract

Search engines are now the gateway through which users reach information on the Web. Malicious users, however, can exploit them to support attacks by submitting excessive volumes of bot-generated queries, called spam queries. In this paper, we propose a novel semi-supervised method that detects spam queries effectively and practically. We first train a model that characterizes normal and malicious users from the linguistic properties of queries and the behavioral characteristics of users and IP addresses. We then use the trained model to predict the label of each arriving request with a fast, efficient algorithm based on stream clustering. Our evaluation on the real log of a local search engine shows that the proposed algorithm achieves an accuracy of about 94%, with low response time and memory overhead.
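
The abstract does not give the details of the algorithm, so the following is only a minimal sketch of how label prediction over a stream of requests might work with micro-clusters. It is not the authors' method. The class names, the two features (queries per minute and mean query length), the Euclidean distance, and the fixed radius are all assumptions made for illustration.

```python
import math
from dataclasses import dataclass


@dataclass
class MicroCluster:
    """Summary of the points absorbed so far (hypothetical structure)."""
    n: int             # number of points absorbed
    linear_sum: list   # per-feature sum of the absorbed points
    label: str         # "spam" or "normal"

    def centroid(self):
        return [s / self.n for s in self.linear_sum]

    def absorb(self, x):
        # Incremental update: no raw points are stored.
        self.n += 1
        self.linear_sum = [s + v for s, v in zip(self.linear_sum, x)]


def distance(a, b):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))


class StreamClassifier:
    """Labels arriving feature vectors using labeled micro-clusters."""

    def __init__(self, radius):
        self.radius = radius   # assumed fixed absorption radius
        self.clusters = []

    def seed(self, x, label):
        # Labeled examples from the training phase start the clusters.
        self.clusters.append(MicroCluster(1, list(x), label))

    def predict(self, x):
        # Requires at least one seeded cluster.
        nearest = min(self.clusters, key=lambda c: distance(c.centroid(), x))
        if distance(nearest.centroid(), x) <= self.radius:
            nearest.absorb(x)
        else:
            # Too far from every cluster: open a new one that inherits
            # the label of the nearest existing cluster.
            self.clusters.append(MicroCluster(1, list(x), nearest.label))
        return nearest.label


# Hypothetical features: (queries per minute, mean query length in words).
clf = StreamClassifier(radius=5.0)
clf.seed((2.0, 3.0), "normal")
clf.seed((60.0, 8.0), "spam")
print(clf.predict((3.0, 3.5)))
print(clf.predict((55.0, 7.0)))
```

Each cluster keeps only a count and a per-feature sum, so memory grows with the number of clusters rather than the number of requests. A practical version would also need feature scaling and some way to expire stale clusters, neither of which is shown here.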

Author information

Authors and Affiliations

Yazd University, Yazd, Iran
Tahere Shakiba, Sajjad Zarifzadeh & Vali Derhami

Corresponding author

Correspondence to Sajjad Zarifzadeh.

About this article

Cite this article

Shakiba, T., Zarifzadeh, S. & Derhami, V. Spam query detection using stream clustering. World Wide Web 21, 557–572 (2018). https://doi.org/10.1007/s11280-017-0471-z

Received: 25 November 2016
Revised: 07 May 2017
Accepted: 26 May 2017
Published: 05 June 2017
Issue Date: March 2018
DOI: https://doi.org/10.1007/s11280-017-0471-z
