Exploiting the Community Structure of Fraudulent Keywords for Fraud Detection in Web Search

Yang, Dong-Hui; Li, Zhen-Yu; Wang, Xiao-Hui; Salamatian, Kavé; Xie, Gao-Gang

doi:10.1007/s11390-021-0218-2

Regular Paper
Published: 30 September 2021

Volume 36, pages 1167–1183, (2021)

Journal of Computer Science and Technology

Dong-Hui Yang^1,2, Zhen-Yu Li^1,2, Xiao-Hui Wang^3, Kavé Salamatian^4 & Gao-Gang Xie^2,5

Abstract

Internet users rely heavily on web search engines to find information. The major source of revenue for search engines is advertising (ads). However, search advertising suffers from fraud: fraudsters generate fake traffic that does not reach the intended audience and increases advertisers' costs. Therefore, it is critical to detect fraud in web search. Previous studies address this problem by detecting fraudsters (especially bots) through their distinctive behaviors. However, such approaches may fail to detect new forms of fraud, such as crowdsourcing fraud, since crowd workers behave in part like normal users. To this end, this paper proposes an approach to detecting fraud in web search from the perspective of fraudulent keywords. We begin by using a unique dataset of 150 million web search logs to examine the discriminating features of fraudulent keywords. Specifically, we model the temporal correlation of fraudulent keywords as a graph, which reveals a very well-connected community structure. Next, we design DFW (detection of fraudulent keywords), which mines the temporal correlations between candidate fraudulent keywords and a given list of seeds. In particular, DFW applies several refinements to filter out non-fraudulent keywords that only occasionally co-occur with seeds. The evaluation using the search logs shows that DFW achieves high fraud detection precision (99%) and accuracy (93%). Further analysis reveals several typical temporal evolution patterns of fraudulent keywords and the co-existence of bots and crowd workers as fraudsters in web search fraud.
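
To make the idea of seed-based temporal correlation concrete, the following is a minimal sketch and not the DFW algorithm itself, whose details are not given in the abstract. It assumes that each log record is a (timestamp, keyword) pair, that temporal correlation can be approximated by the Jaccard similarity of the hourly windows in which two keywords appear, and that a minimum number of shared windows stands in for the refinements that discard occasional co-occurrence. The function names, the window size, and both thresholds are hypothetical.

```python
from collections import defaultdict


def keyword_windows(logs, window_seconds=3600):
    """Map each keyword to the set of time-window indices in which it was searched."""
    windows = defaultdict(set)
    for timestamp, keyword in logs:
        windows[keyword].add(int(timestamp // window_seconds))
    return windows


def jaccard(a, b):
    """Jaccard similarity of two sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def expand_from_seeds(logs, seeds, min_similarity=0.5,
                      min_shared_windows=2, window_seconds=3600):
    """Flag non-seed keywords whose activity windows closely track some seed.

    Assumption: both thresholds are illustrative, not values from the paper.
    """
    windows = keyword_windows(logs, window_seconds)
    flagged = set()
    for keyword, active in windows.items():
        if keyword in seeds:
            continue
        for seed in seeds:
            seed_active = windows.get(seed, set())
            shared = len(active & seed_active)
            # Require repeated co-occurrence so a single shared hour is ignored.
            if (shared >= min_shared_windows
                    and jaccard(active, seed_active) >= min_similarity):
                flagged.add(keyword)
                break
    return flagged


# Hypothetical toy log: "fast cash" follows the seed in every hour,
# while "weather" shares only one hour with it.
logs = [
    (0, "cheap loan"), (10, "fast cash"), (20, "weather"),
    (3600, "cheap loan"), (3700, "fast cash"),
    (7200, "cheap loan"), (7300, "fast cash"),
    (90000, "weather"),
]
seeds = {"cheap loan"}
flagged = expand_from_seeds(logs, seeds)
```

In this toy log, "fast cash" shares all three hourly windows with the seed and is flagged, whereas "weather" shares a single window and is filtered out by the shared-window requirement.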

Author information

Authors and Affiliations

Network Technology Research Center, Institute of Computing Technology, Chinese Academy of Sciences, Beijing, 100190, China
Dong-Hui Yang & Zhen-Yu Li
University of Chinese Academy of Sciences, Beijing, 100049, China
Dong-Hui Yang, Zhen-Yu Li & Gao-Gang Xie
Global Energy Interconnection Research Institute Co., Ltd., Beijing, 102209, China
Xiao-Hui Wang
LISTIC Laboratory of Computer Science, Systems, Information and Knowledge Processing, Université Savoie Mont Blanc, Chambéry, 73011, France
Kavé Salamatian
Computer Network Information Center, Chinese Academy of Sciences, Beijing, 100190, China
Gao-Gang Xie

Corresponding author

Correspondence to Zhen-Yu Li.

Supplementary Information

ESM 1

(PDF 111 kb)

About this article

Cite this article

Yang, DH., Li, ZY., Wang, XH. et al. Exploiting the Community Structure of Fraudulent Keywords for Fraud Detection in Web Search. J. Comput. Sci. Technol. 36, 1167–1183 (2021). https://doi.org/10.1007/s11390-021-0218-2

Received: 11 December 2019
Accepted: 01 July 2021
Published: 30 September 2021
Issue Date: October 2021
DOI: https://doi.org/10.1007/s11390-021-0218-2
