Fighting against web spam: a novel propagation method based on click-through data

Authors:

Kuo Zhang

SIGIR '12: Proceedings of the 35th international ACM SIGIR conference on Research and development in information retrieval

Pages 395 - 404

https://doi.org/10.1145/2348283.2348338

Published: 12 August 2012

Abstract

Combating Web spam is one of the greatest challenges for Web search engines. State-of-the-art anti-spam techniques focus mainly on detecting particular spam strategies, such as content spamming and link-based spamming. Although these approaches have had much success, they struggle against a continuous barrage of new spamming techniques. We attempt to solve the problem from a new perspective, by noticing that queries that are more likely to lead to spam pages/sites have the following characteristics: 1) they are popular or reflect heavy demand from search engine users and 2) there are usually few key resources or authoritative results for them. From these observations, we propose a novel method based on click-through data analysis that propagates a spamicity score iteratively between queries and URLs, starting from a few seed pages/sites. Once we obtain the seed pages/sites, we use the link structure of the click-through bipartite graph to discover other pages/sites that are likely to be spam. Experiments show that our algorithm is both efficient and effective in detecting Web spam. Moreover, combining our method with popular anti-spam techniques such as TrustRank performs better than either technique used alone.
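The abstract does not give the update rule, so the following is only a minimal sketch of what iterative score propagation over a click-through bipartite graph could look like. It assumes a click-weighted averaging step in each direction, seed URLs clamped to a score of 1.0, and a damping factor so that scores fade with distance from the seeds. The function name `propagate_spamicity`, the `damping` and `iterations` parameters, and the sample queries and hosts are all hypothetical and are not taken from the paper.

```python
from collections import defaultdict


def propagate_spamicity(clicks, seed_urls, iterations=20, damping=0.85):
    """Spread a spam score between queries and URLs (illustrative sketch only).

    clicks: iterable of (query, url, click_count) tuples.
    seed_urls: set of URLs already labeled as spam.
    The averaging rule and the damping factor are assumptions, not the
    update rule from the paper.
    """
    query_edges = defaultdict(dict)  # query -> {url: clicks}
    url_edges = defaultdict(dict)    # url -> {query: clicks}
    for query, url, count in clicks:
        if count <= 0:
            continue  # ignore empty edges so the weight totals stay positive
        query_edges[query][url] = query_edges[query].get(url, 0) + count
        url_edges[url][query] = url_edges[url].get(query, 0) + count

    url_score = {u: (1.0 if u in seed_urls else 0.0) for u in url_edges}
    query_score = {q: 0.0 for q in query_edges}

    for _ in range(iterations):
        # A query inherits the click-weighted mean score of the URLs clicked for it.
        for q, nbrs in query_edges.items():
            total = sum(nbrs.values())
            query_score[q] = sum(c * url_score[u] for u, c in nbrs.items()) / total
        # A non-seed URL inherits the damped click-weighted mean of its queries.
        for u, nbrs in url_edges.items():
            if u in seed_urls:
                continue  # seed scores stay clamped at 1.0
            total = sum(nbrs.values())
            mean = sum(c * query_score[q] for q, c in nbrs.items()) / total
            url_score[u] = damping * mean
    return query_score, url_score


# Hypothetical click log: one known spam host and three unlabeled ones.
sample_clicks = [
    ("free ringtones", "spam-a.example", 8),
    ("free ringtones", "spam-b.example", 2),
    ("cheap pills", "spam-b.example", 5),
    ("weather", "weather.example", 10),
]
queries, urls = propagate_spamicity(sample_clicks, {"spam-a.example"})
for name, score in sorted(urls.items(), key=lambda kv: -kv[1]):
    print(f"{name}\t{score:.3f}")
```

In this sketch a URL that shares queries with a seed picks up a nonzero score, while a URL in a disconnected part of the graph stays at zero. Without the damping factor, every node connected to a seed would drift toward 1.0 as the iteration count grows, which is why the sketch includes it.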

Index Terms

1. Information systems
  1. Information retrieval

Published In

SIGIR '12: Proceedings of the 35th international ACM SIGIR conference on Research and development in information retrieval

August 2012

1236 pages

ISBN: 9781450314725

DOI: 10.1145/2348283

General Chair: William Hersh (Oregon Health & Science University, USA)

Program Chairs: Jamie Callan (Carnegie Mellon University, USA), Yoelle Maarek (Yahoo! Research, Israel), Mark Sanderson (Royal Melbourne Institute of Technology, Australia)

Copyright © 2012 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from ACM.

Sponsors

SIGIR: ACM Special Interest Group on Information Retrieval

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article

Conference

SIGIR '12

Sponsor:

SIGIR

SIGIR '12: The 35th International ACM SIGIR conference on research and development in Information Retrieval

August 12 - 16, 2012

Portland, Oregon, USA

Acceptance Rates

Overall Acceptance Rate 792 of 3,983 submissions, 20%
