DOI: 10.1145/3366424.3385773

Scalable Anti-TrustRank with Qualified Site-level Seeds for Link-based Web Spam Detection

Published: 20 April 2020

Abstract

Web spam detection is one of the most important and challenging tasks in web search. Since web spam pages tend to have many spurious links, many web spam detection algorithms exploit the hyperlink structure between web pages to detect spam. In this paper, we conduct a comprehensive analysis of the link structure of web spam using real-world web graphs to systematically investigate the characteristics of link-based web spam. By exploring the structure of the page-level graph as well as the site-level graph, we propose a scalable site-level seeding methodology for the Anti-TrustRank (ATR) algorithm. The key idea is to map each website into a feature space where we learn a classifier that prioritizes the websites, so that we can effectively select a set of good seeds for the ATR algorithm. With this seeding method, the ATR algorithm detects more spam pages than any of the competitive baseline methods. Furthermore, we design work-efficient asynchronous ATR algorithms that significantly reduce the computational cost of the traditional ATR algorithm without degrading spam-detection performance, while still guaranteeing convergence.
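For readers unfamiliar with ATR, the following is a minimal sketch of the traditional, synchronous form of the idea: distrust is injected at known spam seeds and propagated backward along hyperlinks, which amounts to a seed-personalized PageRank on the reversed graph. It is not the site-level seeding method or the asynchronous algorithm proposed in the paper. The function name `anti_trust_rank`, the damping value `alpha=0.85`, the handling of pages without in-links, and the toy graph are all assumptions made for illustration.

```python
def anti_trust_rank(out_links, spam_seeds, alpha=0.85, tol=1e-10, max_iter=1000):
    """Synchronous power-iteration sketch of Anti-TrustRank (illustrative only).

    out_links:  dict mapping each page to the list of pages it links to.
    spam_seeds: pages already known to be spam; distrust is injected here.
    Returns a dict of distrust scores that sum to 1.
    """
    nodes = set(out_links)
    for targets in out_links.values():
        nodes.update(targets)

    # Keep only seeds that occur in the graph, without duplicates.
    seeds = list(dict.fromkeys(s for s in spam_seeds if s in nodes))
    if not seeds:
        raise ValueError("need at least one seed that appears in the graph")

    # Reverse the graph: distrust flows from a page to the pages linking to it.
    in_links = {v: [] for v in nodes}
    for u, targets in out_links.items():
        for v in targets:
            in_links[v].append(u)

    # Teleport distribution: uniform over the spam seeds.
    teleport = {v: 0.0 for v in nodes}
    for s in seeds:
        teleport[s] = 1.0 / len(seeds)

    score = dict(teleport)
    for _ in range(max_iter):
        new = {v: (1.0 - alpha) * teleport[v] for v in nodes}
        dangling = 0.0
        for v in nodes:
            sources = in_links[v]
            if sources:
                # Split v's distrust evenly among the pages that link to v.
                share = alpha * score[v] / len(sources)
                for u in sources:
                    new[u] += share
            else:
                # Assumption: mass at pages nobody links to returns to the seeds.
                dangling += alpha * score[v]
        for s in seeds:
            new[s] += dangling / len(seeds)

        delta = sum(abs(new[v] - score[v]) for v in nodes)
        score = new
        if delta < tol:
            break
    return score


# Hypothetical toy graph: "b" links to "a", which links to a known spam page.
toy_graph = {"a": ["spam"], "b": ["a"], "c": ["good"], "good": ["c"], "spam": []}
toy_scores = anti_trust_rank(toy_graph, ["spam"])
ranked = sorted(toy_scores, key=toy_scores.get, reverse=True)
```

In this toy graph the pages that reach the spam seed through links ("a", then "b") receive positive distrust, while the isolated pair "c" and "good" receives none. Pages with high scores would then be flagged as spam candidates; how the seeds are chosen is the part the paper addresses with its site-level classifier.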

Published In

WWW '20: Companion Proceedings of the Web Conference 2020
April 2020
854 pages
ISBN: 9781450370240
DOI: 10.1145/3366424

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Publisher

Association for Computing Machinery
New York, NY, United States

Author Tags

1. Anti-TrustRank
2. Link Analysis
3. Seeds
4. Web Spam Detection

Qualifiers

• Research-article
• Research
• Refereed limited

Conference

WWW '20: The Web Conference 2020
April 20 - 24, 2020
Taipei, Taiwan

Acceptance Rates

Overall Acceptance Rate 1,899 of 8,196 submissions, 23%

Article Metrics

• Downloads (last 12 months): 16
• Downloads (last 6 weeks): 3

Reflects downloads up to 02 Mar 2025

Cited By

• (2025) Misinformation Resilient Search Rankings with Webgraph-Based Interventions. ACM Transactions on Intelligent Systems and Technology 16:1 (1-27). DOI: 10.1145/3670410. Online publication date: 2-Jan-2025
• (2023) Adversarial Spam Detector With Character Similarity Network. IEEE Transactions on Industrial Informatics 19:3 (2541-2551). DOI: 10.1109/TII.2022.3177726. Online publication date: Mar-2023
• (2022) Distributed Triangle Approximately Counting Algorithms in Simple Graph Stream. ACM Transactions on Knowledge Discovery from Data 16:4 (1-43). DOI: 10.1145/3494562. Online publication date: 8-Jan-2022
• (2021) Exploiting the Community Structure of Fraudulent Keywords for Fraud Detection in Web Search. Journal of Computer Science and Technology 36:5 (1167-1183). DOI: 10.1007/s11390-021-0218-2. Online publication date: 30-Sep-2021
