DOI: 10.1145/2740908.2742127
research-article

Identification of Web Spam through Clustering of Website Structures

Published: 18 May 2015

Abstract

Spam websites are domains whose owners have no interest in using them as gateways for their own activities; instead, the domains are parked so that they can be sold on the secondary market for web domains. To turn the cost of the annual registration fees into a revenue opportunity, spam websites most often host a large number of ads, in the hope that someone who lands on the site by chance will click on some of them. Since parking has become a widespread activity, a large number of specialized companies have emerged and made parking a straightforward task that simply requires setting the domain's name servers appropriately.
Although parking is a legal activity, spam websites have a strongly negative impact on the information quality of the web and can significantly degrade the performance of most web mining tools. For example, these websites can influence search engine results or place an extra burden on crawling systems. In addition, spam websites represent a cost for ad bidders, who are obliged to pay for impressions or clicks that have a negligible probability of producing revenue.
In this paper, we experimentally show that spam websites hosted by the same service provider tend to have a similar look and feel. Exploiting this structural similarity, we address the problem of automatically identifying spam websites. In addition, we use the outcome of the classification to compile a list of the name servers used by spam websites, so that such sites can be discarded right after the first DNS query, before the first connection is made. A dump of our dataset (including web pages and meta information) and the corresponding manual classification is freely available upon request.
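To make the idea of structural similarity concrete, the following is a minimal sketch and not the method used in the paper. It assumes that a page's structure can be approximated by its sequence of opening HTML tags, that similarity can be measured with Python's `difflib.SequenceMatcher`, and that a simple greedy threshold clustering is sufficient for illustration. The function names, the `threshold` value, and the `name_servers` mapping are all hypothetical.

```python
from difflib import SequenceMatcher
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Records the sequence of opening tags, ignoring text and attributes."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


def tag_sequence(html):
    """Return the list of opening tag names in document order."""
    collector = TagCollector()
    collector.feed(html)
    collector.close()
    return collector.tags


def structural_similarity(html_a, html_b):
    """Similarity in [0, 1] between the tag sequences of two pages."""
    matcher = SequenceMatcher(
        None, tag_sequence(html_a), tag_sequence(html_b), autojunk=False
    )
    return matcher.ratio()


def cluster_by_structure(pages, threshold=0.8):
    """Greedy clustering (an assumption made for this sketch).

    `pages` maps a domain name to its HTML. A page joins the first cluster
    whose representative (its first member) is at least `threshold` similar;
    otherwise it starts a new cluster.
    """
    clusters = []
    for domain, html in pages.items():
        for cluster in clusters:
            if structural_similarity(pages[cluster[0]], html) >= threshold:
                cluster.append(domain)
                break
        else:
            clusters.append([domain])
    return clusters


def spam_name_servers(spam_domains, name_servers):
    """Collect the name servers of domains already labelled as spam.

    `name_servers` is a hypothetical mapping from domain to its NS records,
    obtained separately (for example from DNS queries).
    """
    servers = set()
    for domain in spam_domains:
        servers.update(name_servers.get(domain, ()))
    return servers
```

Under these assumptions, two parked pages built from the same template yield identical tag sequences even when their ad text differs, so they fall into the same cluster. Once a cluster has been labelled as spam, the set returned by `spam_name_servers` could serve as a blocklist that is checked against the NS records of an unseen domain before any page is fetched.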

Cited By

  • (2023) Double-Constrained Consensus Clustering with Application to Online Anti-Counterfeiting. Applied Sciences 13:18 (10050). DOI: 10.3390/app131810050. Online publication date: 6-Sep-2023
  • (2018) Web Crawling and Processing with Limited Resources for Business Intelligence and Analytics Applications. Journal of Software 13:5 (300-316). DOI: 10.17706/jsw.13.5.300-316. Online publication date: May-2018
  • (2017) A systematic framework to discover pattern for web spam classification. 2017 8th IEEE Annual Information Technology, Electronics and Mobile Communication Conference (IEMCON) (32-39). DOI: 10.1109/IEMCON.2017.8117135. Online publication date: Oct-2017

    Published In

    WWW '15 Companion: Proceedings of the 24th International Conference on World Wide Web
    May 2015
    1602 pages
    ISBN:9781450334730
    DOI:10.1145/2740908

    Sponsors

    • IW3C2: International World Wide Web Conference Committee

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. adversarial information retrieval
    2. web spam

    Acceptance Rates

    Overall Acceptance Rate 1,899 of 8,196 submissions, 23%
