DOI: 10.1145/1531914.1531917
research-article

A study of link farm distribution and evolution using a time series of web snapshots

Published: 21 April 2009

ABSTRACT

In this paper, we study the overall structure of link-based spam and its evolution, which should help in developing robust analysis tools and in researching Web spamming as a social activity in cyberspace. First, we use strongly connected component (SCC) decomposition to separate many link farms from the largest SCC, the so-called core. We show that denser link farms in the core can be extracted by node filtering and recursive application of SCC decomposition to the core. Surprisingly, we find new large link farms in each iteration, and this trend continues for at least 10 iterations. In addition, we measure the spamicity of such link farms. Next, we examine the evolution of link farms over two years. The results show that almost all large link farms no longer grow, while some of them shrink, and that many large link farms are created in one year.
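The recursive procedure described in the abstract can be illustrated with a minimal Python sketch. The abstract does not state the node-filtering criterion, so the degree threshold used here (`min_degree`) is purely an assumption, and the function names `strongly_connected_components` and `peel_link_farm_candidates` are hypothetical. The sketch decomposes the graph into SCCs, records every non-core SCC with more than one node as a link farm candidate, filters low-degree nodes out of the core, and repeats on what remains.

```python
from collections import Counter, defaultdict


def strongly_connected_components(nodes, edges):
    """Kosaraju's algorithm, written iteratively; returns a list of node sets."""
    fwd, rev = defaultdict(list), defaultdict(list)
    for u, v in edges:
        fwd[u].append(v)
        rev[v].append(u)

    # Pass 1: record nodes in order of DFS completion on the forward graph.
    order, seen = [], set()
    for root in nodes:
        if root in seen:
            continue
        seen.add(root)
        stack = [(root, iter(fwd[root]))]
        while stack:
            node, successors = stack[-1]
            for nxt in successors:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append((nxt, iter(fwd[nxt])))
                    break
            else:  # all successors explored: this node is finished
                order.append(node)
                stack.pop()

    # Pass 2: sweep the reversed graph in decreasing completion order.
    components, assigned = [], set()
    for root in reversed(order):
        if root in assigned:
            continue
        assigned.add(root)
        component, todo = {root}, [root]
        while todo:
            node = todo.pop()
            for prev in rev[node]:
                if prev not in assigned:
                    assigned.add(prev)
                    component.add(prev)
                    todo.append(prev)
        components.append(component)
    return components


def peel_link_farm_candidates(nodes, edges, min_degree=2, max_iterations=10):
    """Return (iteration, component) pairs for non-core SCCs with > 1 node.

    ASSUMPTION: the filter drops core nodes whose in-degree or out-degree
    inside the core is below `min_degree`; the paper's actual criterion
    is not given in the abstract.
    """
    nodes = set(nodes)
    edges = [(u, v) for u, v in edges if u in nodes and v in nodes]
    found = []
    for iteration in range(1, max_iterations + 1):
        components = strongly_connected_components(nodes, edges)
        if not components:
            break
        core = max(components, key=len)  # the largest SCC is the core
        found.extend((iteration, c) for c in components
                     if c is not core and len(c) > 1)

        # Node filtering inside the core (assumed degree threshold).
        core_edges = [(u, v) for u, v in edges if u in core and v in core]
        indeg, outdeg = Counter(), Counter()
        for u, v in core_edges:
            outdeg[u] += 1
            indeg[v] += 1
        kept = {n for n in core
                if indeg[n] >= min_degree and outdeg[n] >= min_degree}
        if len(kept) < 2 or kept == nodes:
            break  # nothing left to split, or no progress was made
        nodes = kept
        edges = [(u, v) for u, v in core_edges if u in kept and v in kept]
    return found
```

Calling `peel_link_farm_candidates(nodes, edges)` with an iterable of node identifiers and an iterable of `(source, target)` pairs returns the candidates together with the iteration in which each one separated from the core. A component that only appears in a later iteration was held inside the core by low-degree connector nodes until the filter removed them.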


      • Published in

        ACM Other conferences
        AIRWeb '09: Proceedings of the 5th International Workshop on Adversarial Information Retrieval on the Web
        April 2009
        67 pages
        ISBN: 9781605584386
        DOI: 10.1145/1531914

        Copyright © 2009 ACM

        Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

        Publisher

        Association for Computing Machinery

        New York, NY, United States
