ABSTRACT
In this paper, we study the overall structure of link-based spam and its evolution, which should help in developing robust analysis tools and in studying Web spamming as a social activity in cyberspace. First, we use strongly connected component (SCC) decomposition to separate many link farms from the largest SCC, the so-called core. We show that denser link farms in the core can be extracted by node filtering and recursive application of SCC decomposition to the core. Surprisingly, we find new large link farms in each iteration, and this trend continues for at least 10 iterations. In addition, we measure the spamicity of these link farms. Next, we examine the evolution of link farms over two years. The results show that almost all large link farms stop growing, some of them shrink, and many large link farms are created in one year.
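The extraction loop described in the abstract (decompose into SCCs, set aside the non-core components, filter nodes in the core, repeat) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation. The abstract does not state the filtering rule, so the sketch assumes one: keep only core nodes whose in-degree and out-degree within the core both exceed a threshold that rises by one each round. The function names, the `min_size` cutoff, and the use of Kosaraju's algorithm are likewise assumptions made for illustration, and the spamicity measurement is not shown.

```python
from collections import defaultdict


def strongly_connected_components(nodes, edges):
    """Kosaraju's algorithm (iterative). Returns a list of node sets."""
    nodes = set(nodes)
    fwd, rev = defaultdict(list), defaultdict(list)
    for u, v in edges:
        if u in nodes and v in nodes:  # ignore edges leaving the subgraph
            fwd[u].append(v)
            rev[v].append(u)

    # Pass 1: record nodes in order of DFS completion on the forward graph.
    order, seen = [], set()
    for root in nodes:
        if root in seen:
            continue
        seen.add(root)
        stack = [(root, iter(fwd[root]))]
        while stack:
            node, successors = stack[-1]
            for nxt in successors:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append((nxt, iter(fwd[nxt])))
                    break
            else:  # all successors explored: the node is finished
                order.append(node)
                stack.pop()

    # Pass 2: sweep the reversed graph in decreasing finish time.
    components, assigned = [], set()
    for root in reversed(order):
        if root in assigned:
            continue
        assigned.add(root)
        comp, stack = {root}, [root]
        while stack:
            node = stack.pop()
            for prev in rev[node]:
                if prev not in assigned:
                    assigned.add(prev)
                    comp.add(prev)
                    stack.append(prev)
        components.append(comp)
    return components


def peel_link_farm_candidates(nodes, edges, min_size=3, max_rounds=10):
    """Repeatedly split off non-core SCCs, then filter the core.

    Returns (candidates, core): candidates is a list of (round, node_set)
    pairs, and core is the largest SCC found in the last round.
    The degree-threshold filter is an ASSUMED rule, not taken from the paper.
    `edges` must be a re-iterable collection of unique (source, target) pairs.
    """
    current, candidates, core = set(nodes), [], set()
    for round_no in range(1, max_rounds + 1):
        comps = strongly_connected_components(current, edges)
        if not comps:
            break
        core = max(comps, key=len)
        candidates += [(round_no, c) for c in comps
                       if c is not core and len(c) >= min_size]
        # Assumed node filter: count degrees inside the core only.
        indeg, outdeg = defaultdict(int), defaultdict(int)
        for u, v in edges:
            if u in core and v in core and u != v:
                outdeg[u] += 1
                indeg[v] += 1
        current = {n for n in core
                   if indeg[n] > round_no and outdeg[n] > round_no}
    return candidates, core
```

Called as `peel_link_farm_candidates(hosts, links)` on a host graph, the sketch returns the components split off in each round together with the remaining core. Raising the threshold every round is one simple way to make each pass stricter than the last; any other filter could be substituted at the marked step.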
Index Terms
- A study of link farm distribution and evolution using a time series of web snapshots