A study of link farm distribution and evolution using a time series of web snapshots

Authors:
Young-joo Chung, University of Tokyo, Tokyo, Japan
Masashi Toyoda, University of Tokyo, Tokyo, Japan
Masaru Kitsuregawa, University of Tokyo, Tokyo, Japan

AIRWeb '09: Proceedings of the 5th International Workshop on Adversarial Information Retrieval on the Web, April 2009, Pages 9–16
https://doi.org/10.1145/1531914.1531917

Published: 21 April 2009

ABSTRACT

In this paper, we study the overall structure of link-based spam and its evolution, which should help in developing robust analysis tools and in researching Web spamming as a social activity in cyberspace. First, we use strongly connected component (SCC) decomposition to separate many link farms from the largest SCC, the so-called core. We show that denser link farms in the core can be extracted by node filtering and recursive application of SCC decomposition to the core. Surprisingly, we find new large link farms during each iteration, and this trend continues for at least 10 iterations. In addition, we measure the spamicity of such link farms. Next, we examine the evolution of link farms over two years. Results show that almost all large link farms no longer grow, while some of them shrink, and that many large link farms are created in one year.
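The recursive decomposition described in the abstract can be illustrated with a minimal sketch. The abstract does not state the filtering criterion, so the rule used here (dropping core nodes whose in-degree or out-degree inside the core falls below a threshold that grows by one per iteration) is an assumption, as are the names `strongly_connected_components`, `peel_link_farm_candidates`, `max_iterations`, and `min_size`. The SCC routine is a standard iterative Kosaraju implementation, not necessarily the one the authors used.

```python
from collections import defaultdict


def strongly_connected_components(nodes, edges):
    """Iterative Kosaraju algorithm on the subgraph induced by `nodes`.

    Returns a list of sets, one per strongly connected component.
    """
    nodes = set(nodes)
    fwd, rev = defaultdict(list), defaultdict(list)
    for u, v in edges:
        if u in nodes and v in nodes:
            fwd[u].append(v)
            rev[v].append(u)

    # Pass 1: record nodes in order of DFS finishing time (forward graph).
    order, seen = [], set()
    for start in nodes:
        if start in seen:
            continue
        seen.add(start)
        stack = [(start, iter(fwd[start]))]
        while stack:
            node, successors = stack[-1]
            for nxt in successors:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append((nxt, iter(fwd[nxt])))
                    break
            else:
                # All successors explored: the node is finished.
                order.append(node)
                stack.pop()

    # Pass 2: DFS on the reversed graph, latest-finished nodes first.
    components, assigned = [], set()
    for start in reversed(order):
        if start in assigned:
            continue
        assigned.add(start)
        component, stack = {start}, [start]
        while stack:
            node = stack.pop()
            for nxt in rev[node]:
                if nxt not in assigned:
                    assigned.add(nxt)
                    component.add(nxt)
                    stack.append(nxt)
        components.append(component)
    return components


def peel_link_farm_candidates(nodes, edges, max_iterations=10, min_size=2):
    """Return, per iteration, the non-core SCCs with at least `min_size` nodes.

    Each iteration decomposes the current node set, treats the largest SCC
    as the core, records the remaining SCCs as link farm candidates, then
    filters the core before the next round.
    ASSUMPTION: the degree-threshold filter is illustrative only.
    """
    current = set(nodes)
    levels = []
    for iteration in range(max_iterations):
        if not current:
            break
        components = strongly_connected_components(current, edges)
        core = max(components, key=len)
        levels.append(
            [c for c in components if c is not core and len(c) >= min_size]
        )

        # Hypothetical filter: threshold is 2 in the first round, then 3, ...
        threshold = iteration + 2
        indeg, outdeg = defaultdict(int), defaultdict(int)
        for u, v in edges:
            if u in core and v in core:
                outdeg[u] += 1
                indeg[v] += 1
        current = {
            n for n in core
            if indeg[n] >= threshold and outdeg[n] >= threshold
        }
    return levels
```

Calling `peel_link_farm_candidates(nodes, edges)` on a host graph returns one list of candidate components per iteration. Whether a candidate is actually a link farm still has to be judged separately, for example by the spamicity measurement mentioned in the abstract. Duplicate edges in `edges` would inflate the degree counts, so the sketch assumes each edge appears once.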

Index Terms

1. Information systems
  1. Information retrieval
  2. Information storage systems

Published in
AIRWeb '09: Proceedings of the 5th International Workshop on Adversarial Information Retrieval on the Web
April 2009
67 pages
ISBN: 9781605584386
DOI: 10.1145/1531914
Editors: Dennis Fetterly (Microsoft Research), Zoltán Gyöngyi (Google Research)
Copyright © 2009 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
information retrieval
link analysis
web spam
Qualifiers
- research-article