
Detecting splogs via temporal dynamics using self-similarity analysis

Authors:
Yu-Ru Lin, Arizona State University, AZ
Hari Sundaram, Arizona State University, AZ
Yun Chi, NEC Laboratories America, Cupertino, CA
Junichi Tatemura, NEC Laboratories America, Cupertino, CA
Belle L. Tseng, NEC Laboratories America, Cupertino, CA

ACM Transactions on the Web, Volume 2, Issue 1, Article No. 4, pp. 1–35. https://doi.org/10.1145/1326561.1326565

Published: 03 March 2008

Abstract

This article addresses the problem of spam blog (splog) detection using temporal and structural regularity of content, post time and links. Splogs are undesirable blogs meant to attract search engine traffic, used solely for promoting affiliate sites. Blogs represent popular online media, and splogs not only degrade the quality of search engine results, but also waste network resources. The splog detection problem is made difficult due to the lack of stable content descriptors.

We have developed a new technique for detecting splogs, based on the observation that a blog is a dynamic, growing sequence of entries (or posts) rather than a collection of individual pages. In our approach, splogs are recognized by their temporal characteristics and content. There are three key ideas in our splog detection framework. (a) We represent the blog temporal dynamics using self-similarity matrices defined on the histogram intersection similarity measure of the time, content, and link attributes of posts, to investigate the temporal changes of the post sequence. (b) We study the blog temporal characteristics using a visual representation derived from the self-similarity measures. The visual signature reveals correlations between attributes and posts that depend on the type of blog (normal blog or splog). (c) We propose two types of novel temporal features to capture the splog temporal characteristics. In our splog detector, these novel features are combined with content-based features. We extract a content-based feature vector from blog home pages as well as from different parts of the blog. The dimensionality of the feature vector is reduced by Fisher linear discriminant analysis. We have tested an SVM-based splog detector using the proposed features on real-world datasets, with appreciable results (90% accuracy).
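The self-similarity idea in (a) can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes each post is reduced to a whitespace-tokenized word histogram, and it uses one common form of histogram intersection (the sum of bin-wise minima over normalized histograms). The function names and the toy posts are hypothetical, and only the content attribute is shown; time and link attributes would need their own histograms.

```python
from collections import Counter


def normalize(counts):
    """Scale a term-count mapping so its values sum to 1 (empty stays empty)."""
    total = sum(counts.values())
    return {k: v / total for k, v in counts.items()} if total else {}


def histogram_intersection(h1, h2):
    """Sum of bin-wise minima of two normalized histograms; lies in [0, 1]."""
    return sum(min(v, h2.get(k, 0.0)) for k, v in h1.items())


def self_similarity_matrix(posts):
    """Return S where S[i][j] compares post i with post j in sequence order."""
    # Assumption: a post's content is summarized by its word histogram.
    hists = [normalize(Counter(p.lower().split())) for p in posts]
    n = len(hists)
    return [[histogram_intersection(hists[i], hists[j]) for j in range(n)]
            for i in range(n)]


# Hypothetical post sequence: two identical machine-like posts, one unrelated.
posts = [
    "cheap pills buy now",
    "cheap pills buy now",
    "hiking trip photos from the weekend",
]
S = self_similarity_matrix(posts)
```

In this toy sequence the two repeated posts yield a high off-diagonal entry while the unrelated post yields a zero entry, which is the kind of regularity a matrix over the post sequence is meant to expose.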



Published in

ACM Transactions on the Web Volume 2, Issue 1
February 2008
280 pages
ISSN: 1559-1131
EISSN: 1559-114X
DOI: 10.1145/1326561

Copyright © 2008 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 3 March 2008
- Accepted: 1 October 2007
- Revised: 1 September 2007
- Received: 1 April 2007
Author Tags
Blogs
regularity
self-similarity
spam
splog detection
temporal dynamics
topology
Qualifiers
- research-article
- Research
- Refereed
