Abstract
This article addresses spam blog (splog) detection using the temporal and structural regularity of content, post time, and links. Splogs are undesirable blogs created to attract search engine traffic and used solely to promote affiliate sites. Because blogs are a popular online medium, splogs not only degrade the quality of search engine results but also waste network resources. Splog detection is difficult because stable content descriptors are lacking.
We have developed a new technique for detecting splogs, based on the observation that a blog is a dynamic, growing sequence of entries (or posts) rather than a collection of individual pages. In our approach, splogs are recognized by their temporal characteristics and content. Our splog detection framework rests on three key ideas. (a) We represent blog temporal dynamics using self-similarity matrices, defined on the histogram intersection similarity measure of the time, content, and link attributes of posts, to investigate temporal changes in the post sequence. (b) We study blog temporal characteristics using a visual representation derived from the self-similarity measures. This visual signature reveals correlations between attributes and posts that depend on the type of blog (normal blog or splog). (c) We propose two types of novel temporal features to capture the temporal characteristics of splogs. In our splog detector, these novel features are combined with content-based features. We extract a content-based feature vector from blog home pages as well as from different parts of the blog. The dimensionality of the feature vector is reduced by Fisher linear discriminant analysis. We have tested an SVM-based splog detector using the proposed features on real-world datasets, with appreciable results (90% accuracy).
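As a rough illustration of idea (a), the sketch below builds a self-similarity matrix over a time-ordered post sequence using histogram intersection. It is not the authors' implementation: the choice of word tokens as histogram bins, the function names, and the toy posts are all assumptions made for this example, and only the content attribute is shown (time and link attributes would need their own histograms).

```python
from collections import Counter


def normalized_histogram(tokens):
    """Turn a post's tokens into a histogram whose values sum to 1."""
    counts = Counter(tokens)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {token: n / total for token, n in counts.items()}


def histogram_intersection(h1, h2):
    """Sum of bin-wise minima: 1.0 for identical non-empty normalized
    histograms, 0 when the two histograms share no bins."""
    return sum(min(p, h2[token]) for token, p in h1.items() if token in h2)


def self_similarity_matrix(posts):
    """Return S where S[i][j] is the similarity of post i and post j.

    `posts` is assumed to be ordered by post time, so structure along the
    diagonals of S reflects how the blog changes (or repeats) over time.
    """
    hists = [normalized_histogram(p) for p in posts]
    return [[histogram_intersection(a, b) for b in hists] for a in hists]


# Hypothetical toy data: each post is a list of content tokens.
repetitive_posts = [
    ["buy", "cheap", "pills"],
    ["buy", "cheap", "pills"],
    ["buy", "cheap", "watches"],
]
varied_posts = [
    ["went", "hiking", "today"],
    ["new", "recipe", "ideas"],
    ["hiking", "photos", "posted"],
]

S_repetitive = self_similarity_matrix(repetitive_posts)
S_varied = self_similarity_matrix(varied_posts)
```

In this toy example the repetitive sequence yields large off-diagonal entries while the varied sequence yields mostly small ones. Features for a classifier would be derived from such matrices, but how that is done is not specified in the abstract.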