research-article

Detecting splogs via temporal dynamics using self-similarity analysis

Published: 03 March 2008

Abstract

This article addresses the problem of spam blog (splog) detection using temporal and structural regularity of content, post time and links. Splogs are undesirable blogs meant to attract search engine traffic, used solely for promoting affiliate sites. Blogs represent popular online media, and splogs not only degrade the quality of search engine results, but also waste network resources. The splog detection problem is made difficult due to the lack of stable content descriptors.

We have developed a new technique for detecting splogs, based on the observation that a blog is a dynamic, growing sequence of entries (or posts) rather than a collection of individual pages. In our approach, splogs are recognized by their temporal characteristics and content. There are three key ideas in our splog detection framework. (a) We represent the temporal dynamics of a blog using self-similarity matrices, defined on the histogram intersection similarity measure of the time, content, and link attributes of posts, to investigate temporal changes in the post sequence. (b) We study the temporal characteristics of blogs using a visual representation derived from the self-similarity measures. The visual signature reveals correlations between attributes and posts that depend on the type of blog (normal blog or splog). (c) We propose two types of novel temporal features to capture the temporal characteristics of splogs. In our splog detector, these novel features are combined with content-based features. We extract a content-based feature vector from blog home pages as well as from different parts of the blog. The dimensionality of the feature vector is reduced by Fisher linear discriminant analysis. We have tested an SVM-based splog detector using the proposed features on real-world datasets, with appreciable results (90% accuracy).
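
As a rough illustration of idea (a), the sketch below builds a self-similarity matrix over a sequence of posts using histogram intersection on word histograms. It is a minimal sketch, not the authors' implementation: the abstract does not give the exact definition, so the normalization (each histogram sums to 1), the whitespace tokenization, and the function names normalized_histogram, histogram_intersection, and self_similarity_matrix are all assumptions made for this example. The same construction could in principle be applied to time or link attributes by swapping in a different histogram.

```python
from collections import Counter


def normalized_histogram(tokens):
    """Turn a list of tokens into a histogram whose values sum to 1.

    Assumption: normalizing to unit mass, so that the intersection
    of two histograms falls in the range [0, 1].
    """
    counts = Counter(tokens)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {tok: n / total for tok, n in counts.items()}


def histogram_intersection(h1, h2):
    """Sum of bin-wise minima of two normalized histograms."""
    return sum(min(w, h2.get(tok, 0.0)) for tok, w in h1.items())


def self_similarity_matrix(posts):
    """Entry [i][j] is the similarity between post i and post j.

    Posts are assumed to be given in chronological order, so
    structure along the diagonals reflects temporal regularity.
    """
    hists = [normalized_histogram(p) for p in posts]
    n = len(hists)
    return [
        [histogram_intersection(hists[i], hists[j]) for j in range(n)]
        for i in range(n)
    ]


# Hypothetical toy data: two identical spam-like posts and one unrelated post.
posts = [
    "cheap pills buy now".split(),
    "cheap pills buy now".split(),
    "hiking trip photos from the weekend".split(),
]
S = self_similarity_matrix(posts)
for row in S:
    print(["%.2f" % v for v in row])
```

In this toy input, the two repeated posts are maximally similar to each other and share nothing with the third, which is the kind of block pattern that repetitive, machine-generated content would be expected to produce in such a matrix.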

              • Published in

                ACM Transactions on the Web  Volume 2, Issue 1
                February 2008
                280 pages
                ISSN:1559-1131
                EISSN:1559-114X
                DOI:10.1145/1326561
                Copyright © 2008 ACM

                Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

                Publisher

                Association for Computing Machinery

                New York, NY, United States

                Publication History

                • Published: 3 March 2008
                • Accepted: 1 October 2007
                • Revised: 1 September 2007
                • Received: 1 April 2007

                Qualifiers

                • research-article
                • Research
                • Refereed
