DOI: 10.1145/2487788.2488140
research-article

Automatically generated spam detection based on sentence-level topic information

Published: 13 May 2013

ABSTRACT

Spammers use a wide range of content generation techniques to produce low-quality pages, known as content spam, to achieve their goals. We argue that content spam must be tackled using a wide range of content quality features. In this paper, we propose novel sentence-level diversity features based on a probabilistic topic model. We combine them with other content features to build a content spam classifier. Our experiments show that our method outperforms conventional methods.
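The abstract does not say which diversity features are used, so the following is only an illustrative sketch of what a sentence-level topic diversity feature might look like, not the authors' method. It assumes each sentence has already been assigned a topic distribution by some topic model (for example LDA). The function names, the feature names, and the choice of entropy and Jensen-Shannon divergence are all assumptions made for illustration.

```python
import math
from typing import Dict, List, Sequence


def entropy(p: Sequence[float]) -> float:
    """Shannon entropy (in bits) of a probability distribution."""
    return -sum(x * math.log2(x) for x in p if x > 0)


def js_divergence(p: Sequence[float], q: Sequence[float]) -> float:
    """Jensen-Shannon divergence (in bits) between two distributions."""
    m = [(a + b) / 2 for a, b in zip(p, q)]
    return entropy(m) - (entropy(p) + entropy(q)) / 2


def sentence_diversity_features(
    sentence_topics: List[Sequence[float]],
) -> Dict[str, float]:
    """Hypothetical diversity features from per-sentence topic distributions.

    sentence_topics[i] is the (assumed, precomputed) topic distribution
    of the i-th sentence; all distributions must have the same length.
    """
    if not sentence_topics:
        raise ValueError("at least one sentence is required")
    n = len(sentence_topics)
    k = len(sentence_topics[0])
    # Document-level distribution: average of the sentence distributions.
    mean_dist = [sum(s[i] for s in sentence_topics) / n for i in range(k)]
    # Topic shift between each pair of consecutive sentences.
    adjacent = [
        js_divergence(a, b)
        for a, b in zip(sentence_topics, sentence_topics[1:])
    ]
    return {
        "mean_sentence_entropy": sum(entropy(s) for s in sentence_topics) / n,
        "document_entropy": entropy(mean_dist),
        "mean_adjacent_jsd": sum(adjacent) / len(adjacent) if adjacent else 0.0,
    }


# A page whose sentences stay on one topic versus one that jumps between topics.
coherent = sentence_diversity_features([[1.0, 0.0], [1.0, 0.0]])
jumpy = sentence_diversity_features([[1.0, 0.0], [0.0, 1.0]])
```

In this sketch, a page whose consecutive sentences jump between unrelated topics gets a higher `mean_adjacent_jsd` than a topically coherent page. Such values could then be combined with other content features as input to a classifier, as the abstract describes.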

Published in

WWW '13 Companion: Proceedings of the 22nd International Conference on World Wide Web
May 2013
1636 pages
ISBN: 9781450320382
DOI: 10.1145/2487788

Copyright © 2013. Copyright is held by the International World Wide Web Conference Committee (IW3C2).

Publisher

Association for Computing Machinery, New York, NY, United States

Acceptance Rates

WWW '13 Companion paper acceptance rate: 831 of 1,250 submissions (66%)
Overall acceptance rate: 1,899 of 8,196 submissions (23%)
