ABSTRACT
Spammers use a wide range of content generation techniques to produce low-quality pages, known as content spam, in pursuit of their goals. We argue that content spam must be tackled using a wide range of content quality features. In this paper, we propose novel sentence-level diversity features based on a probabilistic topic model. We combine them with other content features to build a content spam classifier. Our experiments show that our method outperforms conventional methods.
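The abstract does not define the sentence-level diversity features, so the following is only an illustrative sketch of one plausible feature of that kind, not the authors' method. It assumes each sentence has already been assigned a topic distribution by some topic model (for example, LDA), and it scores a page by the mean Jensen-Shannon divergence between adjacent sentences. The names `js_divergence` and `topic_diversity` and the toy distributions are hypothetical.

```python
import math
from typing import List, Sequence


def js_divergence(p: Sequence[float], q: Sequence[float]) -> float:
    """Jensen-Shannon divergence (log base 2) between two discrete distributions."""
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(x: Sequence[float], y: Sequence[float]) -> float:
        # Terms with x_i == 0 contribute nothing; y_i > 0 whenever x_i > 0.
        return sum(a * math.log2(a / b) for a, b in zip(x, y) if a > 0)

    return (kl(p, m) + kl(q, m)) / 2


def topic_diversity(sentence_topics: List[Sequence[float]]) -> float:
    """Mean JS divergence between topic distributions of adjacent sentences.

    Assumption (not from the paper): abrupt topic shifts between neighbouring
    sentences suggest text stitched together from unrelated sources.
    """
    if len(sentence_topics) < 2:
        return 0.0
    pairs = zip(sentence_topics, sentence_topics[1:])
    divergences = [js_divergence(p, q) for p, q in pairs]
    return sum(divergences) / len(divergences)


# Toy per-sentence topic distributions over three topics (made-up values).
coherent_page = [[0.8, 0.1, 0.1], [0.7, 0.2, 0.1], [0.8, 0.1, 0.1]]
stitched_page = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

print(topic_diversity(coherent_page))
print(topic_diversity(stitched_page))
```

A score like this would be one column in the feature vector passed to the classifier alongside other content features. Whether high or low diversity indicates spam is an empirical question that this sketch does not settle.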
Index Terms
- Automatically generated spam detection based on sentence-level topic information