research-article

Automatically generated spam detection based on sentence-level topic information

Authors: Yoshihiko Suhara, Shuichi Nishioka, Seiji Susaki

WWW '13 Companion: Proceedings of the 22nd International Conference on World Wide Web

Pages 1157 - 1160

https://doi.org/10.1145/2487788.2488140

Published: 13 May 2013

Abstract

Spammers use a wide range of content generation techniques to produce low-quality pages, known as content spam, to achieve their goals. We argue that content spam must be tackled using a wide range of content quality features. In this paper, we propose novel sentence-level diversity features based on a probabilistic topic model. We combine them with other content features to build a content spam classifier. Our experiments show that our method outperforms conventional methods.
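As a rough illustration of what a sentence-level topic diversity feature could look like, the sketch below scores how much the topic distributions of a page's sentences differ from one another. It is not the authors' feature set, which the abstract does not specify: the function names, the choice of Jensen-Shannon divergence and entropy, and the assumption that some topic model (for example LDA) has already produced one topic distribution per sentence are all hypothetical.

```python
import math
from itertools import combinations
from typing import Dict, List, Sequence


def entropy(p: Sequence[float]) -> float:
    """Shannon entropy in bits; zero-probability topics contribute nothing."""
    return -sum(x * math.log2(x) for x in p if x > 0)


def js_divergence(p: Sequence[float], q: Sequence[float]) -> float:
    """Jensen-Shannon divergence in bits (0 = identical, 1 = disjoint support)."""
    m = [(a + b) / 2 for a, b in zip(p, q)]
    return entropy(m) - (entropy(p) + entropy(q)) / 2


def sentence_topic_diversity(sentence_topics: List[Sequence[float]]) -> Dict[str, float]:
    """Hypothetical diversity features for one page.

    sentence_topics holds one topic distribution per sentence, assumed to come
    from a separately trained topic model (an assumption of this sketch).
    """
    if not sentence_topics:
        raise ValueError("need at least one sentence")
    n = len(sentence_topics)
    k = len(sentence_topics[0])
    # Page-level topic mixture: the average of the sentence distributions.
    mean_dist = [sum(s[i] for s in sentence_topics) / n for i in range(k)]
    # Divergence between every pair of sentences, and between neighbours only.
    pairwise = [js_divergence(p, q) for p, q in combinations(sentence_topics, 2)]
    adjacent = [js_divergence(p, q) for p, q in zip(sentence_topics, sentence_topics[1:])]
    return {
        "mean_pairwise_js": sum(pairwise) / len(pairwise) if pairwise else 0.0,
        "mean_adjacent_js": sum(adjacent) / len(adjacent) if adjacent else 0.0,
        "page_topic_entropy": entropy(mean_dist),
    }


# Toy inputs: a page whose sentences stay on one topic, and one that jumps around.
coherent = [[0.9, 0.1], [0.8, 0.2], [0.9, 0.1]]
scattered = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
print(sentence_topic_diversity(coherent))
print(sentence_topic_diversity(scattered))
```

Under these assumptions, a page stitched together from unrelated sentences would tend to score higher on all three values than a page that stays on topic. The resulting numbers could be concatenated with other content features and passed to any standard classifier.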

Index Terms

Automatically generated spam detection based on sentence-level topic information
1. Information systems
  1. Information retrieval
    1. Retrieval tasks and goals
      1. Document filtering
      2. Information extraction

Published In

WWW '13 Companion: Proceedings of the 22nd International Conference on World Wide Web

May 2013, 1636 pages

ISBN: 9781450320382

DOI: 10.1145/2487788

General Chairs: Daniel Schwabe (PUC-Rio, Brazil), Virgílio Almeida (UFMG, Brazil), Hartmut Glaser (CGI.br, Brazil)

Program Chairs: Ricardo Baeza-Yates (Yahoo! Labs, Spain & Chile), Sue Moon (KAIST, South Korea)

Copyright © 2013. Copyright is held by the International World Wide Web Conference Committee (IW3C2).

Sponsors

NICBR: Núcleo de Informação e Coordenação do Ponto BR
CGIBR: Comitê Gestor da Internet no Brasil

In-Cooperation

SIGWEB: ACM Special Interest Group on Hypertext, Hypermedia, and Web

Publisher

Association for Computing Machinery

New York, NY, United States

Conference

WWW '13: 22nd International World Wide Web Conference, May 13 - 17, 2013, Rio de Janeiro, Brazil

Acceptance Rates

WWW '13 Companion Paper Acceptance Rate: 831 of 1,250 submissions, 66%

Overall Acceptance Rate: 1,899 of 8,196 submissions, 23%

Article Metrics

Total Citations: 8
Total Downloads: 232
Downloads (Last 12 months): 2
Downloads (Last 6 weeks): 0

Reflects downloads up to 17 Feb 2025

Cited By

Rani M, Sumathy S (2022). A Study on Diverse Methods and Performance Measures in Sentiment Analysis. Recent Patents on Engineering 16:3. Online publication date: May-2022.
https://doi.org/10.2174/1872212114999201019154954
Xu L, Zhang L, Luo W (2022). Pseudo Base Station Spam SMS Identification Based on BiLSTM-Attention. 2022 11th International Conference on Communications, Circuits and Systems (ICCCAS), pages 216-219. Online publication date: 13-May-2022.
https://doi.org/10.1109/ICCCAS55266.2022.9825128
Choi D, Park I, Shin M, Kim E, Shin D (2021). Korean Erroneous Sentence Classification With Integrated Eojeol Embedding. IEEE Access 9, pages 81778-81785. Online publication date: 2021.
https://doi.org/10.1109/ACCESS.2021.3085864
Asdaghi F, Soleimani A (2019). An effective feature selection method for web spam detection. Knowledge-Based Systems 166, pages 198-206. Online publication date: Feb-2019.
https://doi.org/10.1016/j.knosys.2018.12.026
Zhou X, Ouyang J, Li X (2018). Two time-efficient gibbs sampling inference algorithms for biterm topic model. Applied Intelligence 48:3, pages 730-754. Online publication date: 1-Mar-2018.
https://dl.acm.org/doi/10.1007/s10489-017-1004-2
Wei S, Zhu Y (2017). Cleaning Out Web Spam by Entropy-Based Cascade Outlier Detection. Database and Expert Systems Applications, pages 232-246. Online publication date: 2-Aug-2017.
https://doi.org/10.1007/978-3-319-64471-4_19
Karami A, Zhou L (2015). Exploiting latent content based features for the detection of static SMS spams. Proceedings of the American Society for Information Science and Technology 51:1, pages 1-4. Online publication date: 24-Apr-2015.
https://doi.org/10.1002/meet.2014.14505101157
Liu H, Zhang Y, Lin H, Wu J, Wu Z, Zhang X (2013). How Many Zombies Around You? 2013 IEEE 13th International Conference on Data Mining, pages 1133-1138. Online publication date: Dec-2013.
https://doi.org/10.1109/ICDM.2013.166
