Unsupervised Spam Detection by Document Complexity Estimation

Uemura, Takashi; Ikeda, Daisuke; Arimura, Hiroki

doi:10.1007/978-3-540-88411-8_30

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 5255)

Included in the following conference series:

International Conference on Discovery Science

Abstract

In this paper, we study content-based spam detection for a specific type of spam: blog and bulletin board spam. We develop an efficient unsupervised algorithm, DCE, that detects spam documents in a mixture of spam and non-spam documents using an entropy-like measure called the document complexity. Using suffix trees, the algorithm computes the document complexity of all documents in time linear in the total length of the input documents. Experimental results show that the algorithm works especially well for detecting word salad spam, which is believed to be difficult to detect automatically.
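
To make the idea of an entropy-like score concrete, the following is a minimal sketch and not the DCE algorithm itself. It scores each document by the average number of bits per character needed to encode it under a smoothed character n-gram model built from the other documents in the collection. The function names, the order `n`, the add-alpha smoothing, and the assumption that lower scores flag heavily duplicated text are all illustrative choices that do not come from the paper. The sketch also counts n-grams naively in a dictionary and rebuilds the model for every document, so it does not achieve the linear running time that the abstract attributes to suffix trees.

```python
import math
from collections import Counter


def ngram_counts(docs, n):
    """Count every character k-gram, k = 1..n, over all documents."""
    counts = Counter()
    for doc in docs:
        for k in range(1, n + 1):
            for i in range(len(doc) - k + 1):
                counts[doc[i:i + k]] += 1
    return counts


def complexity(doc, others, n=3, alpha=1.0, alphabet_size=256):
    """Average bits per character of `doc` under a model built from `others`.

    Hypothetical stand-in for an entropy-like measure: each character is
    predicted from the previous n-1 characters with add-alpha smoothing.
    """
    counts = ngram_counts(others, n)
    total_chars = sum(len(d) for d in others)
    bits = 0.0
    for i in range(len(doc)):
        context = doc[max(0, i - n + 1):i]
        gram = context + doc[i]
        numerator = counts[gram] + alpha
        # With an empty context, fall back to the total character count.
        context_count = counts[context] if context else total_chars
        denominator = context_count + alpha * alphabet_size
        bits -= math.log2(numerator / denominator)
    return bits / max(len(doc), 1)


def score_collection(docs, n=3):
    """Leave-one-out scores: each document is coded against all the others."""
    return [
        complexity(doc, docs[:i] + docs[i + 1:], n)
        for i, doc in enumerate(docs)
    ]
```

Calling `score_collection(["buy cheap pills now", "buy cheap pills now", "the weather in sapporo was cold today"])` gives the two identical documents the same score, and that score is lower than the score of the third document, because every n-gram of a duplicated document also occurs elsewhere in the collection. Whether a low or a high score should be treated as spam, and where to place the threshold, depends on the collection and is left open here.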

Author information

Authors and Affiliations

Hokkaido University, Kita 14, Nishi 9, Kita-ku, Sapporo, 060-0814, Japan
Takashi Uemura & Hiroki Arimura
Kyushu University, 744 Motooka Nishi-ku, Fukuoka, 819-0395, Japan
Daisuke Ikeda

Editor information

Editors and Affiliations

INSA Lyon, LIRIS CNRS UMR 5205, University of Lyon, 69621, Villeurbanne Cedex, France
Jean-François Boulicaut
Department of Computer and Information Science, University of Konstanz, Box M 712, 78457, Konstanz, Germany
Michael R. Berthold
University of Bonn and Fraunhofer IAIS, Schloss Birlinghoven, 53754, Sankt Augustin, Germany
Tamás Horváth

About this paper

Cite this paper

Uemura, T., Ikeda, D., Arimura, H. (2008). Unsupervised Spam Detection by Document Complexity Estimation. In: Boulicaut, JF., Berthold, M.R., Horváth, T. (eds) Discovery Science. DS 2008. Lecture Notes in Computer Science (LNAI), vol 5255. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-88411-8_30

DOI: https://doi.org/10.1007/978-3-540-88411-8_30
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-88410-1
Online ISBN: 978-3-540-88411-8
eBook Packages: Computer Science, Computer Science (R0)
