Abstract
In this paper, we study content-based spam detection for a specific type of spam, namely blog and bulletin board spam. We develop an efficient unsupervised algorithm, DCE, that detects spam documents in a mixture of spam and non-spam documents using an entropy-like measure called the document complexity. Using suffix trees, the algorithm computes the document complexity of all documents in time linear in the total length of the input documents. Experimental results show that the algorithm works especially well for detecting word salad spam, which is believed to be difficult to detect automatically.
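To make the idea of an entropy-like, unsupervised score concrete, the following is a minimal sketch and not the DCE algorithm itself. It assumes a simple leave-one-out scheme: each document is scored by its average code length in bits per character under an order-k character model estimated from all the other documents, so text that is copied many times across the collection scores low. The function names, the context length `k`, the add-one smoothing, and the `alphabet_size` parameter are all illustrative assumptions. The sketch uses plain hash-table counting and rebuilds the model for every document, so it does not achieve the linear-time bound that the abstract attributes to suffix trees.

```python
import math
from collections import Counter


def context_counts(docs, k):
    """Count k-character contexts and (k+1)-character grams over `docs`."""
    ctx, full = Counter(), Counter()
    for d in docs:
        for i in range(len(d) - k):
            ctx[d[i:i + k]] += 1
            full[d[i:i + k + 1]] += 1
    return ctx, full


def complexity(doc, others, k=3, alphabet_size=256):
    """Average bits per character of `doc` under an order-k model of `others`.

    Illustrative stand-in for an entropy-like document complexity;
    the smoothing scheme is an assumption, not taken from the paper.
    """
    ctx, full = context_counts(others, k)
    n = len(doc) - k
    if n <= 0:
        return 0.0
    bits = 0.0
    for i in range(n):
        c = ctx[doc[i:i + k]]          # how often the context was seen
        f = full[doc[i:i + k + 1]]     # how often context + next char was seen
        p = (f + 1) / (c + alphabet_size)  # add-one smoothed probability
        bits -= math.log2(p)
    return bits / n


def score_all(docs, k=3):
    """Leave-one-out score for every document in the collection."""
    return [
        complexity(d, docs[:i] + docs[i + 1:], k)
        for i, d in enumerate(docs)
    ]
```

Under these assumptions, a document would be flagged as suspicious when its score falls well below that of the rest of the collection; how to choose that threshold is left open here.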
Copyright information
© 2008 Springer Berlin Heidelberg
About this paper
Cite this paper
Uemura, T., Ikeda, D., Arimura, H. (2008). Unsupervised Spam Detection by Document Complexity Estimation. In: Boulicaut, JF., Berthold, M.R., Horváth, T. (eds) Discovery Science. DS 2008. Lecture Notes in Computer Science, vol 5255. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-88411-8_30
DOI: https://doi.org/10.1007/978-3-540-88411-8_30
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-88410-1
Online ISBN: 978-3-540-88411-8
eBook Packages: Computer Science, Computer Science (R0)