An Automated Online Spam Detector Based on Deep Cascade Forest

  • Conference paper
Science of Cyber Security (SciSec 2019)

Part of the book series: Lecture Notes in Computer Science (LNSC, volume 11933)

Abstract

With the development of internet communication, spam has become ubiquitous in daily life. It not only disturbs users, but also cms. Although many spam detection methods exist in both cyber security and natural language processing, their performance still falls short of requirements. In this paper, we implement a deep cascade forest for spam detection, a deep model that does not use backpropagation. With fewer hyperparameters, the training cost can be easily controlled and is lower than that of neural network methods. Furthermore, the proposed deep cascade forest outperforms other machine learning models in the F1 score of detection. Therefore, given its lower training cost, it can serve as a useful online tool for spam detection.

Acknowledgment

This work was supported by the National Natural Science Foundation, China (Nos. 61806096, 61872190, 61403208).

Author information

Corresponding author

Correspondence to Xingguo Chen.

Appendices

A Performance Comparison Between Deep Cascade Forest and Other Classifiers

Tables 3 and 4 show the precision, recall, accuracy, and training/testing time of different models under different text processing methods on the SMS and YouTube datasets, respectively.

Table 3. Performance comparison between deep cascade forest and other classifiers on SMS spam dataset
Table 4. Performance comparison between deep cascade forest and other classifiers on YouTube spam dataset
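
As a reading aid for the metrics compared in Tables 3 and 4, the sketch below computes precision, recall, F1 score, and accuracy from confusion-matrix counts using the standard textbook definitions. It assumes spam is the positive class; the function name and the example counts are hypothetical and are not taken from the paper's experiments.

```python
def classification_metrics(tp, fp, fn, tn):
    """Standard binary-classification metrics from confusion-matrix counts.

    tp/fp/fn/tn are counts with spam treated as the positive class
    (an assumption; the page does not state the convention).
    """
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}


# Made-up counts, for illustration only.
metrics = classification_metrics(tp=90, fp=10, fn=30, tn=870)
print(metrics)
```

Because spam is usually the minority class, accuracy alone can look high even when many spam messages are missed, which is one reason to read it alongside the F1 score.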

B Parameters of LSTM in our Experiment

The LSTM layer contains 64 units for SMS spam detection and 100 units for YouTube spam detection. On both datasets, the batch size is set to 128. In addition, an embedding layer is used to convert each word in the sequence into a dense vector in advance. The embedding layer is followed by the LSTM layer, a fully connected layer with 256 units, an activation layer using the ReLU function, a dropout layer with a dropout rate of 0.1, a fully connected layer with 1 unit, and an activation layer using the sigmoid function.

Fig. 5. A1. LSTM classification model

The structure of the LSTM model is shown in Fig. 5. Suppose the texts-to-sequences processing approach is used; the resulting integer sequence is then fed to the LSTM model. Each token in the sequence is embedded into a 50-dimensional word vector x and passed to the LSTM layer with 100 hidden units (the YouTube configuration). The 100-dimensional output vector then passes in turn through a fully connected layer with 256 units, a ReLU layer with 256 units, and a fully connected layer with 1 unit, and the predicted label is finally obtained through the sigmoid mapping.
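
To make the layer sizes above concrete, the following sketch counts the trainable weights of each layer in the described stack. It assumes a standard LSTM cell (four gate transforms, each with input weights, recurrent weights, and a bias) and a hypothetical vocabulary size of 10,000, which the text does not state; the class and function names are illustrative, and this is not the authors' code.

```python
from dataclasses import dataclass


@dataclass
class LstmConfig:
    vocab_size: int          # assumption: not given in the text
    embed_dim: int = 50      # word-vector dimension from the text
    lstm_units: int = 100    # 100 for YouTube, 64 for SMS
    dense_units: int = 256   # first fully connected layer


def layer_parameters(cfg):
    """Count trainable weights per layer, assuming a standard LSTM cell."""
    d, h, f = cfg.embed_dim, cfg.lstm_units, cfg.dense_units
    return {
        "embedding": cfg.vocab_size * d,
        # Four gate transforms, each with input weights (d*h),
        # recurrent weights (h*h), and a bias vector (h).
        "lstm": 4 * (d * h + h * h + h),
        "dense_256": h * f + f,
        # ReLU, dropout, and sigmoid layers have no trainable weights.
        "dense_1": f + 1,
    }


youtube = layer_parameters(LstmConfig(vocab_size=10_000, lstm_units=100))
sms = layer_parameters(LstmConfig(vocab_size=10_000, lstm_units=64))
print("YouTube:", youtube)
print("SMS:", sms)
```

Under these assumptions the embedding table dominates the parameter count, so the real vocabulary size would matter more for model size than the choice between 64 and 100 LSTM units.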

Copyright information

© 2019 Springer Nature Switzerland AG

About this paper

Cite this paper

Chen, K., Zou, X., Chen, X., Wang, H. (2019). An Automated Online Spam Detector Based on Deep Cascade Forest. In: Liu, F., Xu, J., Xu, S., Yung, M. (eds) Science of Cyber Security. SciSec 2019. Lecture Notes in Computer Science, vol 11933. Springer, Cham. https://doi.org/10.1007/978-3-030-34637-9_3

  • DOI: https://doi.org/10.1007/978-3-030-34637-9_3

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-34636-2

  • Online ISBN: 978-3-030-34637-9

  • eBook Packages: Computer Science, Computer Science (R0)
