Using NLP Specific Tools for Non-NLP Specific Tasks. A Web Security Application

  • Conference paper
Neural Information Processing (ICONIP 2015)

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 9492)

Abstract

In this paper we look at the task of detecting URLs that point to infected web pages, using machine learning with features specific to Natural Language Processing. We show that these features perform better than the previously used hand-crafted lexical features and achieve results similar to those of the more expensive host-based features. We also introduce a new adjacent task, identifying URLs that point to the download of portable executable files, and show that our models perform very well on this task too.
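To make the contrast in the abstract concrete, here is a minimal sketch that puts an NLP-style representation of a URL next to a few hand-crafted lexical features. The abstract does not say which NLP features the paper uses. Character n-grams are an assumption chosen for illustration, and the function names, the feature set, and the example URL are all hypothetical.

```python
from collections import Counter
from urllib.parse import urlparse


def char_ngrams(text, n=3):
    """Count overlapping character n-grams in text (assumed NLP-style feature)."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def handcrafted_features(url):
    """A few typical hand-crafted lexical features, shown only for contrast."""
    parsed = urlparse(url)
    return {
        "length": len(url),                   # total number of characters
        "num_dots": parsed.netloc.count("."), # dots in the host part
        "path_depth": parsed.path.count("/"), # slashes in the path
    }


# Hypothetical URL pointing at an executable download.
url = "http://example.com/download/setup.exe"
ngrams = char_ngrams(url, n=3)
lexical = handcrafted_features(url)
```

The n-gram counts treat the URL as plain text and need no knowledge of its structure, whereas each hand-crafted feature encodes a specific guess about what matters. Either representation could be passed to a standard classifier.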

Notes

  1. https://github.com/dmlc/xgboost

  2. All datasets used in this paper can be accessed from here: https://github.com/mary-octavia/URL-datasets

  3. http://www.bitdefender.com/oem/url-status.html

  4. https://docs.python.org/2/library/urlparse.html

  5. https://publicsuffix.org/list/effective_tld_names.dat

Acknowledgements

The authors would like to thank the reviewers for their input and their Bitdefender colleagues for their support. All authors contributed equally. The work of Octavia-Maria Șulea was also supported by the strategic grant POSDRU/187/1.5/S/155559. The work of Liviu P. Dinu was supported by UEFISCDI, PNII-IDPCE-2011-3-0959.

Corresponding author

Correspondence to Octavia-Maria Șulea.

Copyright information

© 2015 Springer International Publishing Switzerland

About this paper

Cite this paper

Șulea, OM., Dinu, L.P., Peşte, A. (2015). Using NLP Specific Tools for Non-NLP Specific Tasks. A Web Security Application. In: Arik, S., Huang, T., Lai, W., Liu, Q. (eds) Neural Information Processing. ICONIP 2015. Lecture Notes in Computer Science, vol 9492. Springer, Cham. https://doi.org/10.1007/978-3-319-26561-2_74

  • DOI: https://doi.org/10.1007/978-3-319-26561-2_74

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-26560-5

  • Online ISBN: 978-3-319-26561-2

  • eBook Packages: Computer Science, Computer Science (R0)
