Abstract
In this paper we look at the task of detecting URLs that point to infected web pages, using machine learning with features borrowed from natural language processing (NLP). We show that these features outperform the previously used hand-crafted lexical features and yield results similar to those of the more expensive host-based features. We also introduce a new adjacent task, identifying URLs that point to the download of portable executable files, and show that our models perform very well on this task too.
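The abstract does not say which NLP features are used, so the following is only a minimal sketch of the contrast it draws, assuming character n-gram counts as a stand-in for "NLP specific" features. The function names, the particular hand-crafted features, and the example URL are all hypothetical and are not taken from the paper or its datasets.

```python
from collections import Counter
from urllib.parse import urlparse


def char_ngrams(url: str, n_min: int = 1, n_max: int = 3) -> Counter:
    """Count every character n-gram of length n_min..n_max in the raw URL string.

    Assumption: character n-grams stand in for the NLP-style features;
    the abstract itself does not name the feature set.
    """
    counts = Counter()
    for n in range(n_min, n_max + 1):
        for i in range(len(url) - n + 1):
            counts[url[i:i + n]] += 1
    return counts


def handcrafted_lexical(url: str) -> dict:
    """A few illustrative hand-crafted lexical features (hypothetical selection)."""
    parsed = urlparse(url)
    return {
        "url_length": len(url),
        "host_length": len(parsed.netloc),
        "num_dots": url.count("."),
        "num_digits": sum(ch.isdigit() for ch in url),
        "ends_with_exe": url.lower().endswith(".exe"),
    }


# Made-up URL used purely for illustration.
example = "http://example.com/files/setup.exe"
ngrams = char_ngrams(example)
lexical = handcrafted_lexical(example)
```

The n-gram counts need no domain knowledge about URL structure, whereas each hand-crafted feature encodes a specific human guess about what matters. Either representation could then be passed to a linear classifier.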
Notes
All datasets used in this paper can be accessed from here: https://github.com/mary-octavia/URL-datasets
Acknowledgements
The authors would like to thank the reviewers for their input and their Bitdefender colleagues for their support. All authors contributed equally. The work of Octavia-Maria Șulea was also supported by the strategic grant POSDRU/187/1.5/S/155559. The work of Liviu P. Dinu was supported by UEFISCDI, PNII-IDPCE-2011-3-0959.
Copyright information
© 2015 Springer International Publishing Switzerland
About this paper
Cite this paper
Șulea, O.-M., Dinu, L.P., Peşte, A. (2015). Using NLP Specific Tools for Non-NLP Specific Tasks. A Web Security Application. In: Arik, S., Huang, T., Lai, W., Liu, Q. (eds) Neural Information Processing. ICONIP 2015. Lecture Notes in Computer Science, vol 9492. Springer, Cham. https://doi.org/10.1007/978-3-319-26561-2_74
DOI: https://doi.org/10.1007/978-3-319-26561-2_74
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-26560-5
Online ISBN: 978-3-319-26561-2
eBook Packages: Computer Science (R0)