Using NLP Specific Tools for Non-NLP Specific Tasks. A Web Security Application

Șulea, Octavia-Maria; Dinu, Liviu P.; Peşte, Alexandra

doi:10.1007/978-3-319-26561-2_74

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 9492)

Included in the following conference series:

International Conference on Neural Information Processing

Abstract

In this paper we look at the task of detecting URLs corresponding to infected web pages using Machine Learning and Natural Language Processing specific features. We show that these features render better performance than the previously used hand-crafted lexical features and render similar results to the more expensive host-based features. We also introduce a new adjacent task, that of identifying URLs pointing to the download of portable executable files, and show that our models perform very well on this task too.
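To make the contrast between the two feature families concrete, here is a minimal sketch in Python. It assumes that character n-grams are a representative "NLP specific" feature; the abstract does not say which features the authors actually used, so this is an illustration rather than their method. The function names, the particular hand-crafted features, and the sample URL are all hypothetical.

```python
from collections import Counter
from urllib.parse import urlparse


def char_ngrams(text, n_min=1, n_max=3):
    """Count every contiguous character n-gram of length n_min..n_max."""
    counts = Counter()
    for n in range(n_min, n_max + 1):
        for i in range(len(text) - n + 1):
            counts[text[i:i + n]] += 1
    return counts


def lexical_features(url):
    """A few hand-crafted lexical features, shown only for contrast.

    The choice of features here is an assumption, not taken from the paper.
    """
    parsed = urlparse(url)
    return {
        "url_length": len(url),
        "host_length": len(parsed.netloc),
        "path_depth": parsed.path.count("/"),
        "num_dots_in_host": parsed.netloc.count("."),
        "has_query": int(bool(parsed.query)),
    }


# Hypothetical URL, invented for illustration only.
url = "http://example.com/downloads/setup.exe"
ngrams = char_ngrams(url, 2, 3)   # bigrams and trigrams over the raw string
features = lexical_features(url)  # fixed, manually designed attributes
```

The n-gram counts treat the URL as plain text and need no knowledge of URL structure, whereas the lexical features depend on parsing the URL and on someone deciding in advance which attributes matter. Either representation could then be passed to a standard classifier.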

Notes

1. https://github.com/dmlc/xgboost
2. All datasets used in this paper can be accessed from here: https://github.com/mary-octavia/URL-datasets
3. http://www.bitdefender.com/oem/url-status.html
4. https://docs.python.org/2/library/urlparse.html
5. https://publicsuffix.org/list/effective_tld_names.dat

References

Aldwairi, M., Alsalman, R.: Malurls: a lightweight malicious website classification based on url features. J. Emerg. Technol. Web Intell. 4, 128–133 (2012)
Baayen, R.H., van Halteren, H., Neijt, A., Tweedie, F.: An experiment in authorship attribution. In: IEEE Intelligent Systems and Their Applications (2002)
Chaski, C.E.: The computational-linguistic approach to forensic authorship attribution. In: Olsen, F., Lorz, A., Stein, D. (eds.) Law and Language: Theory and Practice. Düsseldorf University Press (2008)
Choi, H., Zhu, B.B., Lee, H.: Detecting malicious web links and identifying their attack types. In: Fox, A. (ed.) 2nd USENIX Conference on Web Application Development, WebApps 2011, 15–16 June 2011. USENIX Association, Portland (2011)
Dinu, L.P., Popescu, M., Dinu, A.: Authorship identification of romanian texts with controversial paternity. In: Proceedings of the International Conference on Language Resources and Evaluation, LREC 2008, 26 May – 1 June 2008. European Language Resources Association, Marrakech (2008)
Fan, R.E., Chang, K.W., Hsieh, C.J., Wang, X.R., Lin, C.J.: Liblinear: a library for large linear classification. J. Mach. Learn. Res. 9, 1871–1874 (2008)
Goodfellow, I.J., Warde-Farley, D., Lamblin, P., Dumoulin, V., Mirza, M., Pascanu, R., Bergstra, J., Bastien, F., Bengio, Y.: Pylearn2: a machine learning research library (2013). arXiv preprint arXiv:1308.4214
Harper, A., Harris, S., Ness, J., Eagle, C., Lenkey, G., Williams, T.: Gray Hat Hacking: The Ethical Hacker’s Handbook, 3rd edn. McGraw-Hill, New York (2011)
Kan, M., Thi, H.O.N.: Fast webpage classification using URL features. In: Herzog, O., Schek, H., Fuhr, N., Chowdhury, A., Teiken, W. (eds.) Proceedings of the 2005 ACM CIKM International Conference on Information and Knowledge Management, pp. 325–326. ACM, Bremen, October 31 - November 5 (2005)
Ma, J., Saul, L.K., Savage, S., Voelker, G.M.: Beyond blacklists: learning to detect malicious web sites from suspicious urls. In: IV, J.F.E., Fogelman-Soulié, F., Flach, P.A., Zaki, M.J. (eds.) Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1245–1254. ACM (2009)
Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., Duchesnay, E.: Scikit-learn: machine learning in python. J. Mach. Learn. Res. 12, 2825–2830 (2011)
Van Hulse, J., Khoshgoftaar, T.M., Napolitano, A.: Experimental perspectives on learning from imbalanced data. In: Proceedings of the 24th International Conference on Machine Learning, ICML 2007, pp. 935–942. ACM, New York (2007). http://doi.acm.org/10.1145/1273496.1273614
Wei, Q., Dunbrack Jr, R.L.: The role of balanced training and testing data sets for binary classifiers in bioinformatics. PLoS One 8(7), e67863 (2013)
Weiss, G., Provost, F.: The effect of class distribution on classifier learning: an empirical study. Technical report (2001)

Acknowledgements

The authors would like to thank the reviewers for their input and their Bitdefender colleagues for the support given. All authors contributed equally. The work of Octavia-Maria Șulea was also supported by the strategic grant POSDRU/187/1.5/S/155559. The work of Liviu P. Dinu was supported by UEFISCDI, PNII-IDPCE-2011-3-0959.

Author information

Authors and Affiliations

Center for Computational Linguistics, University of Bucharest, Bucharest, Romania
Octavia-Maria Șulea & Liviu P. Dinu
Faculty of Mathematics and Computer Science, University of Bucharest, 14 Academiei Street, sector 1, Bucharest, Romania
Octavia-Maria Șulea, Liviu P. Dinu & Alexandra Peşte
Bitdefender, 24 Delea Veche Street, sector 2, Bucharest, Romania
Octavia-Maria Șulea & Alexandra Peşte

Authors

Octavia-Maria Șulea
Liviu P. Dinu
Alexandra Peşte

Corresponding author

Correspondence to Octavia-Maria Șulea.

Editor information

Editors and Affiliations

University of Istanbul, Istanbul, Turkey
Sabri Arik
University at Qatar, Doha, Qatar
Tingwen Huang
Tunku Abdul Rahman University College, Kuala Lumpur, Malaysia
Weng Kin Lai
University of Science Technology, Wuhan, China
Qingshan Liu

About this paper

Cite this paper

Șulea, OM., Dinu, L.P., Peşte, A. (2015). Using NLP Specific Tools for Non-NLP Specific Tasks. A Web Security Application. In: Arik, S., Huang, T., Lai, W., Liu, Q. (eds) Neural Information Processing. ICONIP 2015. Lecture Notes in Computer Science, vol 9492. Springer, Cham. https://doi.org/10.1007/978-3-319-26561-2_74

DOI: https://doi.org/10.1007/978-3-319-26561-2_74
Published: 18 November 2015
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-26560-5
Online ISBN: 978-3-319-26561-2
eBook Packages: Computer Science, Computer Science (R0)