Everything Is in the Name – A URL Based Approach for Phishing Detection

  • Conference paper
Part of the book series: Lecture Notes in Computer Science (LNCS, volume 11527)

Abstract

Phishing, in which a user is tricked into revealing sensitive information on a spoofed website, is one of the most common threats to cybersecurity. Most modern web browsers counter phishing attacks using a blacklist of confirmed phishing URLs. However, a major disadvantage of the blacklist method is that it is ineffective against newly generated phishing URLs. Machine learning based techniques that rely on features extracted from the URL (e.g., URL length and bag-of-words) or from the web page (e.g., TF-IDF and form fields) are considered more effective at identifying new phishing attacks. The main benefit of URL based features over page based features is that the machine learning model can classify new URLs on the fly, even before the page is loaded by the web browser, thus avoiding other potential dangers such as drive-by download attacks and cryptojacking attacks.

In this work, we focus on improving the performance of URL based detection techniques. We show that, although a classifier trained on traditional bag-of-words features (tokenized using special characters) works well in many cases, it fails to recognize, among others, a very prevalent class of phishing URLs that combine a popular brand with one or more words (e.g., www.paypalloginsecure.com and paypalhelpservice.simdif.com). To overcome these flaws, we explore various alternative feature extraction techniques based on word segmentation and \(n\)-grams. We also construct and use a phishy-list of popular words that are highly indicative of phishing attacks. We verify the efficacy of each of these feature sets by training a logistic regression classifier on a large dataset consisting of 100,000 URLs. Our experimental results reveal that features based on word segmentation, the phishy-list and numerical properties (e.g., URL length) perform better than all other features, as measured by misclassification and false negative rates.
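The failure mode described above can be illustrated with a minimal sketch. It is not the authors' implementation: the toy vocabulary `VOCAB` and the helper names `tokenize_special`, `char_ngrams` and `segment` are assumptions made for illustration, and the dictionary-based segmenter is a simple stand-in for a corpus-based word segmentation method.

```python
import re

# Hypothetical toy vocabulary (an assumption for this sketch); a practical
# segmenter would rely on a large corpus-derived word list instead.
VOCAB = {"paypal", "pay", "pal", "login", "log", "in", "secure", "help", "service"}


def tokenize_special(url):
    """Traditional bag-of-words tokens: split on non-alphanumeric characters."""
    return [t for t in re.split(r"[^a-z0-9]+", url.lower()) if t]


def char_ngrams(token, n=3):
    """All contiguous character n-grams of a token."""
    return [token[i:i + n] for i in range(len(token) - n + 1)]


def segment(token, vocab=VOCAB):
    """Split a token into the fewest vocabulary words using dynamic programming.

    Returns [token] unchanged when no full segmentation exists.
    """
    best = [None] * (len(token) + 1)  # best[i] = best split of token[:i]
    best[0] = []
    for end in range(1, len(token) + 1):
        for start in range(end):
            piece = token[start:end]
            if best[start] is not None and piece in vocab:
                candidate = best[start] + [piece]
                if best[end] is None or len(candidate) < len(best[end]):
                    best[end] = candidate
    return best[-1] if best[-1] is not None else [token]


url = "www.paypalloginsecure.com"
tokens = tokenize_special(url)
# Special-character tokenization leaves the brand and keywords fused together.
print(tokens)
# Segmentation recovers the individual words hidden in each token.
print([word for t in tokens for word in segment(t)])
```

With special-character tokenization, the fused token `paypalloginsecure` is a single feature that a classifier is unlikely to have seen during training, whereas the segmented words `paypal`, `login` and `secure` are individually informative.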


Author information

Corresponding author

Correspondence to Harshal Tupsamudre.

Appendix A

The phishy-list consisting of 105 words extracted from the phishing dataset is given below:

{limited, securewebsession, confirmation, page, signin, team, sign, access, protection, active, manage, redirectme, http, secure, customer, account, client, information, recovery, verify, secured, busines, refund, help, safe, bank, event, promo, webservis, giveaway, card, webspace, user, notify, servico, store, device, payment, webnode, drive, shop, gold, violation, random, upgrade, webapp, dispute, setting, banking, activity, startup, review, email, approval, admin, browser, webapp, billing, advert, protect, case, temporary, alert, portal, login, servehttp, center, client, restore, secure, blob, smart, fortune, gift, server, security, page, confirm, notification, core, host, central, service, account, servise, support, apps, form, info, compute, verification, check, storage, setting, digital, update, token, required, resolution, ebayisapi, webscr, login, free, lucky, bonus}
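As a rough illustration of how such a list could feed a classifier, the sketch below derives a few features from a URL and its already segmented words. The feature names, the function `phishy_features` and the choice of an eight-word subset are assumptions made for this example; the paper's exact feature encoding is not given in this excerpt.

```python
# A small subset of the phishy-list above, chosen only to keep the example short.
PHISHY_SUBSET = frozenset(
    {"login", "secure", "help", "service", "account", "verify", "signin", "update"}
)


def phishy_features(url, words):
    """Build a small feature dictionary from a URL and its segmented words.

    `words` is assumed to come from an earlier word segmentation step.
    """
    hits = [w for w in words if w in PHISHY_SUBSET]
    return {
        "url_length": len(url),        # numerical feature
        "phishy_count": len(hits),     # how many segmented words are on the list
        "has_phishy": int(bool(hits)), # binary indicator
    }


features = phishy_features(
    "www.paypalloginsecure.com",
    ["www", "paypal", "login", "secure", "com"],
)
print(features)
```

A dictionary of this shape can be turned into a numeric vector and passed to a linear model such as logistic regression, alongside the bag-of-words features from the segmented URL.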

Copyright information

© 2019 Springer Nature Switzerland AG

About this paper

Cite this paper

Tupsamudre, H., Singh, A.K., Lodha, S. (2019). Everything Is in the Name – A URL Based Approach for Phishing Detection. In: Dolev, S., Hendler, D., Lodha, S., Yung, M. (eds) Cyber Security Cryptography and Machine Learning. CSCML 2019. Lecture Notes in Computer Science, vol 11527. Springer, Cham. https://doi.org/10.1007/978-3-030-20951-3_21

  • DOI: https://doi.org/10.1007/978-3-030-20951-3_21

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-20950-6

  • Online ISBN: 978-3-030-20951-3

  • eBook Packages: Computer Science, Computer Science (R0)
