Everything Is in the Name – A URL Based Approach for Phishing Detection

Tupsamudre, Harshal; Singh, Ajeet Kumar; Lodha, Sachin

doi:10.1007/978-3-030-20951-3_21


Conference paper
First Online: 19 May 2019


Part of the book series: Lecture Notes in Computer Science (LNSC, volume 11527)

Abstract

Phishing, an attack in which a user is tricked into revealing sensitive information on a spoofed website, is one of the most common threats to cybersecurity. Most modern web browsers counter phishing attacks using a blacklist of confirmed phishing URLs. However, one major disadvantage of the blacklist method is that it is ineffective against newly generated phishes. Machine learning based techniques that rely on features extracted from the URL (e.g., URL length and bag-of-words) or the web page (e.g., TF-IDF and form fields) are considered to be more effective in identifying new phishing attacks. The main benefit of using URL based features over page based features is that the machine learning model can classify new URLs on the fly, even before the page is loaded by the web browser, thus avoiding other potential dangers such as drive-by download attacks and cryptojacking attacks.
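As a rough illustration of URL based numerical features, the following sketch computes a few values from the URL string alone, without fetching the page. Only URL length is named in the abstract; the other features, the function name `numeric_url_features`, and the dictionary keys are assumptions chosen for illustration, not the authors' feature set.

```python
from urllib.parse import urlsplit


def numeric_url_features(url):
    """Return a few numerical features computed from the URL string only.

    Only "url_length" is mentioned in the abstract; the remaining
    features are illustrative assumptions.
    """
    # urlsplit only recognizes the host when a scheme or "//" is present,
    # so bare hosts such as "www.example.com" get a "//" prefix.
    parts = urlsplit(url if "//" in url else "//" + url)
    host = parts.hostname or ""
    return {
        "url_length": len(url),
        "host_length": len(host),
        "num_dots_in_host": host.count("."),
        "num_digits": sum(ch.isdigit() for ch in url),
    }


if __name__ == "__main__":
    print(numeric_url_features("www.paypalloginsecure.com"))
```

Because nothing here requires a network request, features of this kind can be computed before the browser loads the page.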

In this work, we focus on improving the performance of URL based detection techniques. We show that, although a classifier trained on traditional bag-of-words features (tokenized using special characters) works well in many cases, it fails to recognize, among others, a very prevalent class of phishing URLs that combine a popular brand with one or more words (e.g., www.paypalloginsecure.com and paypalhelpservice.simdif.com). To overcome these flaws, we explore various alternative feature extraction techniques based on word segmentation and \(n\)-grams. We also construct and use a phishy-list of popular words that are highly indicative of phishing attacks. We verify the efficacy of each of these feature sets by training a logistic regression classifier on a large dataset consisting of 100,000 URLs. Our experimental results reveal that features based on word segmentation, the phishy-list, and numerical properties (e.g., URL length) perform better than all other features, as measured by misclassification and false negative rates.
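The difference between special-character tokenization, word segmentation, and character \(n\)-grams can be sketched as follows. This is a minimal illustration, not the authors' implementation: the toy vocabulary `VOCAB`, the fewest-words segmentation criterion, and all function names are assumptions. A practical segmenter would typically rely on a large word list with frequencies.

```python
import re


def special_char_tokens(url):
    """Bag-of-words style tokens: split on any non-alphanumeric character."""
    return [t for t in re.split(r"[^a-z0-9]+", url.lower()) if t]


# Hypothetical toy vocabulary, for illustration only.
VOCAB = {"paypal", "login", "secure", "help", "service",
         "pay", "pal", "log", "in"}


def segment(token, vocab=VOCAB):
    """Split a token into the fewest vocabulary words (dynamic programming).

    Returns [token] unchanged when no full segmentation exists.
    """
    # best[i] holds the shortest known segmentation of token[:i], or None.
    best = [None] * (len(token) + 1)
    best[0] = []
    for end in range(1, len(token) + 1):
        for start in range(end):
            piece = token[start:end]
            if best[start] is not None and piece in vocab:
                candidate = best[start] + [piece]
                if best[end] is None or len(candidate) < len(best[end]):
                    best[end] = candidate
    return best[len(token)] or [token]


def char_ngrams(token, n=3):
    """All contiguous character n-grams of a token."""
    return [token[i:i + n] for i in range(len(token) - n + 1)]


if __name__ == "__main__":
    tokens = special_char_tokens("www.paypalloginsecure.com")
    print(tokens)                    # the brand stays hidden in one long token
    print(segment(tokens[1]))        # segmentation exposes the brand and words
    print(char_ngrams("paypal", 3))
```

Special-character tokenization leaves `paypalloginsecure` as a single opaque token, whereas segmentation recovers the brand name and the accompanying words as separate features.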



Author information

Authors and Affiliations

TCS Research, Pune, India
Harshal Tupsamudre, Ajeet Kumar Singh & Sachin Lodha


Corresponding author

Correspondence to Harshal Tupsamudre.

Editor information

Editors and Affiliations

Ben-Gurion University of the Negev, Beer-Sheva, Israel
Shlomi Dolev
Ben-Gurion University of the Negev, Beer-Sheva, Israel
Danny Hendler
Tata Consultancy Services, Chennai, India
Sachin Lodha
Columbia University and Google, New York, NY, USA
Moti Yung

Appendix A

The phishy-list consisting of 105 words extracted from the phishing dataset is given below:

{limited, securewebsession, confirmation, page, signin, team, sign, access, protection, active, manage, redirectme, http, secure, customer, account, client, information, recovery, verify, secured, busines, refund, help, safe, bank, event, promo, webservis, giveaway, card, webspace, user, notify, servico, store, device, payment, webnode, drive, shop, gold, violation, random, upgrade, webapp, dispute, setting, banking, activity, startup, review, email, approval, admin, browser, webapp, billing, advert, protect, case, temporary, alert, portal, login, servehttp, center, client, restore, secure, blob, smart, fortune, gift, server, security, page, confirm, notification, core, host, central, service, account, servise, support, apps, form, info, compute, verification, check, storage, setting, digital, update, token, required, resolution, ebayisapi, webscr, login, free, lucky, bonus}
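One plausible way to turn a word list like this into classifier features is to count how many words of a segmented URL appear in it. The sketch below uses only a small subset of the list above; the feature names, the function `phishy_features`, and the choice of a count plus a binary flag are assumptions for illustration, since this section does not state how the authors encode the phishy-list.

```python
# A small subset of the phishy-list above, for illustration only.
PHISHY_SUBSET = frozenset({
    "login", "secure", "signin", "verify",
    "account", "help", "service", "update",
})


def phishy_features(words):
    """Count how many of the given URL words occur in the phishy-list subset."""
    hits = [w for w in words if w.lower() in PHISHY_SUBSET]
    return {
        "phishy_count": len(hits),           # number of matching words
        "has_phishy_word": int(bool(hits)),  # 1 if any word matched, else 0
    }


if __name__ == "__main__":
    # Words as they might come out of a word segmentation step.
    print(phishy_features(["paypal", "login", "secure"]))
    print(phishy_features(["www", "example", "com"]))
```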


About this paper

Cite this paper

Tupsamudre, H., Singh, A.K., Lodha, S. (2019). Everything Is in the Name – A URL Based Approach for Phishing Detection. In: Dolev, S., Hendler, D., Lodha, S., Yung, M. (eds) Cyber Security Cryptography and Machine Learning. CSCML 2019. Lecture Notes in Computer Science, vol 11527. Springer, Cham. https://doi.org/10.1007/978-3-030-20951-3_21


DOI: https://doi.org/10.1007/978-3-030-20951-3_21
Published: 19 May 2019
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-20950-6
Online ISBN: 978-3-030-20951-3
eBook Packages: Computer Science, Computer Science (R0)
