ABSTRACT
Existing phishing detection techniques mainly rely on blacklists or content-based analysis, which are not only evadable but, being reactive in nature, also exhibit considerable detection delays. Our in-depth analysis shows that artifacts of phishing manifest in various sources of intelligence related to a domain even before its contents are online. In particular, we study novel patterns and characteristics computed from viable data sources, including Certificate Transparency logs and passive DNS records. To compare benign and phishing domains, we construct thoroughly verified, realistic benign and phishing datasets. Our analysis shows clear differences between benign and phishing domains that can pave the way for content-agnostic approaches that predict phishing domains even before their webpages are up and running.
To demonstrate the usefulness of our analysis, we train a classifier on distinctive features and show that we can (1) perform content-agnostic predictions with a very low false positive rate (FPR) of 0.3%, high precision (98%), and high recall (90%), and (2) predict phishing domains days before they are discovered by state-of-the-art content-based tools such as VirusTotal.
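For readers less familiar with the three metrics reported above, the following minimal sketch shows how precision, recall, and FPR are derived from confusion-matrix counts. The function name `detection_metrics` and the counts passed to it are hypothetical and chosen only for illustration; they are not the paper's data or evaluation code.

```python
def detection_metrics(tp: int, fp: int, tn: int, fn: int) -> dict[str, float]:
    """Compute precision, recall, and false positive rate from raw counts.

    tp: phishing domains flagged as phishing
    fp: benign domains flagged as phishing
    tn: benign domains left unflagged
    fn: phishing domains left unflagged
    """
    return {
        "precision": tp / (tp + fp),  # share of flagged domains that are phishing
        "recall": tp / (tp + fn),     # share of phishing domains that are flagged
        "fpr": fp / (fp + tn),        # share of benign domains wrongly flagged
    }


# Hypothetical counts for illustration only (not taken from the paper).
metrics = detection_metrics(tp=90, fp=2, tn=998, fn=10)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

FPR is computed over benign domains only, whereas precision also depends on the ratio of benign to phishing domains in the evaluated set, which is why the two are reported separately.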
Index Terms
- Content-Agnostic Detection of Phishing Domains using Certificate Transparency and Passive DNS