Feature Selection for Improved Phishing Detection

  • Conference paper
Advanced Research in Applied Artificial Intelligence (IEA/AIE 2012)

Abstract

Phishing, a hotbed of a multibillion-dollar underground economy, has become an important cybersecurity problem. The centralized blacklist approach used by most web browsers usually fails to detect zero-day attacks, leaving ordinary users vulnerable to new phishing schemes; machine learning based approaches have therefore been implemented for phishing detection. Many existing phishing website detection techniques seem to include as many features as can be conceived, while identifying a relevant and representative subset of features from which to construct an accurate classifier remains an interesting issue in this particular application of machine learning. This paper evaluates correlation-based and wrapper feature selection techniques using real-world phishing data sets with 177 initial features. Experimental results show that applying an effective feature selection procedure generally yields statistically significant improvements in the classification accuracy of, among others, Naïve Bayes, Logistic Regression and Random Forests, in addition to shorter training time.
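As a rough illustration of the correlation-based selection named in the abstract, the sketch below scores a feature subset with the CFS merit, k·r_cf / sqrt(k + k(k−1)·r_ff), and grows the subset greedily while the merit improves. It is a minimal sketch under stated assumptions, not the paper's procedure: absolute Pearson correlation stands in for the correlation measure, the search is plain forward selection, and the feature names and toy data are hypothetical.

```python
import math
from itertools import combinations


def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric lists (0.0 if either is constant)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / math.sqrt(vx * vy)


def cfs_merit(columns, labels, subset):
    """CFS merit: high mean feature-class correlation, low mean feature-feature correlation."""
    k = len(subset)
    if k == 0:
        return 0.0
    r_cf = sum(abs(pearson(columns[f], labels)) for f in subset) / k
    pairs = list(combinations(subset, 2))
    r_ff = (sum(abs(pearson(columns[a], columns[b])) for a, b in pairs) / len(pairs)
            if pairs else 0.0)
    return k * r_cf / math.sqrt(k + k * (k - 1) * r_ff)


def greedy_forward_cfs(columns, labels):
    """Add the feature that most improves the merit; stop when no feature improves it."""
    selected, best = [], 0.0
    remaining = set(columns)
    while remaining:
        # Sorted iteration makes tie-breaking deterministic (first name wins).
        candidate = max(sorted(remaining),
                        key=lambda f: cfs_merit(columns, labels, selected + [f]))
        merit = cfs_merit(columns, labels, selected + [candidate])
        if merit <= best:
            break
        selected.append(candidate)
        remaining.remove(candidate)
        best = merit
    return selected, best


# Hypothetical toy data: 1 = phishing, 0 = legitimate.
labels = [1, 1, 1, 1, 0, 0, 0, 0]
columns = {
    "ip_in_url":     [1, 1, 1, 0, 1, 0, 0, 0],
    "ip_in_url_dup": [1, 1, 1, 0, 1, 0, 0, 0],  # redundant copy of ip_in_url
    "many_dots":     [1, 1, 0, 1, 0, 1, 0, 0],
    "noise":         [1, 0, 1, 0, 1, 0, 1, 0],  # uncorrelated with the label
}

selected, merit = greedy_forward_cfs(columns, labels)
print(selected, round(merit, 4))
```

On this toy data the redundant copy and the noise feature are left out, because adding either one lowers the merit. A wrapper approach would keep the same search loop but replace `cfs_merit` with the cross-validated accuracy of the target classifier, which is usually more expensive to compute.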

Copyright information

© 2012 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Basnet, R.B., Sung, A.H., Liu, Q. (2012). Feature Selection for Improved Phishing Detection. In: Jiang, H., Ding, W., Ali, M., Wu, X. (eds) Advanced Research in Applied Artificial Intelligence. IEA/AIE 2012. Lecture Notes in Computer Science, vol 7345. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-31087-4_27

  • DOI: https://doi.org/10.1007/978-3-642-31087-4_27

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-31086-7

  • Online ISBN: 978-3-642-31087-4

  • eBook Packages: Computer Science (R0)
