Feature Selection for Improved Phishing Detection

Basnet, Ram B.; Sung, Andrew H.; Liu, Quingzhong

doi:10.1007/978-3-642-31087-4_27

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 7345)

Included in the following conference series:

International Conference on Industrial, Engineering and Other Applications of Applied Intelligent Systems

Abstract

Phishing, a hotbed of a multibillion-dollar underground economy, has become an important cybersecurity problem. The centralized blacklist approach used by most web browsers usually fails to detect zero-day attacks, leaving ordinary users vulnerable to new phishing schemes; therefore, machine learning-based approaches have been implemented for phishing detection. Many existing techniques for phishing website detection seem to include as many features as can be conceived, while identifying a relevant and representative subset of features to construct an accurate classifier remains an interesting issue in this particular application of machine learning. This paper evaluates correlation-based and wrapper feature selection techniques using real-world phishing data sets with 177 initial features. Experimental results show that applying an effective feature selection procedure generally yields statistically significant improvements in the classification accuracies of, among others, Naïve Bayes, Logistic Regression, and Random Forests, in addition to improved training-time efficiency.
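
To make the idea of correlation-based feature selection concrete, the following is a minimal sketch, not the authors' implementation. It scores a feature subset with the standard CFS merit heuristic, merit = k·r_cf / sqrt(k + k(k−1)·r_ff), where r_cf is the mean feature-class correlation and r_ff the mean feature-feature correlation, and grows the subset by greedy forward search. The abstract does not state which correlation measure or search strategy was used, so the use of absolute Pearson correlation, the greedy search, the `min_gain` tolerance, and the ten-sample toy data are all assumptions made for illustration.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)

def cfs_merit(columns, labels, subset):
    """CFS merit of a subset: k * r_cf / sqrt(k + k*(k-1) * r_ff)."""
    k = len(subset)
    if k == 0:
        return 0.0
    # Mean absolute feature-class correlation (assumption: Pearson).
    r_cf = sum(abs(pearson(columns[j], labels)) for j in subset) / k
    # Mean absolute feature-feature correlation over all pairs in the subset.
    pairs = [(a, b) for i, a in enumerate(subset) for b in subset[i + 1:]]
    r_ff = (sum(abs(pearson(columns[a], columns[b])) for a, b in pairs) / len(pairs)
            if pairs else 0.0)
    return k * r_cf / math.sqrt(k + k * (k - 1) * r_ff)

def greedy_forward_cfs(columns, labels, min_gain=1e-12):
    """Add the feature that most improves the merit until nothing improves it."""
    selected, best = [], 0.0
    remaining = list(range(len(columns)))
    while remaining:
        best_j, best_m = None, best
        for j in remaining:
            m = cfs_merit(columns, labels, selected + [j])
            if m > best_m + min_gain:  # tolerance guards against float ties
                best_j, best_m = j, m
        if best_j is None:
            break
        selected.append(best_j)
        remaining.remove(best_j)
        best = best_m
    return selected, best

# Hypothetical toy data: 10 samples, 1 = phishing, 0 = legitimate.
labels = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
columns = [
    [0, 0, 0, 0, 1, 1, 1, 1, 1, 1],  # f0: informative, one error
    [0, 0, 0, 0, 1, 1, 1, 1, 1, 1],  # f1: exact duplicate of f0 (redundant)
    [0, 0, 0, 0, 0, 0, 1, 1, 1, 1],  # f2: informative, a different error
    [0, 1, 0, 1, 0, 1, 0, 1, 0, 1],  # f3: nearly uninformative
]

selected, merit = greedy_forward_cfs(columns, labels)
print(selected, round(merit, 4))
```

On this toy data the search keeps the two complementary features and skips both the duplicate and the near-noise column, which is the behaviour the merit heuristic is designed to reward. A wrapper method would instead score each candidate subset by the cross-validated accuracy of the target classifier, which is typically more expensive but tailored to that classifier.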

Author information

Authors and Affiliations

Computer Science and Engineering & ICASA, New Mexico Tech, Socorro, NM, USA
Ram B. Basnet & Andrew H. Sung
Computer Science, Sam Houston State University, Huntsville, TX, USA
Quingzhong Liu

Editor information

Editors and Affiliations

School of Software, Dalian University of Technology, Dalian, China
He Jiang
Department of Computer Science, University of Massachusetts Boston, 100 Morrissey Boulevard, 02125-3393, Boston, MA, USA
Wei Ding
Department of Computer Science, Texas State University San Marcos, 601 University Drive, 78666-4616, San Marcos, TX, USA
Moonis Ali
Department of Computer Science, University of Vermont, Burlington, VT, USA
Xindong Wu

Cite this paper

Basnet, R.B., Sung, A.H., Liu, Q. (2012). Feature Selection for Improved Phishing Detection. In: Jiang, H., Ding, W., Ali, M., Wu, X. (eds) Advanced Research in Applied Artificial Intelligence. IEA/AIE 2012. Lecture Notes in Computer Science, vol 7345. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-31087-4_27

DOI: https://doi.org/10.1007/978-3-642-31087-4_27
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-31086-7
Online ISBN: 978-3-642-31087-4
eBook Packages: Computer Science, Computer Science (R0)