ABSTRACT
Tree data analysis has many applications in information security. In particular, the DOM trees of HTML pages are an important target of analysis because web pages can be vectors for, and targets of, major cyberattacks such as phishing. Previous attempts to incorporate tree data analysis into security applications, however, have been hampered by the lack of efficient machine learning methods for tree data. As a result, most security research, like most machine learning work, has focused on data representable as vectors of real numbers. Recent work, however, has yielded several efficiency breakthroughs in tree analysis. One example is kernel methods, which bridge the gap between discretely structured data (like trees) and multivariate analysis. Kernel methods make it possible to apply a variety of multivariate analysis techniques, such as SVM and PCA, to trees. The method we are interested in is the subpath kernel. The subpath kernel offers the following advantages: (1) it is invariant over ordered and unordered trees; (2) it can be computed by a linear-time algorithm, whereas most tree kernels require quadratic time; (3) its excellent prediction accuracy has been demonstrated through intensive experiments. This paper proposes a subpath kernel-based method for tree-structured security data. To demonstrate the effectiveness of our method, we apply it to the problem of detecting fake e-commerce sites, a sub-problem of phishing detection with a significant real-world financial cost. In an experiment on a real dataset of fake sites provided by a major e-commerce company, our method achieved an accuracy of 0.998 when training an SVM with as few as 1,000 instances. Its generalization efficiency is also excellent: with only 100 training instances, the accuracy reaches 0.996.
While previous phishing detection methods relied on textual content, URL components, and blacklists, our approach is the first to leverage DOM trees, which makes it both more effective and more robust against adversarial attacks. Unlike changing a URL or page content, changing a page's DOM structure imposes large costs on criminals.
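To make the idea of a subpath kernel concrete, the following is a minimal sketch and not the linear-time algorithm referred to above. It assumes one common definition: a subpath is a contiguous run of node labels along a root-to-leaf path, and the kernel value is the sum, over all label sequences, of the product of their occurrence counts in the two trees. The `(label, children)` tree encoding and the function names `subpath_counts` and `subpath_kernel` are hypothetical choices made for illustration only.

```python
from collections import Counter

# Assumed encoding for this sketch: a tree is a (label, children) pair,
# where children is a list of trees, e.g. ("html", [("body", [])]).

def subpath_counts(tree):
    """Count every downward label sequence (subpath) occurring in the tree."""
    counts = Counter()

    def visit(node, open_paths):
        label, children = node
        # Extend each subpath that ends at the parent, and start a new one here.
        paths = [p + (label,) for p in open_paths] + [(label,)]
        counts.update(paths)
        for child in children:
            visit(child, paths)

    visit(tree, [])
    return counts

def subpath_kernel(t1, t2):
    """Naive kernel value: sum over shared subpaths of count products."""
    c1, c2 = subpath_counts(t1), subpath_counts(t2)
    return sum(n * c2[p] for p, n in c1.items())

# Two toy DOM-like trees (illustrative data, not from the paper's dataset).
page_a = ("html", [("body", [("div", []), ("div", [])])])
page_b = ("html", [("body", [("div", [])])])
similarity = subpath_kernel(page_a, page_b)
```

Because only vertical label sequences are counted, reordering the children of a node leaves the value unchanged, which matches advantage (1). This enumeration can take time quadratic in tree size or worse on deep trees, so it serves only to illustrate the definition; a Gram matrix of such values could then be passed to any SVM implementation that accepts a precomputed kernel.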