research-article

Machine learning for tree structures in fake site detection

Authors:
Taichi Ishikawa (Carnegie Mellon University)
Yu-Lu Liu (Rakuten, Inc, Japan)
David Lawrence Shepard (Evidation Health)
Kilho Shin (Gakushuin University, Japan)

ARES '20: Proceedings of the 15th International Conference on Availability, Reliability and Security, August 2020, Article No.: 13, Pages 1–10, https://doi.org/10.1145/3407023.3407035

Published: 25 August 2020

ABSTRACT

Tree data analysis has many applications in information security. In particular, the DOM trees of HTML pages are an important target of analysis because web pages can be both vectors for and targets of major cyberattacks such as phishing. Previous attempts to incorporate tree data analysis into security applications, however, have been hampered by the lack of efficient machine learning methods for tree data. As a result, most security research, like most machine learning work, has focused on data representable as vectors of real numbers. Recent work, however, has yielded several efficiency breakthroughs in tree analysis. One example is kernel methods, which bridge the gap between discretely structured data (like trees) and multivariate analysis: they make it possible to apply multivariate analysis techniques such as SVM and PCA to trees. The method we are interested in is the subpath kernel, which offers the following advantages: (1) it is invariant over ordered and unordered trees; (2) it can be computed by a linear-time algorithm, whereas most tree kernels require quadratic time; (3) its excellent prediction accuracy has been demonstrated through intensive experiments. This paper proposes a subpath kernel-based method for tree-structured security data. To demonstrate the effectiveness of our method, we apply it to the problem of detecting fake e-commerce sites, a sub-problem of phishing detection with a significant real-world financial cost. In an experiment on a real dataset of fake sites provided by a major e-commerce company, our method exhibited accuracy as high as 0.998 when training an SVM with as few as 1,000 instances. It also generalizes well from little data: with only 100 training instances, the accuracy score reaches 0.996.
While previous phishing detection methods relied on textual content, URL components, and blacklists, our approach is the first to leverage DOM trees, which makes it both more effective and more robust against adversarial attacks. Unlike changing a URL or page content, changing a page's DOM structure imposes large costs on criminals.
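
To make the idea of a subpath kernel concrete, the following is a minimal sketch, not the authors' implementation. It assumes a tree is written as a `(label, [children])` tuple, treats a subpath as any downward sequence of node labels, and defines the kernel value as the number of matching subpath occurrences between two trees. It enumerates subpaths naively rather than using the linear-time algorithm mentioned in the abstract, and it applies no decay weighting. The names `subpath_counts` and `subpath_kernel` are hypothetical.

```python
from collections import Counter


def subpath_counts(tree):
    """Count every downward label path in a tree given as (label, [children])."""
    counts = Counter()

    def visit(node, ancestors):
        label, children = node
        path = ancestors + (label,)
        # Each suffix of the root-to-node path is a subpath ending at this node.
        for start in range(len(path)):
            counts[path[start:]] += 1
        for child in children:
            visit(child, path)

    visit(tree, ())
    return counts


def subpath_kernel(t1, t2):
    """Unweighted subpath kernel: sum over shared subpaths of count1 * count2."""
    c1, c2 = subpath_counts(t1), subpath_counts(t2)
    return sum(n * c2[p] for p, n in c1.items())


# Toy DOM-like trees (illustrative only, not data from the paper).
page_a = ("html", [("body", [("div", []), ("div", [])])])
page_b = ("html", [("body", [("div", [])])])

print(subpath_kernel(page_a, page_a))  # 15
print(subpath_kernel(page_a, page_b))  # 9
```

Because the counts depend only on ancestor chains, reordering a node's children leaves the kernel value unchanged, which illustrates advantage (1) above. A matrix of such pairwise values could then be handed to any kernel-based learner that accepts a precomputed kernel.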

Published in
ARES '20: Proceedings of the 15th International Conference on Availability, Reliability and Security
August 2020
1073 pages
ISBN:9781450388337
DOI:10.1145/3407023
Program Chairs:
Melanie Volkamer (Karlsruhe Institute of Technology (KIT), Competence Center for Applied Security Technology (KASTEL))
Christian Wressnegger (Karlsruhe Institute of Technology (KIT), Competence Center for Applied Security Technology (KASTEL))
Copyright © 2020 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 25 August 2020
Author Tags
fake sites detection
kernel method
web security
Qualifiers
- research-article
Acceptance Rates
Overall Acceptance Rate: 228 of 451 submissions, 51%
Article Metrics
- Total Citations: 4
- Total Downloads: 127
- Downloads (Last 12 months): 13
- Downloads (Last 6 weeks): 3