DOI: 10.1145/3407023.3407035
research-article

Machine learning for tree structures in fake site detection

Published: 25 August 2020

ABSTRACT

Tree data analysis has many applications in information security. In particular, the DOM trees of HTML pages are an important target of analysis because web pages can be both vectors for, and targets of, major cyberattacks such as phishing. Previous attempts to incorporate tree data analysis into security applications, however, have been hampered by the lack of efficient machine learning methods for tree data. As a result, most security research, like most machine learning work, has focused on data representable as vectors of real numbers. Recent work has nevertheless yielded several efficiency breakthroughs in tree analysis. One example is kernel methods, which bridge the gap between discretely structured data (like trees) and multivariate analysis. Kernel methods make it possible to apply a variety of multivariate analysis techniques, such as SVM and PCA, to trees. The method we are interested in is the subpath kernel, which offers the following advantages: (1) it is invariant over ordered and unordered trees; (2) it can be computed by a fast linear-time algorithm, whereas most tree kernels require quadratic time; (3) its excellent prediction accuracy has been demonstrated through intensive experiments. This paper proposes a subpath kernel-based method for tree-structured security data. To demonstrate the effectiveness of our method, we apply it to the problem of detecting fake e-commerce sites, a sub-problem of phishing detection with a significant real-world financial cost. In an experiment on a real dataset of fake sites provided by a major e-commerce company, our method achieved accuracy as high as 0.998 when training an SVM with as few as 1,000 instances. Its generalization efficiency is also excellent: with only 100 training instances, the accuracy reaches 0.996.
While previous phishing detection methods relied on textual content, URL components, and blacklists, our approach is the first to leverage DOM trees, which makes it both more effective and more robust against adversarial attacks. Unlike changing a URL or page content, changing a page's DOM structure imposes large costs on criminals.
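
To make the idea of a subpath kernel concrete, the following is a minimal sketch, not the linear-time algorithm referred to in the abstract. It assumes a common formulation in which the kernel value is the sum, over every label sequence found along a downward (ancestor-to-descendant) path, of the product of that sequence's occurrence counts in the two trees. The tree representation (`(label, children)` tuples), the function names, and the optional `decay` weight are illustrative assumptions rather than details taken from the paper.

```python
from collections import Counter

# Hypothetical tree representation: (label, [child, child, ...]).


def subpath_counts(tree):
    """Count every label sequence that occurs along a downward path."""
    counts = Counter()

    def visit(node, open_paths):
        label, children = node
        # Extend each path that ends at the parent, and start a new path here.
        paths = [p + (label,) for p in open_paths] + [(label,)]
        counts.update(paths)
        for child in children:
            visit(child, paths)

    visit(tree, [])
    return counts


def subpath_kernel(t1, t2, decay=1.0):
    """Naive subpath kernel: sum of count products over shared subpaths.

    `decay` (an assumed, optional weight) scales a subpath of length n
    by decay ** n; decay=1.0 gives plain counting.
    """
    c1, c2 = subpath_counts(t1), subpath_counts(t2)
    if len(c1) > len(c2):
        c1, c2 = c2, c1  # iterate over the smaller table
    return sum(n * c2[p] * decay ** len(p) for p, n in c1.items())


# Two toy DOM trees that differ only in the order of sibling subtrees.
page_a = ("html", [("body", [("div", [("a", [])]), ("div", [("form", [])])])])
page_b = ("html", [("body", [("div", [("form", [])]), ("div", [("a", [])])])])

print(subpath_kernel(page_a, page_a))
print(subpath_kernel(page_a, page_b))
```

Because only vertical paths are counted, reordering siblings does not change the value, which illustrates the invariance over ordered and unordered trees. This sketch enumerates all subpaths explicitly, so its cost grows with tree depth; it is meant only to show what is being counted. A matrix of such values over a set of pages could then be passed to any kernel-based learner that accepts a precomputed kernel.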


  • Published in

    ARES '20: Proceedings of the 15th International Conference on Availability, Reliability and Security
    August 2020
    1073 pages
    ISBN: 9781450388337
    DOI: 10.1145/3407023
    Program Chairs: Melanie Volkamer, Christian Wressnegger

    Copyright © 2020 ACM


    Publisher

    Association for Computing Machinery

    New York, NY, United States


    Acceptance Rates

    Overall Acceptance Rate: 228 of 451 submissions, 51%
