DOI:10.1145/3545258.3545271
research-article

WAIN: Automatic Web Application Identification and Naming Method

Published: 15 September 2022

Abstract

As defense shifts from vulnerability-centric to threat-centric, an efficient security architecture can be constructed only with an adequate understanding of the threats to critical assets. Recognizing and naming Web applications is fundamental to classifying and identifying these assets. At present, traditional Web application identification methods rely mainly on matching rules that are extracted from Web pages by manual analysis. This approach has low coverage and is labor-intensive; it is ill-suited to the current explosive growth in Web applications and inevitably leaves some uncommon applications unrecognized and at risk. In this paper, we propose WAIN, an automatic method for Web application identification and naming. WAIN first clusters the different types of applications in a large set of samples using the K-Means algorithm, and then leverages a novel TF-IDF calculation method to extract keywords. LDA is then applied to explain why some parts of the data are similar and to extract candidate fingerprints. Finally, WAIN uses filters and a statistical method to generate candidate names for the clusters. In our evaluation, we processed data from 30,000 instances of eight kinds of Web applications, and the generated fingerprints and names distinguish each type of application in the dataset. We manually checked all the results and found that, for each kind of application, WAIN successfully generated fingerprints and at least one name that summarizes at least one of the product name, manufacturer, or function.
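
The first two stages described in the abstract (clustering, then keyword extraction per cluster) can be illustrated with a minimal sketch. This is not the authors' implementation: the page snippets are invented toy data, the farthest-point seeding is an assumption made to keep the run deterministic, and the scoring is textbook TF-IDF with each cluster treated as one document rather than the paper's novel TF-IDF variant. The LDA and naming stages are omitted. All function names are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase a page snippet and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def sq_dist(a, b):
    """Squared Euclidean distance between two sparse term-count vectors."""
    return sum((a.get(t, 0) - b.get(t, 0)) ** 2 for t in set(a) | set(b))


def kmeans(vectors, k, max_iter=20):
    """Plain K-Means; seeding by farthest point is an assumption of this sketch."""
    centroids = [dict(vectors[0])]
    while len(centroids) < k:
        far = max(vectors, key=lambda v: min(sq_dist(v, c) for c in centroids))
        centroids.append(dict(far))
    labels = None
    for _ in range(max_iter):
        # Assignment step: each sample joins its nearest centroid.
        new_labels = [
            min(range(k), key=lambda j: sq_dist(v, centroids[j])) for v in vectors
        ]
        if new_labels == labels:
            break  # assignments stopped changing, so the run has converged
        labels = new_labels
        # Update step: each centroid becomes the mean of its members.
        for j in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == j]
            if members:
                total = Counter()
                for m in members:
                    total.update(m)
                centroids[j] = {t: c / len(members) for t, c in total.items()}
    return labels


def cluster_keywords(token_lists, labels, k, top_n=2):
    """Rank terms per cluster with standard TF-IDF, one 'document' per cluster."""
    counts = [Counter() for _ in range(k)]
    for tokens, lab in zip(token_lists, labels):
        counts[lab].update(tokens)
    keywords = []
    for c in counts:
        total = sum(c.values())
        scores = {}
        for term, n in c.items():
            df = sum(1 for other in counts if term in other)
            scores[term] = (n / total) * math.log(k / df)
        # Sort by descending score, breaking ties alphabetically.
        ranked = sorted(scores, key=lambda t: (-scores[t], t))
        keywords.append(ranked[:top_n])
    return keywords


# Hypothetical page snippets standing in for crawled Web application samples.
pages = [
    "wordpress blog wp admin login",
    "wordpress theme wp plugin login",
    "wordpress wp blog plugin",
    "jenkins build job dashboard login",
    "jenkins pipeline build job",
    "jenkins build dashboard pipeline login",
]
token_lists = [tokenize(p) for p in pages]
vectors = [Counter(tokens) for tokens in token_lists]
labels = kmeans(vectors, k=2)
keywords = cluster_keywords(token_lists, labels, k=2)
print(labels)
print(keywords)
```

In this toy run, a term such as "login" that appears in both clusters receives an IDF of zero, so it never surfaces as a keyword. That mirrors the intuition that a useful fingerprint must separate one application type from the others.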

Supplementary Material

Presentation slides (WAIN.pptx)

Published In

Internetware '22: Proceedings of the 13th Asia-Pacific Symposium on Internetware
June 2022
291 pages
ISBN:9781450397803
DOI:10.1145/3545258
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from the publisher.

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. Clustering
  2. Fingerprint Extraction
  3. Web Application Identification

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Conference

Internetware 2022

Acceptance Rates

Overall Acceptance Rate 55 of 111 submissions, 50%

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Total Citations: 0
  • Total Downloads: 59
  • Downloads (Last 12 months): 12
  • Downloads (Last 6 weeks): 0

Reflects downloads up to 16 Feb 2025
