Detecting and Augmenting Missing Key Aspects in Vulnerability Descriptions

Published: 09 April 2022

Abstract

Security vulnerabilities are continually disclosed and documented. To support the understanding, management, and mitigation of this fast-growing number of vulnerabilities, an important documentation practice is to describe the key vulnerability aspects, such as vulnerability type, root cause, affected product, impact, attacker type, and attack vector. In this article, we first investigate 133,639 vulnerability reports in the Common Vulnerabilities and Exposures (CVE) database over the past 20 years. We find that 56%, 85%, 38%, and 28% of CVEs are missing the vulnerability type, root cause, attack vector, and attacker type, respectively. By comparing the latest versions of CVE reports across different databases, we observe that 1,476 missing key aspects in 1,320 CVE descriptions were augmented manually in the National Vulnerability Database (NVD), which indicates that vulnerability database maintainers do try to complete vulnerability descriptions in practice to mitigate this problem.
To help complete the missing key vulnerability aspects and reduce human effort, we propose a neural-network-based approach called PMA that predicts the missing key aspects of a vulnerability from its known aspects. We systematically explore the design space of neural network models and empirically identify the most effective model design for this scenario. Our ablation study reveals prominent correlations among vulnerability aspects in prediction. Trained on historical CVEs, our model achieves F1 scores of 88%, 71%, 61%, and 81% for predicting the missing vulnerability type, root cause, attacker type, and attack vector, respectively, of 8,623 “future” CVEs across 3 years. Furthermore, we validate the performance of key aspect augmentation on the manually augmented CVE data collected from NVD, which confirms the practicality of our approach. Finally, we highlight that PMA can reduce human effort by recommending and augmenting missing key aspects for vulnerability databases, and can facilitate other research, such as predicting the severity level of CVEs from their vulnerability descriptions.
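
For readers unfamiliar with the F1 metric reported above, the following is a minimal sketch of how a per-label F1 score can be computed for a single predicted aspect. It is not PMA's evaluation code: the abstract does not say how F1 is averaged across labels, and the function name `f1_per_label` and the toy attacker-type labels are assumptions made purely for illustration.

```python
from collections import Counter

def f1_per_label(y_true, y_pred):
    """Return {label: F1} for single-label predictions (illustrative helper)."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1          # correct prediction for this label
        else:
            fp[pred] += 1           # predicted label was wrong
            fn[truth] += 1          # true label was missed
    scores = {}
    for label in set(y_true) | set(y_pred):
        # F1 = 2*TP / (2*TP + FP + FN), the harmonic mean of precision and recall
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores[label] = 2 * tp[label] / denom if denom else 0.0
    return scores

# Hypothetical attacker-type labels, not data from the article.
y_true = ["remote", "remote", "local", "local"]
y_pred = ["remote", "local", "local", "local"]
scores = f1_per_label(y_true, y_pred)
```

Here one of the two "remote" cases is mislabeled as "local", which lowers recall for "remote" and precision for "local", so both labels end up with an F1 below 1.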



    Published In

    ACM Transactions on Software Engineering and Methodology  Volume 31, Issue 3
    July 2022
    912 pages
    ISSN:1049-331X
    EISSN:1557-7392
    DOI:10.1145/3514181
    Editor: Mauro Pezzè

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 09 April 2022
    Online AM: 31 January 2022
    Accepted: 01 November 2021
    Revised: 01 September 2021
    Received: 01 January 2021
    Published in TOSEM Volume 31, Issue 3

    Author Tags

    1. CVE
    2. vulnerability description
    3. data augmentation
    4. deep neural network

    Qualifiers

    • Research-article
    • Refereed

    Funding Sources

    • The National Natural Science Foundation of China
