
1 Introduction

The rapid development of the Internet has changed not only traditional society but also the software industry. Benefitting from collaboration among a fast-growing community of developers and from open information sharing, open source software development is accelerating at an amazing speed. However, the component reuse and library dependencies that are common in open source communities present new challenges. Software vulnerability has emerged as a new concern for open source software and projects.

In order to analyze vulnerabilities comprehensively, multiple sources of information must be integrated and linked together. Currently, software vulnerabilities, open source project information and source code management information are stored in different locations with different representations and formats. It is therefore difficult for analysts to cross-reference comprehensive analytical information as a whole picture. Consequently, it is of practical importance to develop a unified knowledge representation that makes it possible to integrate all information across resource boundaries for further knowledge sharing and more in-depth analysis.

Since it was proposed by GoogleFootnote 1, the Knowledge Graph, with its ability to represent complex concepts and relationships, has been rapidly applied to different research domains, such as the semantic web, intelligent search and knowledge computing. Within the software security domain, it is introduced to represent the concepts of vulnerabilities, products, versions and related dependent libraries or projects, as well as the complex relationships among these concepts. Taking advantage of the interlinked data, the software vulnerability knowledge graph enables vulnerability propagation tracing, component dependency management, and relationship inference.

A software vulnerability knowledge graph is defined as a graph dedicated to representing relevant knowledge about software security domains, projects and systems. A software vulnerability knowledge graph typically consists of a series of nodes and edges: the nodes represent entities such as software projects, vulnerabilities or any user-defined entity types, while the edges represent relationships between two different entities. Both the nodes and the edges have various properties to depict their inner features. For example, a CVE-type node has an ID, name and published date as properties to distinguish it from other nodes.
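As a minimal illustration of this node/edge/property model, such a graph can be represented as plain records before it is loaded into a graph database. The entity ids and property values below are illustrative, not real data from the graph:

```python
# Minimal sketch of knowledge-graph nodes and edges as plain Python records.
# Node ids and property values are illustrative only.

def make_node(node_id, node_type, **properties):
    """A node carries a type label and arbitrary inner properties."""
    return {"id": node_id, "type": node_type, "properties": properties}

def make_edge(source_id, target_id, relationship):
    """An edge links two node ids with a named relationship."""
    return {"source": source_id, "target": target_id, "relationship": relationship}

cve = make_node("n1", "Vulnerability", cve_id="CVE-2015-2156",
                name="netty cookie vulnerability")
version = make_node("n2", "ProductVersion", name="netty", version="3.10.1")
edge = make_edge("n1", "n2", "AFFECTS")
```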

However, one precondition is to perform accurate ontology matching in the vulnerability knowledge graph, so that the core ontologies are linked with accurate relationship edges. Ontology matching, also called ontology alignment, is a necessary procedure that helps reduce the semantic gap between different domain knowledge. After the ontology matching is completed, vulnerability information, project meta-data and software configurations can be linked together within the knowledge graph, ready for further analysis. The quality of ontology matching is therefore crucial, since it directly determines the accuracy of the analytical results.

Hence, in this paper, we first propose a software vulnerability knowledge graph ontology to represent vulnerability concepts with their surrounding relationships. Moreover, two ontology matching approaches are proposed: one links Github projects with Maven projects, and the other links CVE product versions with Maven project versions. The major contributions of this work are summarized as follows:

  • To comprehensively analyze vulnerabilities from multiple resources, we propose a software vulnerability ontology integrating information from CVE, Maven, and Github.

  • Based on the vulnerability knowledge graph, to refine traceability links between vulnerabilities and software components, we propose an Ontology Matching approach based on Github URL text-matching to link Maven projects with Github projects (OM-MG).

  • In addition, we propose an Ontology Matching approach using Random Forests to match CVE product versions with Maven project versions (OM-CM).

The remainder of this paper is organized as follows: Sect. 2 describes related work. Section 3 introduces background concepts about the vulnerability database, Maven and Github. Section 4 introduces the proposed software vulnerability ontology. Section 5 then details the proposed ontology matching approaches OM-MG and OM-CM. Section 6 presents detailed experimental results. Finally, Sect. 7 draws conclusions.

2 Related Work

Ontology matching, also called ontology alignment, originated from the integration of relational database schemas [6]. It has since developed in two directions: (1) research on matching algorithms that automate the similarity-finding tasks, and (2) research on efficient interfaces that assist the alignment process. Shvaiko et al. [22] summarized the existing ontology matching problems, evaluated the efficiency of matching techniques and pointed out new challenges.

Ontology matching techniques [19] can be divided into two categories: (1) Context-based, which matches external information extracted from outside resources; it can be further divided into two sub-categories, Semantic-based [15, 21] and Syntactic-based techniques; (2) Content-based, which focuses on matching the internal information of the ontologies themselves; similarly, it can be further divided into sub-categories including Terminological [1, 10], Structural [2, 12], Extensional [14] and Semantic [20] techniques.

Within the information extraction domain, named entity alignment, or instance alignment, is commonly used to match different ontologies. Jean et al. [11] proposed an approach for automated semantic matching of ontologies with verification. Their approach, combining lexical, structural and extensional matchers, utilized the WordNet thesaurus to perform the alignment of specialized ontologies. Loia et al. [14] proposed a content-aware hybrid matching algorithm that utilizes weighted similarity measures over string-based, web-based and corpus-based similarity. In addition, frameworks and systems aiming at ontology alignment have also been proposed, such as AgreementMaker [5], AUTOMSv2 [13] and CIDER [9].

In the semantic web domain, machine learning based techniques [7] were introduced to cope with big data when building large-scale information management systems. Ngo et al. [16, 17] introduced a combination of graph matching and machine learning to align ontologies automatically and reduce the manual effort involved. In their approach, terminological features were extracted and fed into a decision tree algorithm. However, due to feature noise and the local overfitting shortcomings of the decision tree algorithm, their results still need improvement.

Among research on software vulnerability, to the best of our knowledge, Alqahtani et al. [3] proposed a modeling approach that combines a software vulnerability ontology with Maven information to trace software vulnerabilities in software repositories. In addition, they proposed a matching approach on top of context-based matching that utilizes the semantic equivalence constructs (e.g. owl:equivalentClass and owl:equivalentProperty) of the Web Ontology Language (OWL)Footnote 2 to match ontologies. It should be noted, though, that their approach lacked the important software development inner-state information that can be extracted from Github repositories. Their evaluation showed that the precision of this ontology matching approach ranged from 80% to 95%, with an average value of 90%. In practice, however, our preliminary analysis shows that there are approximately 186,352 product versions, while the number of project versions reaches as high as 2,383,247. Even a generally acceptable error of 10% could result in a large number of ontology mismatches and, in turn, a serious deviation in the final analysis.

To sum up, vulnerability information can be retrieved from the Internet. Motivated by research in the semantic web domain, a software vulnerability knowledge graph is adopted to comprehensively analyze vulnerability tracing problems. Existing matching approaches are still insufficient to produce accurate matching results, so a machine learning algorithm is introduced to improve the quality of the links between the vulnerability, Maven and Github ontologies.

3 Preliminaries

3.1 National Vulnerabilities Database (NVD)

NVD [18] is the U.S. government repository of standards-based vulnerability management data, which enables the automation of vulnerability management, security measurement and compliance. Common Vulnerabilities and Exposures (CVE), a public dictionary containing a series of tags, custom keywords and descriptions, was introduced to identify common vulnerabilities across different software resources (open source projects, commercial products, etc.). As soon as a new vulnerability is revealed and reported, it is first reviewed and verified by security experts. Once confirmed, the new vulnerability is added to the database along with additional information such as a new identifying number (CVE ID), descriptions and meta-data. By the end of 2017, there were a total of 94,035 vulnerability records stored in the CVE database. Moreover, the Common Weakness Enumeration (CWE)Footnote 3 and the Common Vulnerability Scoring System (CVSS)Footnote 4 provide additional information about a particular software vulnerability, in the form of weakness categories and numerical security severity scores.
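As a hedged sketch of how such records can be consumed, the snippet below extracts the CVE ID and description from an NVD-style JSON record. The record here mimics the layout of the NVD JSON data feeds; field names vary across feed versions, and the description text and date are placeholders, so treat the structure as an assumption rather than the exact schema used in this work:

```python
import json

# A single NVD-style record (structure assumed; description/date are placeholders).
feed = json.loads("""
{"CVE_Items": [{
  "cve": {
    "CVE_data_meta": {"ID": "CVE-2015-2156"},
    "description": {"description_data": [
      {"lang": "en", "value": "Example description text."}]}
  },
  "publishedDate": "2015-01-01T00:00Z"}]}
""")

def extract_cves(feed):
    """Flatten feed items into simple records for later CSV export."""
    records = []
    for item in feed["CVE_Items"]:
        meta = item["cve"]["CVE_data_meta"]
        descs = item["cve"]["description"]["description_data"]
        english = next(d["value"] for d in descs if d["lang"] == "en")
        records.append({"cve_id": meta["ID"],
                        "published": item.get("publishedDate"),
                        "description": english})
    return records

records = extract_cves(feed)
```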

3.2 Maven and Maven Central Repository

Apache MavenFootnote 5, one of the most widely used management and comprehension tools for Java projects, adopts the concept of a project object model (POM) and can manage a project's build, reporting and documentation from a central piece of information. Within the POM file (in XML format), a software project defines its dependencies, artifacts, plugins, related properties and other configurations. During the build process, all of the required Java libraries and Maven plugins are pulled down from the Maven Central repository [4] into a local user directory for further use.

Table 1. Maven central repository statistics by the end of 2017

The Maven Central repository is the official repository hosted by the Apache Software Foundation; it provides a channel for developers and organizations to upload and publish their products or library components. By the end of 2017, there were approximately 2,435,997 indexed project versions, each of which had components such as meta-data, jar files, source code, JavaDoc and POM files. In this work, a website spider based on the SCRAPY-REDISFootnote 6 framework was coded to crawl all of the POM files located in the Central repository, discarding any POM files that contained errors or were missing core configurations. The Central repository statistics are shown in Table 1.

3.3 Github

Github [8] is currently the most widely used code hosting website in the world. Built on top of the Git version control system, Github provides features such as project management, code review, integrations and social coding. By the end of 2017, there were approximately 24 million developers across 200 countries working on 67 million repositories, most of which host open source projects and software. Github is becoming one of the most important sources for analyzing the propagation of open source software vulnerabilities.

Each open source project whose POM files were downloaded from the Maven Central repository was cross-checked in this work to determine whether it also has a corresponding repository on Github. If so, we crawled all related information, such as issue posts, pull request descriptions, commit descriptions and wikis, for further analysis of keywords related to software vulnerabilities or defects.

4 Software Vulnerability Ontology

As discussed in Sect. 3, software vulnerabilities, open source project information and source code management information are stored in different locations with different representations and formats. It is therefore difficult for analysts to cross-reference comprehensive analytical information as a whole picture. Consequently, it is of practical importance to develop a unified ontological representation that makes it possible to integrate all information across resource boundaries for further knowledge sharing and more in-depth analysis.

Fig. 1.
figure 1

Overall process of software vulnerability tracing analysis

Figure 1 presents a schematic illustration of the overall platform for software vulnerability tracing analysis. There are three core components: data collection, data fusion, and data analysis. In the data collection phase, different data resources, such as CVE vulnerabilities, open source project configurations and properties, as well as library dependencies, are collected and incorporated into the system. Then, by matching different ontologies, information and knowledge are fused, reorganized and aligned through the ontology building and matching process. Finally, we utilize Neo4jFootnote 7, a popular graph database, to store the entire software vulnerability knowledge graph.

Fig. 2.
figure 2

Software vulnerability ontology top view

Figure 2 illustrates the top view of the proposed software vulnerability ontology implemented in our analysis platform. Three subcomponents, shown as the CVE Ontology, Maven Ontology and Github Ontology, respectively, are integrated to fuse information into a whole vulnerability knowledge graph. For each ontology, the detailed concepts and properties, together with the relationships, can also be seen in the figure.

4.1 CVE Ontology

The CVE ontology focuses on vulnerability-centric concepts together with their surrounding relationships, which were extracted from the NVD database schema. For the purpose of tracing software vulnerabilities, this paper places extra emphasis on four concepts: Vulnerability, ProductVersion, Product, and Vendor.

A Vulnerability directly affects a specific ProductVersion or a series of them; a ProductVersion belongs to a certain Product, which in turn belongs to a Vendor. All of these concepts are particularly useful when tracing the spread of software vulnerabilities. The rest of the CVE details, such as CWE and CVSS, are kept as much as possible in the CVE ontology.

4.2 Maven Ontology

With the help of its dependency management mechanism, the Maven ontology has the excellent ability of illustrating dependency relationships between different project versions. Compared with the CVE ontology, the Maven ontology is relatively concise, with three core concepts: ProjectGroup, Project, and ProjectVersion.

A ProjectGroup usually contains a group of Projects developed or owned by the same organization. A Project is dedicated to providing a specific modular library, functionality or tool, and is normally developed iteratively via various ProjectVersion releases. A ProjectVersion contains the groupID, artifactID, project version, parent project and project dependencies; all of this information is extracted from the Maven POM.xml file.
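The extraction of these ProjectVersion fields from a POM file can be sketched with Python's standard XML library. The POM snippet below is a minimal hypothetical example (the dependency on slf4j-api is illustrative), not a real POM from the crawled corpus:

```python
import xml.etree.ElementTree as ET

# Minimal hypothetical POM; real POMs carry many more elements.
POM = """<?xml version="1.0"?>
<project xmlns="http://maven.apache.org/POM/4.0.0">
  <groupId>io.netty</groupId>
  <artifactId>netty</artifactId>
  <version>3.10.1.Final</version>
  <dependencies>
    <dependency>
      <groupId>org.slf4j</groupId>
      <artifactId>slf4j-api</artifactId>
      <version>1.7.5</version>
    </dependency>
  </dependencies>
</project>"""

NS = {"m": "http://maven.apache.org/POM/4.0.0"}

def parse_pom(text):
    """Extract Maven coordinates and declared dependencies from a POM string."""
    root = ET.fromstring(text)
    coords = {tag: root.findtext("m:" + tag, namespaces=NS)
              for tag in ("groupId", "artifactId", "version")}
    coords["dependencies"] = [
        (d.findtext("m:groupId", namespaces=NS),
         d.findtext("m:artifactId", namespaces=NS),
         d.findtext("m:version", namespaces=NS))
        for d in root.findall("m:dependencies/m:dependency", NS)]
    return coords

pom = parse_pom(POM)
```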

4.3 Github Ontology

There are thousands of active developers working on Github on a daily basis, providing massive amounts of valuable information such as issue comments and bug reports. It is helpful to carry out an in-depth analysis of these issues, comments and bug reports for the purpose of refining vulnerability traceability more accurately. Based on the Github schema, core concepts (e.g. projects, commits, contributors, pull requests, issues) were extracted and added to the Github ontology, as shown in Fig. 2.

4.4 Implementation

The ontologies discussed above are implemented using Neo4j, one of the most popular graph database management systems. The different types of ontologies, along with their properties and relationships, are mapped onto the built-in Nodes, Properties and Relationships in Neo4j, which are created using built-in functions and clauses.

Fig. 3.
figure 3

Sample codes to import data into Neo4j

Customized tools were programmed to extract field values from the CVE data feeds, the Maven POM files and the crawled Github MongoDB, and to write them into CSV files; these CSV files are then loaded into Neo4j one by one. Figure 3 shows sample code to import CVE data into the graph database. Additionally, daily execution scripts were coded to check for new data periodically and keep the system up-to-date. After the data had been imported, Cypher, an effective and expressive query language from Neo4j, was used to query the graph data for further analysis. Figure 4 shows a Cypher sample query and the returned graph, including nodes and relationships.
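This CSV-then-load pipeline can be sketched roughly as follows. The column names, file name, node label and property keys are our own illustrative choices (not the paper's actual import script, which is shown in Fig. 3), and the dates are placeholders; the `LOAD CSV` statement would be executed against Neo4j via cypher-shell or a driver session:

```python
import csv, io

# Sketch of the CSV export step; rows and column names are illustrative.
rows = [
    {"cve_id": "CVE-2015-2156", "published": "2015-01-01"},
    {"cve_id": "CVE-2016-4970", "published": "2016-01-01"},
]

buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["cve_id", "published"])
writer.writeheader()
writer.writerows(rows)
csv_text = buf.getvalue()

# A LOAD CSV statement of the kind used in such import scripts
# (hypothetical label and property names).
CYPHER = (
    "LOAD CSV WITH HEADERS FROM 'file:///cve.csv' AS row "
    "MERGE (v:Vulnerability {cveId: row.cve_id}) "
    "SET v.published = row.published"
)
```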

Fig. 4.
figure 4

Cypher query sample to show nodes and relationships

5 Ontology Matching

In this paper, the goal of ontology matching is to match a Project from the Maven ontology with a Project in the Github ontology, and to match a ProductVersion from the CVE ontology with a ProjectVersion in the Maven ontology. Accordingly, two ontology matching approaches, named OM-MG and OM-CM respectively, are described below.

Fig. 5.
figure 5

Illustrative example of OM-MG

5.1 Ontology Matching from Maven to Github (OM-MG)

The OM-MG approach focuses on matching Github Projects with Maven Projects. The main idea is to extract Github links from CVE vulnerability references and then navigate to the corresponding repository on Github. From the repository, all project information is located and used for ontology matching.

The example in Fig. 5 illustrates the implementation of the OM-MG approach, and the detailed steps are described as follows:

  1. CVE-2015-2156 has a reference link: https://github.com/netty/netty/pull/3754, shown as a pull request link.

  2. The pull request information shows that this pull request was merged into version netty:3.10.

  3. The POM.xml file can be located from branch/tag 3.10, from which the groupID, artifactID and projectVersion can be extracted as io.netty, netty and 3.10.7.Final, respectively.

  4. Version 3.10.1 appears in both ProductVersion and ProjectVersion, so ProductVersion:netty:3.10.1 in the CVE Ontology is matched with ProjectVersion:io.netty:netty:3.10.1 in the Maven Ontology.

  5. CVE-2015-2156 is matched with the Github pull request https://github.com/netty/netty/pull/3754, which links to the repository https://github.com/netty/netty/.

  6. Tracing the Github repository https://github.com/netty/netty/, this repository can be matched to CVE-2015-2156, CVE-2016-4970, CVE-2014-3488 and CVE-2014-0193.

  7. With the help of the extra information provided by Github, the Maven Project is matched with the Github Project successfully.

As shown in Fig. 5, starting from a single reference link, valuable information from Github can be extracted through the proposed OM-MG approach. This also demonstrates the importance of the Github ontology.
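The link-normalization at the heart of this process (reducing pull request and issue links to their owner/repo repository key) can be sketched as follows. The regex is our own simplification and the issue URL is a hypothetical example; only the pull request URL comes from the CVE-2015-2156 case above:

```python
import re

# Reduce a Github reference URL to its canonical repository URL.
GITHUB_LINK = re.compile(r"https?://github\.com/([^/\s]+)/([^/\s]+)")

def repository_of(url):
    m = GITHUB_LINK.match(url)
    if not m:
        return None
    owner, repo = m.group(1), m.group(2)
    return f"https://github.com/{owner}/{repo}/"

refs = [
    "https://github.com/netty/netty/pull/3754",
    "https://github.com/netty/netty/issues/2440",  # hypothetical issue link
]
repos = {repository_of(u) for u in refs}  # both collapse to one repository
```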

figure a

5.2 Ontology Matching from CVE to Maven (OM-CM)

The OM-CM approach is dedicated to matching CVE ProductVersions with Maven ProjectVersions. Essentially, this ontology matching can be seen as a typical classification problem, solved by filtering the best-matched results according to a series of different features. Random forests are an ensemble learning method for classification that constructs a multitude of decision trees at training time and outputs the mode of their classes. Their obvious advantages include: (1) fast training speed; (2) highly accurate classification results; (3) avoidance of overfitting. Besides, preliminary results show that they perform well even on an unbalanced dataset.

Table 2. The list of extracted features. In the feature descriptions, the abbreviations are: vd for vendor name, gid for groupID name, pd for product name, aid for artifactID name, pdv for product version number and pjv for project version number.

Data Pre-processing. The majority of Maven projects are open source software, while CVE-affected products come from multiple categories and sources. Given this difference, an initial filtering pass over the entire dataset is required to reduce the candidate dataset size and filter out the data (such as commercial software, hardware, etc.) to which the matching process cannot apply. The detailed data pre-processing algorithm is given in Algorithm 1.

According to the experimental data, this initial pre-processing is highly effective: the product version dataset was reduced from 186,352 to 76,330 records, and the project version dataset from 2,383,247 to 891,202 records.
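Algorithm 1 itself is shown as a figure; a much-simplified Python stand-in, assuming hypothetical record layouts, conveys the idea of discarding CVE products that cannot correspond to any Maven artifact:

```python
# Simplified stand-in for the pre-processing step (not Algorithm 1 itself):
# discard CVE product versions whose product name never appears as a Maven
# artifactID, since they cannot match any Maven project.
def prefilter(product_versions, project_versions):
    artifact_ids = {p["artifactId"].lower() for p in project_versions}
    return [pv for pv in product_versions
            if pv["product"].lower() in artifact_ids]

products = [
    {"product": "netty", "version": "3.10.1"},
    {"product": "windows_10", "version": "1703"},  # commercial OS, filtered out
]
projects = [{"groupId": "io.netty", "artifactId": "netty",
             "version": "3.10.1.Final"}]
kept = prefilter(products, projects)
```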

Feature Extraction. In order to obtain more accurate matching results, the extracted features need to be comprehensive. Based on this consideration, a total of 16 features are included in the OM-CM approach presented in this paper; Table 2 summarizes them. The OM-CM algorithm recognizes and classifies labeled data by utilizing different combinations of these features.
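A hedged sketch of a few such features follows; the exact 16 feature definitions are those of Table 2 and may differ from this simplification, and the input values are illustrative. The abbreviations follow Table 2 (vd, pd, pdv from CVE; gid, aid, pjv from Maven):

```python
from difflib import SequenceMatcher

# Simplified examples of name-similarity and version-agreement features.
def similarity(a, b):
    """Case-insensitive string similarity in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def extract_features(vd, pd, pdv, gid, aid, pjv):
    return [
        similarity(vd, gid),                  # vendor vs groupID similarity
        similarity(pd, aid),                  # product vs artifactID similarity
        1.0 if pdv == pjv else 0.0,           # exact version equality
        1.0 if pjv.startswith(pdv) else 0.0,  # version prefix match
    ]

feats = extract_features("netty_project", "netty", "3.10.1",
                         "io.netty", "netty", "3.10.1.Final")
```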

Table 3. Labeled experimental data sample. mpv and cpv are the dataset list indices of the Maven ProjectVersion and the CVE ProductVersion, respectively; f1 to f16 stand for the feature indices with their values, and the label value indicates a positive or negative match.

Experimental Dataset. From the candidate product version and project version lists generated by the data pre-processing, a total of 20,000 data records were randomly selected for the experimental dataset, of which 10,000 are positive and the remaining 10,000 negative. The dataset was manually labeled by 10 graduate students with relevant expert knowledge, and then split into a training (80%) and a testing (20%) set. An illustrative labeled data sample is given in Table 3.

Fig. 6.
figure 6

An extracted local view of trained decision tree structure sample

Model Training. scikit-learn, an efficient open source Python toolkit for data mining and data analysis, was used to implement the model training. The goal is to create a model that predicts the value of a target variable by learning decision rules inferred from the data features. The class sklearn.ensemble.RandomForestClassifier was used, with all parameters set to their defaults.

Figure 6 illustrates an extracted local view of the trained tree structure generated by the algorithm. It is noticeable that the tree structure was trained properly, with positive and negative data samples gradually distinguished by combinations of different features.
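The training step can be sketched with scikit-learn as follows. The feature rows are tiny hypothetical vectors (the real dataset has 16 features per record); `random_state` is fixed here only for reproducibility of the sketch, while the remaining parameters stay at their defaults as described above:

```python
from sklearn.ensemble import RandomForestClassifier

# Toy feature vectors: [vendor-sim, product-sim, version-equal, version-prefix]
X = [
    [1.0, 1.0, 1.0, 1.0],  # strong name/version agreement -> match
    [0.9, 1.0, 0.0, 1.0],
    [0.1, 0.2, 0.0, 0.0],  # dissimilar names/versions -> non-match
    [0.0, 0.1, 0.0, 0.0],
]
y = [1, 1, 0, 0]

clf = RandomForestClassifier(random_state=0)
clf.fit(X, y)
pred = clf.predict([[0.95, 1.0, 1.0, 1.0], [0.05, 0.1, 0.0, 0.0]])
```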

6 Experimental Results

6.1 Experimental Results for OM-MG Approach

By introducing the Github ontology, via the Github references appearing in CVE records, unique reference links and repositories were extracted and the corresponding project ontologies were matched. Detailed statistics were collected for each step of the OM-MG approach and are presented below.

  • There are a total of 94,035 CVE records, in which 5,017 CVE records contain Github reference links.

  • For those 5,017 CVEs, a total of 19,511 unique Github links were extracted.

  • All of the 19,511 links were grouped into 1,201 distinct repositories.

  • Utilizing the Github language APIFootnote 8, a total of 107 repositories whose primary coding language is "Java" were selected. The remaining repositories are categorized according to their coding languages, e.g. Python, PHP, C++, Ruby, etc.

  • All of these 107 Github repositories were linked with corresponding Maven projects.

  • 5,017 CVEs and 1,201 Github repositories were linked together with three kinds of relationships: one-to-many, one-to-one and many-to-one.

6.2 Experimental Results for OM-CM Approach

Experiments were performed on the dataset of 20,000 manually labeled records, of which 80% were used for training and the remaining 20% for testing. The prediction confusion matrix and the evaluation formulas are shown in Table 4 and in formulas (1) and (2), respectively.

$$\begin{aligned} precision = \frac{TP}{TP + FP} ; recall = \frac{TP}{TP + FN} ; accuracy = \frac{TP + TN}{TP + TN + FP + FN} \end{aligned}$$
(1)
$$\begin{aligned} \textit{ false discovery rate (FDR) }= \frac{FP}{TP + FP} \end{aligned}$$
(2)
Table 4. Basic prediction confusion matrix
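Formulas (1) and (2) translate directly into code. The confusion-matrix counts below are toy numbers for illustration only, not the paper's experimental results:

```python
# Evaluation metrics computed from confusion-matrix counts, as in
# formulas (1) and (2). The counts here are made-up toy numbers.
def metrics(tp, fp, tn, fn):
    return {
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "fdr": fp / (tp + fp),  # false discovery rate
    }

m = metrics(tp=90, fp=10, tn=85, fn=15)
```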

Table 5 summarizes the detailed testing results for the proposed OM-CM approach. As can be seen, the OM-CM approach demonstrates a precision rate as high as 99.95% and an accuracy rate as high as 99.85%.

Table 5. Experimental results of total 4,000 testing data samples

After the completion of model training, the whole candidate dataset, consisting of 891,202 Maven ProjectVersion records and 76,330 CVE ProductVersion records and excluding the training data, was processed by the model. The experiments produced a total of 33,560 positive matches from ProductVersion to ProjectVersion.

From those 33,560 positive records, 500 were randomly selected for manual verification; this was repeated 5 times, so a total of 2,500 records were verified manually. Table 6 shows the detailed verification statistics, which also demonstrate the feasibility and accuracy of the proposed approach.

Table 6. Manual verification for actual matching results, repeated 5 times.
Table 7. False Positive samples extracted from verification results.

Discussion. Three false positive records were extracted and listed in Table 7. The record in row 2, com.facebook.android:facebook-android-sdk:4.0.1 <-> android 4.0.1, was predicted as True by the model even though it is a mismatch: it satisfies two features, since the product name is android and the version numbers are equal. The first reason is that this specific product version has only two usable features, which is insufficient. On the other side, for the ProjectVersion, we could not extract facebook and add it to the feature list either, because of the inconsistent Maven groupID naming logic; doing so would produce even worse matching results.

The other two records, shown in row 5, indicate that since data samples are classified using differently weighted features from top to bottom, local classification deviations can occur at the very bottom levels of the trees. However, the resulting error rate is kept within a very low range, and the final matching results are satisfying and promising.

7 Conclusions

In this paper, we propose a software vulnerability ontology that integrates the CVE, Maven and Github ontologies to enrich vulnerabilities and open source projects with the valuable inner-state information stored in Github. Furthermore, in order to link all three ontologies together into a whole software vulnerability knowledge graph, two ontology matching approaches are proposed to match Maven Projects with Github Projects, and to match CVE ProductVersions with Maven ProjectVersions, respectively.

Experimental results demonstrate that the OM-MG approach is able to link 107 Github repositories with their corresponding Maven Projects, and to link 5,017 CVEs with 1,201 Github repositories. The OM-CM approach links a total of 33,560 pairs of CVE ProductVersions with Maven ProjectVersions, with an average precision rate as high as 99.88%.