Metallic materials ontology population from LOD based on conditional random field
Introduction
With the development of Linked Open Data (LOD) [1,2], domain ontologies are being built rapidly [3,4] in a variety of ways, and the number of ontologies in various fields is growing quickly. Relatively mature domain ontologies have been established and applied in the fields of environment [5,6,7], chemistry [8,9,10] and biomedicine [11,12,13]. In addition, with the continuous development of industrial technology, vast amounts of metallic materials data have accumulated. Metallic materials are substances (or mixtures of substances), such as steels and alloys, that are indispensable to daily life and form the basis of industry. Corresponding ontologies exist in the field of metallic materials, such as MOA (the metallic materials ontology created by Ashino) [14] and STSM [15]. Although the schemata of these ontologies are relatively complete, their instances still need to be added gradually [16,17]. Users, however, expect a domain ontology to have not only a relatively complete schema but also rich instances, so enriching the instances is necessary. Meanwhile, LOD data sets such as DBpedia [18,19], Wikipedia and Yago [20,21] contain a large number of triples, and the metallic materials knowledge they cover can be used to populate the domain ontology. However, there are differences between the LOD and the domain ontology. We therefore propose populating a specific metallic materials ontology with the metallic materials data in the LOD.
In the semantic web [22,23], data items are linked to one another rather than existing in isolation. In data integration, when data is integrated into existing structured data, it is therefore necessary not only to specify the data type explicitly but also to know exactly where the data should be placed. Likewise, for domain ontology population, we need to understand the domain of the data to be integrated and, more crucially, to obtain its insertion location. In existing ontology population methods, the filling position is mostly obtained manually. In this paper, we present an approach to populate a specific ontology with the metallic materials data from LOD. LOD contains more than one type of data, including concepts and properties, and its volume is large, so populating a specific ontology from LOD with the existing methods is arduous. Hence, we design a population strategy that uses a machine learning algorithm to obtain the filling positions of the knowledge to be inserted into a specific ontology.
In summary, this paper uses a machine learning algorithm to populate an ontology with the metallic materials data in LOD. First, we determine the data in LOD that needs to be inserted as an instance and obtain its related data. We use a CHT (Chain Triple) to describe this structured data, that is, the population data that can be filled into the ontology together with its related data extracted from LOD; the detailed definition is given in Section 3. Then, we obtain the filling position in the ontology using the conditional random field (CRF) algorithm. Finally, the data is inserted into the ontology. In the experiments, we insert the metallic materials data in DBpedia and Yago into existing metallic materials ontologies, such as STSM and MOA. The experimental results show that our method obtains high accuracy and F-measure, and that it still achieves a high F-measure when the target materials ontology is changed. Moreover, obtaining the filling position of a CHT takes a relatively short time.
The contributions of our work can be summarized as follows:
- (1) Existing approaches to ontology population usually focus on analyzing natural language text and often neglect other, more appropriate sources of information, such as the structured and semantically rich data sets of LOD. In contrast, this paper proposes using LOD to populate a specific metallic materials ontology.
- (2) When LOD is inserted into a specific ontology, we identify the types of data to be inserted and determine the data that needs to be filled into the ontology. To obtain the filling position, we transform the LOD into a set of CHTs according to the determined filling data, and we specify the format of the CHT, which contains classes, an instance and properties. The instance and properties of a CHT are the data populated into the ontology, and its classes are the information used to judge the filling position. In this way, the filling position can be determined from the corresponding CHT alone instead of from the whole LOD.
- (3) In our approach, the filling positions of the instance and properties in a CHT are obtained using the CRF algorithm. This avoids manually compiling statistics and designing rules for the data to be inserted into the specific ontology, and it speeds up ontology population. In addition, we design a generation strategy that combines the specific ontology and the CHTs and transforms them into the input data set. Users can apply this strategy to generate an input data set that the CRF algorithm can process directly.
- (4) We evaluate our approach using precision, recall and F-measure, and the experimental results are satisfactory. Furthermore, the F-measure increases steadily as the scale of the data sets increases. We also conduct experiments using different LOD data sets and existing metallic materials ontologies, and the results are acceptable.
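To make the CHT described in contribution (2) more concrete, the following is a minimal Python sketch of a container holding an instance, its classes and its properties, built from a list of (subject, predicate, object) triples. The exact CHT definition is given in Section 3 and is not reproduced here, so the field layout, the `build_cht` helper and the `ex:` identifiers are hypothetical and serve only as an illustration.

```python
from dataclasses import dataclass, field

RDF_TYPE = "rdf:type"  # prefixed name used by the toy triples below


@dataclass
class CHT:
    """Hypothetical container for one chain triple (CHT)."""
    instance: str                                   # node to insert as an instance
    classes: list = field(default_factory=list)     # context for judging the filling position
    properties: dict = field(default_factory=dict)  # property -> list of values to insert


def build_cht(instance, triples):
    """Collect the classes and properties of `instance` from (s, p, o) triples."""
    cht = CHT(instance=instance)
    for s, p, o in triples:
        if s != instance:
            continue  # triple belongs to another node
        if p == RDF_TYPE:
            cht.classes.append(o)
        else:
            cht.properties.setdefault(p, []).append(o)
    return cht


# Toy triples; the identifiers are illustrative and not taken from DBpedia or Yago.
triples = [
    ("ex:StainlessSteel", "rdf:type", "ex:Steel"),
    ("ex:StainlessSteel", "rdf:type", "ex:Alloy"),
    ("ex:StainlessSteel", "rdfs:label", "Stainless steel"),
    ("ex:Bronze", "rdf:type", "ex:Alloy"),
]
cht = build_cht("ex:StainlessSteel", triples)
print(cht)
```

Keeping the classes separate from the instance and properties mirrors the division of roles described above: only the latter two are inserted, while the classes provide the evidence for where to insert them.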
The rest of the paper is organized as follows. Section 2 discusses related work. Section 3 describes the problem and defines the concepts. Section 4 introduces the approach and its process. Section 5 describes the implementation in detail. Section 6 presents and discusses the experimental evaluation. Finally, Section 7 provides the conclusion and future work.
Section snippets
Related work
In a domain ontology, the classes usually constitute the knowledge framework of the whole ontology. Most research on existing domain ontologies focuses on the construction and relevance of this class framework. However, ontology users want not only a sound schema but also a large number of instances in the domain ontology. Therefore, a growing body of research addresses the population of instances in domain ontologies.
At present, the data
Problem description
Most existing metallic materials ontologies have satisfactory schemata, but their instance knowledge still needs to be extended. For example, STSM [15] is a metallic materials ontology that was developed for the integration of heterogeneous materials data and covers the basic knowledge of metallic materials. STSM contains some basic concepts, e.g. Element, Property, Steel and Unit, which are mainly used to represent the knowledge related to metallic materials. Element is
Approach overview
In this paper, we propose an approach to populate STSM with the metallic materials data in the LOD, in which the filling positions are obtained using the CRF algorithm. Fig. 3 illustrates the process of filling the LOD into the ontology. The steps are as follows.
Step 1. Getting the CHTs from the LOD. First, we determine the node in the LOD that needs to be inserted into the ontology. Then, we use the node to obtain its related information in the LOD, which contains properties and other
Methodology
In this section, we describe in detail our proposed method for inserting the metallic materials data in the LOD into STSM [15]. In this method, the filling positions are obtained using the CRF algorithm.
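As an illustration of how a linear-chain CRF can assign filling positions, the sketch below decodes the highest-scoring label sequence with the Viterbi algorithm. It is not the implementation used in this paper: the observation strings, the label set (STSM concept names plus an "O" label for elements that are not inserted) and the hand-set weights are all assumptions, whereas a trained CRF would learn the weights from the generated input data set.

```python
def viterbi(obs, labels, emit, trans):
    """Return the highest-scoring label sequence for a linear-chain CRF.

    emit[(label, observation)] and trans[(prev_label, label)] are
    log-potentials; missing entries score 0.0.
    """
    # best[i][y] = best score of any labelling of obs[:i + 1] that ends in y
    best = [{y: emit.get((y, obs[0]), 0.0) for y in labels}]
    back = []  # back[i][y] = best previous label when position i + 1 is y
    for x in obs[1:]:
        scores, pointers = {}, {}
        for y in labels:
            prev = max(labels, key=lambda q: best[-1][q] + trans.get((q, y), 0.0))
            scores[y] = best[-1][prev] + trans.get((prev, y), 0.0) + emit.get((y, x), 0.0)
            pointers[y] = prev
        best.append(scores)
        back.append(pointers)
    # Trace back from the best final label.
    path = [max(labels, key=lambda y: best[-1][y])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]


# Hypothetical labels and hand-set weights, for illustration only.
labels = ["Steel", "Property", "O"]
obs = ["class:Alloy", "instance", "prop:density"]  # elements of one toy CHT
emit = {
    ("O", "class:Alloy"): 1.0,         # classes are context, not inserted
    ("Steel", "instance"): 2.0,        # the instance fits under Steel
    ("Property", "prop:density"): 2.0,
}
trans = {("O", "Steel"): 0.5, ("Steel", "Property"): 0.5}

positions = viterbi(obs, labels, emit, trans)
print(positions)  # ['O', 'Steel', 'Property']
```

Decoding the whole sequence at once, rather than labelling each element independently, lets the class context of a CHT influence the positions chosen for its instance and properties.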
Experiment environment and performance metrics
All experiments are run on JDK 1.7 on a machine with an Intel i7 CPU and 12 GB of memory, running the 64-bit version of Windows 7.
We use precision, recall, F-measure and time performance to evaluate our approach.
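As a minimal sketch, the three accuracy metrics can be computed from counts of CHTs as follows. Precision follows the definition given in this section (correct results among all identification results); the recall denominator and the balanced F-measure use the standard definitions, which is an assumption because their equations are not reproduced here. The function names and the counts in the example are illustrative and are not results from the paper.

```python
def precision(correct, incorrect):
    """Share of identified filling positions that are correct."""
    identified = correct + incorrect
    return correct / identified if identified else 0.0


def recall(correct, expected):
    """Share of the CHTs that should be placed which were placed correctly
    (standard definition, assumed here)."""
    return correct / expected if expected else 0.0


def f_measure(p, r):
    """Balanced F-measure: harmonic mean of precision and recall."""
    return 2 * p * r / (p + r) if (p + r) else 0.0


# Illustrative counts only.
p = precision(correct=90, incorrect=10)
r = recall(correct=90, expected=120)
f = f_measure(p, r)
print(f"P={p:.3f} R={r:.3f} F={f:.3f}")
```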
As shown in Eq. (2), precision (P) is the proportion of correct identification results among all identification results, where |CFP| denotes the number of CHTs that get the correct filling position, and |NCFP| is the number of CHTs that get an incorrect filling
Conclusion and future work
In this paper, to continuously enrich the instance knowledge of existing metallic materials ontologies and provide users with relatively rich domain knowledge, we have proposed an approach to populate an existing metallic materials ontology with the metallic materials data in LOD. First and foremost, the LOD is huge and complex, and there are differences between the LOD and the domain ontology. Thus, we insert the selected specific information into the
Acknowledgments
This work is supported by the National Natural Science Foundation of China [No. 51271033, 71271076]; the Natural Science Foundation of Hebei Province [No. F2018208116, F2013208107]; the Hebei Science and Technology Support Program [No. 16210312D]; and the Natural Science Foundation of Hebei Education Department [No. QN2015207].
References (56)
- et al., Eco-informatics modeling via semantic inference, Inf. Syst. (2013)
- et al., Ontology-based supply chain decision support for steel manufacturers in China, Expert Syst. Appl. (2013)
- et al., YAGO2: A spatially and temporally enhanced knowledge base from Wikipedia, Artif. Intell. (2013)
- et al., Semantic Web in data mining and knowledge discovery: a comprehensive survey, Web Semant. Sci. Serv. Agents World Wide Web (2016)
- et al., Substructure counting graph kernels for machine learning from RDF data, Web Semant. Sci. Serv. Agents World Wide Web (2015)
- et al., MMKG: an approach to generate metallic materials knowledge graph based on DBpedia and Wikipedia, Comput. Phys. Commun. (2017)
- et al., MMOY: towards deriving a metallic materials ontology from Yago, Adv. Eng. Inf. (2016)
- et al., Opinion mining based on fuzzy domain ontology and Support Vector Machine: a proposal to automate online review classification, Appl. Soft Comput. (2016)
- et al., Supervised classification using probabilistic decision graphs, Comput. Stat. Data Anal. (2009)
- et al., Ontology-based sequence labelling for automated information extraction for supporting bridge data analytics, Procedia Eng. (2016)
- Linked data: the story so far, Int. J. Semant. Web Inf. Syst.
- Linked data evolving the web into a global data space, Mol. Ecol.
- Understanding, building and using ontologies, Int. J. Hum. Comput. Stud.
- Formal ontologies and information systems, FOIS’98 Conference
- The plant structure ontology, a unified vocabulary of anatomy and morphology of a flowering plant, Plant Physiol.
- The environment ontology: contextualising biological and biomedical entities, J. Biomed. Semant.
- The chemical information ontology: provenance and disambiguation for chemical data on the biological semantic web, PLoS One
- A semantic web ontology for small molecules and their biological targets, J. Chem. Inf. Model.
- Model tool to describe chemical structures in XML format utilizing structural fragments and chemical ontology, J. Chem. Inf. Model.
- Disease Ontology: a backbone for disease semantic integration, Nucleic Acids Res.
- Gene Ontology: tool for the unification of biology, Can. Inst. Food Sci. Technol. J.
- BiNGO: a Cytoscape plugin to assess overrepresentation of Gene Ontology categories in Biological Networks, Bioinformatics
- Materials ontology an infrastructure for exchanging materials information and knowledge, Data Sci. J.
- STSM: an infrastructure for unifying steel knowledge and discovering new knowledge, Int. J. Database Theory Appl.
- DBpedia-A multilingual cross-domain knowledge base
- DBpedia-A large-scale, multilingual knowledge base extracted from Wikipedia, Semantic Web
- Inside YAGO2s: a transparent information extraction architecture