
1 Introduction

Information pertaining to law enforcement activities is obtained from multiple sources and in a variety of data formats, which must be consolidated into a common data model to facilitate searching and long-term data management. Because manually mapping data sources to a common data model is a tedious task, a handful of mapping design systems have been developed, including InfoSphere Data Architect (derived from Clio [15]), BizTalk Mapper [22], Altova MapForce [21], and Stylus Studio [23]. All of these systems are based on the general methodology first proposed in Clio [15]. Several approaches have been proposed to automate this process. Most of them [9,10,11, 19] focus on semantic labelling, i.e. annotating data fields, or source attributes, with classes and/or properties of the common data model. However, a precise mapping also needs to describe the semantic relations between the source attributes, in addition to their types.

In recent years, several works have addressed the relationship matching problem. Karma [2,3,4, 19, 20] is an information integration tool that enables users to quickly and easily integrate data from a variety of sources, including databases, spreadsheets, JSON, and Web APIs. To use Karma, end-users first import the domain ontologies they want to use for modeling the data. The system then automatically suggests semantic labels for each column of the source data. It then exploits the created semantic labels and the domain ontologies to learn high-quality relationships and, finally, a semantic model for the loaded data source. Karma has been used to model the data of the Smithsonian American Art Museum and to publish it into the Linked Data (LD) cloud. However, Karma has a limitation: it is not effective at disambiguating multiple relationship types between two recognized entity instances when integrating data sources into a knowledge graph based on a semantic model. This requirement is fairly frequent in the Integrated Law Enforcement (ILE) Project of the D2D CRC, because there may exist multiple relationships between a pair of neighboring classes in the common data models used [17]. For example, there are 54 different relationship types between the class Person and the class Location, and 119 relationship types between Person and Person.

In this paper, we extend Karma and present a novel approach that disambiguates different types of relations between the data fields of data sources such as databases and spreadsheets. The main contribution of our approach is a mechanism to distinguish and obtain the correct relationship type (e.g. lives at) between two recognized entity instances (e.g. John Smith and 5 Long Road) of a knowledge graph, even when multiple kinds of relationships exist between the pair of classes (e.g. Person and Location) to which these entity instances correspond in the common data model. This technique helps automate the task of transforming structured data sources into linked data based on the common data model. To our knowledge, no previous work specifically deals with distinguishing relationship types of a knowledge graph in the context of data integration.

This paper is structured as follows. Section 2 presents a motivating example for our work, followed by Sect. 3, which describes our approach. Section 4 gives an evaluation of the approach. Section 5 reviews related work. In Sect. 6 we conclude the paper and discuss future work.

2 Motivating Example

In this section, we explain the problem of learning relationship types between recognized entity instances by giving a concrete example that will be used throughout the paper to illustrate our approach. Figure 1 shows a common data model, where ovals represent classes (e.g. Organization, Person, Location) and rectangles stand for the data attributes of a class (e.g. number, street, postcode and state). We formally define a semantic type as a pair consisting of a domain class and one of its data properties, \( \left\langle {class\_uri, property\_uri} \right\rangle \) [19]. Solid lines denote relationships between classes (e.g. located_in between the class Organization and the class Location), and dashed lines link a class to its corresponding data attributes.

Fig. 1. Sample common data model and the new data source with spreadsheets S1–S6 (Color figure online)

As shown above, \( S1 - S6 \) are spreadsheets of a new data source in this scenario: S1: Person_Address, S2: Organization, S3: Publication, S4: Bank_Account, S5: Bank and S6: Bank_Transaction_Record. We want to match all the data values of the new data source to the common data model shown at the top of Fig. 1.

The first step in mapping the spreadsheets \( S1 - S6 \) shown in Fig. 1 to the common data model is to label their attributes with semantic types. For example, the correct semantic type for the fifth column of \( S1 \) (with data values John Smith, Mary Brown and David Smith) is \( \left\langle {Person, name} \right\rangle \), and for the sixth column (with data values 39, 24 and 34) it is \( \left\langle {Person, age} \right\rangle \). Various techniques can be employed to automate this labeling task [12, 13, 19]. However, a mapping that only includes the types of the attributes is not enough, because it does not reveal how the attributes relate to each other. To build a precise mapping, we need a second step that determines how the semantic labels should be connected to capture the intended meaning of the data. In this work, we assume that the labeling step is already done and focus on distinguishing the relationship types.

Assume that we have already built a knowledge graph based on the common data model and some other data sources, so that the initial knowledge graph includes a substantial amount of semantic content. Now we intend to import the new data source \( S1 - S6 \) into this knowledge graph. As shown in the red-coloured rounded rectangle at the bottom left of Fig. 1, for the new data source \( S1 \), all columns have been annotated with data properties of the class Person and the class Location of the common data model. The attributes name and age in the table are annotated with the data properties Name and Age of the Person entity. Similarly, the attributes holding the address number, street name, post code, and state name are annotated with the data properties Number, Street, Postcode and State of the Location entity.

However, this way of modelling does not always correctly represent the semantics of the new data source at the instance level. Although multiple relationship types exist between the class Person and the class Location in the common data model (shown at the top of Fig. 1), i.e. works at, lives at, rents house at and shops at, the relationship between a person instance and a location instance in the knowledge graph is ambiguous unless it is designated manually or captured explicitly in the new data source. For instance, consider the 3rd tuple of \( S1 \): if the relationship type between David Smith and Smith Street has not yet been designated, we cannot tell what the real relationship between them is from the common data model alone, so the relationship between this person instance and this location instance (e.g. David Smith and Smith Street) remains ambiguous in the knowledge graph.

The question now is whether we can leverage the initial linked data as background knowledge to distinguish the relationship types between attributes of new data sources. The basic idea of our approach is to do exactly that. Once we have identified the semantic types of the source attributes, we can search the linked data and slice it into a bundle of knowledge graph fragments. For example, we obtain four kinds of smaller linked data graphs, each containing a specific relationship type, i.e. lives at, rents house at, works at or shops at, between a Person instance and a Location instance. We then learn from these knowledge graph fragments as examples to infer relationship types for the new data sources.

3 Our Approach

Our approach to automatically distinguishing multiple relationship types between a pair of recognized entity instances rests on graph extraction, graph matching, and machine learning for relationship type classification. The inputs to our approach are a repository of (RDF) linked data in a domain and a data source whose attributes are already labeled with the correct semantic types. The output is an updated knowledge graph in which the missing relationship types are disambiguated.

The overall approach comprises three steps, shown in Fig. 2. Step 1: we slice the knowledge graph into a bundle of graphs (e.g. B1, B2, N1 and N2), each centered on a specified relationship type, and categorize these graphs into groups (e.g. t or n) according to their central relationship. Step 2: we extract the frequent subgraphs of each group of graphs. Step 3: we select part of the frequent subgraphs from Step 2 as a discriminative feature set (e.g. F1, F2 and F3), code a feature matrix and build an appropriate classifier (e.g. a Neural Net or Decision Tree).

Fig. 2. The overall approach

Now suppose we have new data sources containing ambiguous relationships between columns that need to be clarified. First, we import all data values of the new data sources into the existing knowledge graph. Then, as in Step 1, we slice the knowledge graph into a set of graphs, each of which contains an ambiguous relationship to be clarified. The obtained classifier can classify any of these graphs into one of the groups (t or n) from Step 1; clarifying an ambiguous relationship thus amounts to classifying a graph that contains it, based on a pool of linked data drawn from and across multiple data sources.
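Conceptually, clarifying one ambiguous relationship reduces to coding a feature vector for the sliced graph and handing it to the trained classifier. The Python sketch below illustrates only this final step; `dfs`, `classifier` and `is_subgraph` are hypothetical stand-ins for the discriminative feature set, the trained model and the exact subgraph matcher discussed in Sect. 3.2 and 3.3.

```python
def clarify_relationship(ambiguous_graph, dfs, classifier, is_subgraph):
    """Classify a boundary graph whose central relationship is unknown.

    ambiguous_graph: the graph sliced around the ambiguous pair;
    dfs: ordered list of discriminative feature patterns;
    classifier: maps a 0/1 feature vector to a relationship type;
    is_subgraph(f, g): True if pattern f occurs in graph g.
    """
    vector = [1 if is_subgraph(f, ambiguous_graph) else 0 for f in dfs]
    return classifier(vector)
```

In the pipeline, the predicted type (e.g. rents_house_at) would then replace the ambiguous edge in the knowledge graph.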

3.1 Building Boundary Graph

The essence of our approach is to analyze the graph structure around a specific relationship type (e.g. lives at). However, most knowledge graphs are very large, so the knowledge graph needs to be sliced into a set of smaller graphs to serve as training examples. Each of these sliced boundary graphs is centered on a specific relationship type.

A boundary graph is a directed graph built around the relationship \( r \) between a pair of anchor vertices \( x1 \) and \( x2 \), bounded by a given distance from the anchor vertices to the farthest nodes, where:

  • \( x1, x2 \): the anchor vertices of the boundary graph;

  • \( r \): the central relationship between \( x1 \) and \( x2 \) of the boundary graph;

  • \( l \): the maximum distance from \( x1 \) or \( x2 \) to the farthest node of the boundary graph;

  • \( maxDegree \): the maximum degree for each vertex.

We now give the procedure for creating a boundary graph from an RDF repository, which is the output of Karma. First, we discover a central relationship \( r \) and its corresponding anchor points \( x1 \) and \( x2 \), and construct an initial boundary graph. Subsequently, we extend the initial boundary graph with a depth-limited breadth-first search. As such, the size of the extended graph is controlled by the maximum distance from the anchor vertices and the maximum degree of each vertex.
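The slicing procedure can be sketched in Python as a depth-limited breadth-first search over a triple store. This is an illustrative sketch rather than the actual implementation: the triple-store representation and helper names are our own assumptions.

```python
from collections import deque

def build_boundary_graph(triples, x1, x2, r, max_len=2, max_degree=10):
    """Slice a boundary graph out of an RDF-like triple store.

    triples: iterable of (subject, predicate, object) tuples;
    (x1, r, x2) is the central relationship. We expand outward from the
    anchor vertices with a depth-limited breadth-first search, keeping
    at most max_degree edges per vertex and stopping at distance max_len.
    """
    triple_set = set(triples)
    adj = {}                            # undirected adjacency view
    for s, p, o in triple_set:
        adj.setdefault(s, []).append((p, o))
        adj.setdefault(o, []).append((p, s))

    kept = {(x1, r, x2)}                # start from the central edge
    dist = {x1: 0, x2: 0}
    queue = deque([x1, x2])
    while queue:
        v = queue.popleft()
        if dist[v] >= max_len:          # distance bound reached
            continue
        for p, w in adj.get(v, [])[:max_degree]:
            # restore the original (subject, predicate, object) direction
            edge = (v, p, w) if (v, p, w) in triple_set else (w, p, v)
            kept.add(edge)
            if w not in dist:
                dist[w] = dist[v] + 1
                queue.append(w)
    return kept
```

With \( l = 1 \), for example, only the edges incident to the two anchors survive the slicing.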

Figure 3 shows an example of a boundary graph. The anchor points of this boundary graph are an instance of the class Person and an instance of the class Location, with the relationship typed as rents_house_at (green-colored) between them. In this example, the maximum distance from the farthest vertex to the anchor points \( x1 \) or \( x2 \) (blue-colored) is 3.

Fig. 3. Example of a boundary graph with the relationship rents_house_at (Color figure online)

3.2 Extracting Patterns from Boundary Graphs

Although a boundary graph is a fragment sliced from the whole knowledge graph, it contains many graph patterns related to the anchor points and the central relationship. Given a set of boundary graphs with a specified relationship type, we mine the schema-level patterns connecting the instances of the classes. Each pattern is a graph in which the nodes correspond to classes and the links correspond to relations of the common data model.

Formally, given a Boundary Graph Dataset \( BGD = \left\{ {G_{0} ,G_{1} , \ldots ,G_{n} } \right\} \), where each boundary graph \( G_{i} \in BGD \) has the anchor points \( x_{1} \) and \( x_{2} \) and the central relationship \( r \), let \( support\left( g \right) \) denote the number of graphs in \( BGD \) in which \( g \) is a subgraph. The problem of extracting patterns from the set of boundary graphs can then be phrased as finding all subgraphs \( g \) such that \( support\left( g \right) \ge minSup \) (a minimum support threshold). To filter out subgraphs that are too small, we set a minimal edge number and a minimal node number as lower bounds on the size of \( g \). We extract the frequent subgraphs in \( BGD \) using the gSpan algorithm [6]; these frequent subgraphs constitute the graph patterns of \( BGD \).
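To make the support counting concrete, the following simplified Python sketch mines frequent schema-level edge patterns. It is a toy stand-in for gSpan [6], which mines arbitrary connected subgraphs rather than single edges; it is only meant to illustrate the \( support\left( g \right) \ge minSup \) criterion.

```python
from collections import Counter

def frequent_edge_patterns(bgd, min_sup):
    """Return the edge patterns whose support reaches min_sup.

    bgd: Boundary Graph Dataset, a list of graphs, each given as a set
    of schema-level edges (class, relation, class) with instances
    already lifted to their classes. support(g) is the number of
    graphs in bgd that contain the pattern g.
    """
    support = Counter()
    for graph in bgd:
        for edge in set(graph):     # count each pattern once per graph
            support[edge] += 1
    return {e for e, s in support.items() if s >= min_sup}
```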

Figure 4 shows four graph patterns extracted from the boundary graphs with the central relationship typed as rents_house_at (one such boundary graph is shown in Fig. 3). The first subgraph shows that, if a Person rents a house at a certain Location and both are linked to a Medicare Card, then the underlying pattern is that the Person is the owner of this Medicare Card and the Location is the registered location of the same Medicare Card. The second subgraph shows the analogous pattern for a Bank Account: the Person is the owner of the Bank Account and the Location is its registered location. The third subgraph shows that, if a Person rents a house at a certain Location where a Property is located, then there may be another Person who is the owner of that Property. The last subgraph shows that, if a Person rents a house at a certain Location where a Property is located and another Person owns that Property, then the tenant may be related to a Bank Transaction Record that is linked to a Bank Account owned by the property owner.

Fig. 4. Part of the frequent subgraphs of the boundary graph in Fig. 3

3.3 Classifying Boundary Graphs

Suppose there is a set of relationship types \( R = \left\{ {r_{1} ,r_{2} , \ldots ,r_{m} } \right\} \) and a Boundary Graph Dataset \( BGD = \left\{ {G_{1} ,G_{2} , \ldots ,G_{N} } \right\} \left( {m \le N} \right) \), where each boundary graph \( G_{i} \left( {i \le N} \right) \) contains a specified relationship type \( r \left( {r \in R} \right) \). For instance, suppose we have a \( BGD \) that includes 50 boundary graphs with left anchor vertex Person and right anchor vertex Location, and 2 different central relations, i.e. \( r_{1} = rents\_house\_at \) and \( r_{2} = works\_at \).

We pose the problem of distinguishing relationship types as a boundary graph classification task. Given a set of \( N \) training examples of the form \( \left( {x_{1} ,y_{1} } \right), \ldots ,\left( {x_{N} ,y_{N} } \right) \) such that \( x_{i} \) is the feature vector of the \( i^{th} \) example (i.e., example boundary graph \( G_{i} , G_{i} \in BGD \)) and \( y_{i} \) is the label (i.e., central relationship type \( r \), \( r \in R \)), our learning algorithm seeks a function \( g:X \to Y \), where \( X \) is the input space and \( Y \) is the output space.

The features used in our algorithm are graph patterns (see Sect. 3.2) that appear frequently in a set of boundary graphs with a certain relationship type \( r \left( {r \in R} \right) \). Let \( BG_{i} \left( {1 \le i \le m} \right) \) be the group of boundary graphs with central relationship type \( r_{i} \left( {r_{i} \in R} \right) \) (e.g. works_at). We use the method described in Sect. 3.2 to find the frequent subgraph set \( F_{i} \) for \( BG_{i} \); thus each of \( BG_{1} ,BG_{2} , \ldots ,BG_{m} \) has a corresponding frequent feature set \( F_{1} ,F_{2} , \ldots ,F_{m} \). Since we are interested in finding the most Discriminative Feature Set (DFS) for the classification task, we ignore all subgraphs that are common between \( F_{i} \) and \( F_{1} \cup F_{2} \cup \ldots \cup F_{i - 1} \cup F_{i + 1} \cup \ldots \cup F_{m} \). Let \( F_{i}^{{\prime }} = F_{i} - \left( {F_{1} \cup F_{2} \cup \ldots \cup F_{i - 1} \cup F_{i + 1} \cup \ldots \cup F_{m} } \right) \) be the DFS for \( F_{i} \); we thus obtain \( DFS = F_{1}^{{\prime }} \cup F_{2}^{{\prime }} \cup \ldots \cup F_{m}^{{\prime }} \), which we use for classifying all boundary graphs in \( BGD \). Once we have the feature vectors for all boundary graph sets, we train a classification algorithm to discriminate between the relationship types we seek to disambiguate.
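The construction of the DFS is a plain set computation, which a minimal Python sketch makes explicit:

```python
def discriminative_feature_set(feature_sets):
    """Compute DFS = F'_1 | ... | F'_m, where each F'_i drops every
    frequent subgraph of F_i that also occurs for some other
    relationship type.

    feature_sets: list of sets F_1..F_m of hashable graph patterns.
    """
    dfs = set()
    for i, f_i in enumerate(feature_sets):
        # union of all feature sets except F_i
        others = set().union(*(f for j, f in enumerate(feature_sets) if j != i))
        dfs |= f_i - others
    return dfs
```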

After obtaining the DFS, we can compute the feature vector \( G_{X}^{r} \) of a boundary graph \( G^{r} \) using a subgraph matching algorithm [7] to find exact matches. \( G_{X}^{r} \) is a vector of length \( p \left( {p = \left| {DFS} \right|} \right) \), where the \( i^{th} \) entry is 1 if the \( i^{th} \) feature of \( DFS \) is a subgraph of \( G^{r} \), and 0 otherwise.

For instance, let the frequent subgraphs of the boundary graphs \( G_{1} - G_{3} \) with relationship type rents_house_at be \( F_{1} = \left\{ {f_{1} ,f_{2} ,f_{3} ,f_{4} } \right\} \). Similarly, let the frequent subgraphs of the boundary graphs \( G_{4} - G_{6} \) with relationship type works_at be \( F_{2} = \left\{ {f_{5} ,f_{6} ,f_{7} ,f_{8} } \right\} \). Since there is no intersection between \( F_{1} \) and \( F_{2} \), we obtain the discriminative feature set \( DFS = \left\{ {f_{1} ,f_{2} ,f_{3} ,f_{4} ,f_{5} ,f_{6} ,f_{7} ,f_{8} } \right\} \). We compute the feature vector of each graph using a subgraph matching algorithm and finally code the matrix for the training model, as shown in Table 1. A classifier can then be built from this matrix.
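Coding the matrix of Table 1 then amounts to one subgraph test per (graph, feature) pair. In the sketch below, graphs and features are represented as edge sets, and the exact matcher of [7] is replaced by simple set containment, which suffices for illustration:

```python
def code_feature_matrix(graphs, labels, dfs, is_subgraph):
    """Code the 0/1 training matrix: row i is the feature vector of
    boundary graph i, paired with its central relationship type."""
    matrix = [[1 if is_subgraph(f, g) else 0 for f in dfs] for g in graphs]
    return matrix, list(labels)
```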

Table 1. Matrix for training model

Given \( G^{r} \) containing a central relationship \( r \) that we seek to distinguish, we compute the feature vector of \( G^{r} \) using a subgraph matching algorithm based on \( DFS \) and apply the trained classifier to predict the relationship type between the anchors of \( G^{r} \).

4 Evaluation

A comprehensive performance study was conducted on a real-world dataset. We used the YAGO (Yet Another Great Ontology) [5] dataset, a massive semantic knowledge base derived from Wikipedia, WordNet and GeoNames. Currently, YAGO holds more than 10 million entity instances (persons, organizations, cities, etc.) and 99 relationship types, and contains more than 120 million facts about these entities. For instance, a typical YAGO fact is shown below.

$$ {\text{ < Wouter}}\_{\text{Vrancken > }}\,{\text{ < playsFor > }}\,{\text{ < K}}.{\text{V}}.\_{\text{Kortrijk > }} $$

Here, K.V. Kortrijk is a Belgian professional football club, annotated with the entity type Organization. Wouter Vrancken, annotated with the entity type Person, is a former Belgian defensive midfielder in association football. The relationship between the Person and the Organization is playsFor in this case.

Our performance tests show that our method achieves much better accuracy on the YAGO dataset than Karma. Our method also demonstrates good scalability on the YAGO dataset, as it successfully completes relationship matching with 1 K boundary graphs containing over 100 nodes each.

All of our experiments were run on a 2.5 GHz Intel Core i7 PC with 16 GB of main memory, running OS X 10.11.6. We used the gSpan implementation provided by Yan et al. [6] to extract frequent subgraphs, and the Exact Subgraph Matching algorithm library provided by Liu et al. [7] to verify subgraph-graph isomorphism.

We evaluated our approach using multiple relationship types between different entities in the YAGO knowledge graph. Table 2 shows the number of boundary graphs (# BG) we sliced from YAGO for different relationship types. For instance, there are 4 different relationship types between two Person entities, i.e. influences (R1), hasAcademicAdvisor (R2), isMarriedTo (R3) and hasChild (R4); 22,820 boundary graphs were sliced and extracted from YAGO with central relation type influences.

Table 2. Boundary graphs extracted from the YAGO knowledge graph

In our experiments, each group of data sets is described by four parameters: (1) \( \left| N \right| \), the total number of graphs generated; (2) \( \left\{ {R_{1} ,R_{2} , \ldots ,R_{m} } \right\} \), the m different central relationship types the data set involves; (3) \( \left| L \right| \), the maximum distance from the anchor points to the farthest node of a boundary graph; and (4) \( \left| I \right| \), the maximum degree of each node in the graph. We chose these four parameters because they determine the characteristics of the boundary graphs.

4.1 Accuracy Test

We used cross-validation as the strategy for testing the accuracy of our method. For testing the accuracy of Karma’s method, we adopted the following strategy. Suppose we have a tuple of sets of boundary graphs \( (BG_{1} ,BG_{2} , \ldots ,BG_{n} ) \), where each element \( BG_{i} \) is a set of boundary graphs with the central relationship \( R_{i} \). We conduct n rounds of experiments. In the \( i \)th round, we fetch 5 boundary graphs from \( BG_{i} \) as testing graphs; we pretend not to know the central relationship types in these 5 graphs and try to predict them. We then consolidate all remaining boundary graphs from \( \{ BG_{1} , \ldots ,BG_{i - 1} ,BG_{i + 1} , \ldots , BG_{n} \} \) into a merged weighted graph as the training graph. Following the scoring formula in [4], each edge is assigned the weight \( 1 - x/\left( {n + 1} \right) \), where \( n \) is the number of known boundary graphs and \( x \) is the number of graph identifiers the edge is tagged with. Next, we compare the predicted value with the true value for every testing graph. For each testing graph, we set the testing result \( y_{j} \left( {j \le 5} \right) \) to 1 if the predicted value equals the true one, and to 0 otherwise. We average these 5 testing results to obtain the accuracy rate \( Y_{i} \) for the \( i \)th round of the experiment, i.e. \( Y_{i} = \sum\nolimits_{j = 1}^{5} {y_{j} /5} \). We define the final accuracy for the tuple of boundary graphs \( \{ BG_{1} ,BG_{2} , \ldots ,BG_{n} \} \) as:

$$ Accuracy = \sum\nolimits_{i = 1}^{n} {Y_{i} /n} $$
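These two averages, together with Karma’s edge-weighting formula from [4], are simple to compute; the following Python sketch spells them out (the helper names are our own):

```python
def karma_edge_weight(x, n):
    """Weight of a merged-graph edge per the scoring formula in [4]:
    n is the number of known boundary graphs, x the number of graph
    identifiers the edge is tagged with."""
    return 1 - x / (n + 1)

def round_accuracy(predicted, actual):
    """Y_i: fraction of held-out boundary graphs whose predicted
    central relationship equals the true one."""
    hits = sum(1 for p, a in zip(predicted, actual) if p == a)
    return hits / len(actual)

def overall_accuracy(round_accuracies):
    """Final accuracy: mean of Y_1..Y_n over the n rounds."""
    return sum(round_accuracies) / len(round_accuracies)
```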

Let us take the #1 comparative experiment on accuracy for R1–R4 as an example. In this group of experiments, we have 10, 10, 10 and 14 boundary graphs with central relationship types influences (R1), hasAcademicAdvisor (R2), isMarriedTo (R3) and hasChild (R4), respectively, and we perform 4 rounds of experiments. In the first round, 5 boundary graphs are extracted from the 10 boundary graphs with central relation type influences (R1; the 2nd row and 3rd column of Table 3). We pretend not to know the relationship type of these 5 boundary graphs, try to predict it, and then compare the predictions with the true relationship type. The remaining 39 (44 − 5 = 39) boundary graphs serve as the training set. In the second round, we take 5 boundary graphs with hasAcademicAdvisor (R2) as the testing set, and so on for the 3rd and 4th rounds.

Table 3. Comparative experiments on accuracy rate for R1–4

Tables 3, 4, 5 and 6 show the accuracy comparison between our method and the method used in Karma. The column “No.” in these tables refers to the experiment number. We find that the accuracy of our method is 3–4 times better than that of Karma’s method. Karma’s method keeps a stable accuracy rate (25%) when applied to 44, 80, 100, 120, 140 and 400 boundary graphs. These experimental results match our theoretical analysis: no matter how large the generated Steiner tree is, Karma’s algorithm always selects the maximum-frequency edge between two anchor points. In each experiment, we take only one boundary graph at a time as the testing graph, containing one of four specific relationship types, and the merged graph generated from the remaining boundary graphs is essentially constant, so the frequencies of the edges between two anchor points remain unchanged throughout an experiment. That is why Karma’s accuracy rate is 25%.

Table 4. Comparative experiments on accuracy rate for R5–8
Table 5. Comparative experiments on accuracy rate for R9–10
Table 6. Comparative experiments on accuracy rate for R11–12

As we can see in the tables, the accuracy in distinguishing R5–R8 is lower than for the other relationship types. We observed that the graph patterns extracted for R5–R8 are very similar, and therefore the available discriminative feature set is smaller than for the other data sources.

4.2 Scalability Test

We applied our method to 100, 400 and 1000 YAGO boundary graphs \( \left( {\left| L \right| = 2, \left| I \right| = 10} \right) \) and to 1666 graphs \( \left( {\left| L \right| = 2, \left| I \right| = 5} \right) \). Table 7 shows the results of this test. We find that the accuracy rate is over 85% for all groups of data sets.

Table 7. Scalability test using our method and Karma

Columns T1–T4 in Table 7 show the time required for the individual steps. T1 denotes the time taken to obtain the frequent subgraphs; T2 the time consumed computing the discriminative feature set and coding the feature matrix through the subgraph matching algorithm; T3 the time taken to build a Neural Net based on the feature matrix; and T4 the time taken to build a merged graph as background knowledge from the known boundary graphs for Karma. In our experiments, slicing the YAGO knowledge graph into boundary graphs was time-consuming; for instance, it took around 12 h to slice 1000 boundary graphs. However, considering that the training process is offline, the training time of our method is acceptable. Training the classifier itself is relatively fast: it took 29.7 s to code the feature matrix for 1000 boundary graphs, and 88 s to train a Neural Net on the matrix.

5 Related Work

The work presented in this paper relates to two main streams of research, namely relationship matching and disambiguation in conceptual models.

In recent years, there have been some efforts to automatically infer the implicit relationships of tables. In Karma [3, 4], given some sample data from a new source, the authors leverage the knowledge in the domain ontology and the known semantic models to construct a weighted graph that represents the space of plausible semantic models for the new source. They then exploit a Steiner Tree algorithm to compute the top k semantic models containing the disambiguated relationships. Limaye et al. [12] used YAGO to annotate web tables and generate binary relationships using machine learning approaches; however, this approach is limited to the labels and relations defined in the YAGO ontology. Venetis et al. [13] presented a scalable approach to describe the semantics of tables on the Web. To recover the semantics of tables, they leverage a database of class labels and relationships automatically extracted from the Web; they attach a class label to a column if a sufficient number of the values in the column are identified with that label in the database of class labels, and analogously for binary relationships. Although these approaches are very useful for publishing semantic data from tables, they are limited in learning the semantics of relations: both infer only binary relationships between pairs of columns via a simple match of the source node and target node of the relationship. Some other recent work leverages the Linked Open Data (LOD) cloud to capture the semantics of sources. Schaible et al. [14] extracted schema-level patterns (SLPs) from linked data and generated a ranked list of vocabulary terms for reuse in modelling tasks. SLPs are (sts, ps, ots) triples where sts and ots are sets of RDF types and ps is a set of RDF object properties.
For example, the SLP \( \left( {\left\{ {Person, Player} \right\}, \left\{ {knows} \right\}, \left\{ {Person, Coach} \right\}} \right) \) indicates that some instances of \( {\text{Person}} \cap {\text{Player}} \) are connected to some instances of \( {\text{Person}} \cap {\text{Coach}} \) via the object property \( knows \). Taheriyan et al. [2] mine the small graph patterns occurring in the LOD and combine them to build a graph that is used to infer semantic relations. Our work differs from these works in that our relationship matching method distinguishes among many relationship types between two entities at the instance level.

The relationship matching in our work actually deals with a disambiguation problem: disambiguating multiple relationship types between entity instances. There is already a body of work on resolving ambiguity in conceptual models. Mens et al. [8] proposed an inconsistency detection approach that uses graph transformation rules to detect ambiguity in UML class models and state machine diagrams, and then automatically reworks the defects in such models with resolution rules. A prominent instance of ambiguity is the use of homonymous or synonymous words. Pittke et al. [1] proposed a technique that detects and resolves terminological ambiguities in large collections of conceptual models. The challenge of word sense disambiguation is to determine the sense of a word in a given context; supervised machine-learning techniques (e.g. [16]) and clustering approaches (e.g. [18]) are employed to identify contextually similar words. Our idea is analogous to the above work in that it tries to infer the correct meaning from the context of the ambiguous relationship. However, to the best of our knowledge, there has been no work on disambiguating relationship types between entity instances of a conceptual model.

6 Conclusion

We have proposed a novel method to distinguish relationship types between recognized entity instances as an extension of Karma. Automatically distinguishing multiple relationship types between two recognized entity instances is an essential part of building a precise knowledge graph from huge data sources. The core idea of our work is to exploit the small graph patterns occurring in a bundle of boundary graphs with a specific central relationship type, sliced from the existing linked data, to hypothesize the relationship types between recognized entity instances within a new data source. The experimental results on YAGO demonstrate the high accuracy of the approach (>80%) in distinguishing multiple relationship types between recognized entity instances automatically.

There still exist some limitations to our work. On the one hand, we observed that our approach struggles to attain a high accuracy rate on some specific relationship types, for example, wasBornIn and diedIn between two person instances. The reason is that most of the graph patterns related to these two relations are similar, making it hard to obtain a sufficiently discriminative feature set. One direction for future work is to investigate richer features, beyond graph structure alone, to classify boundary graphs more effectively. On the other hand, our current approach assumes that the relationship matching problem is a classification problem; cases with relationship types beyond the already known ones cannot yet be addressed.