Automatically generating data linkages using class-based discriminative properties

https://doi.org/10.1016/j.datak.2014.03.001

Abstract

A challenge for Linked Data is to link instances from different data sources that denote the same real-world object. Millions of high-quality owl:sameAs linkages have been generated, but a considerable number of potential linkages remain undiscovered. Traditional similarity-based approaches to this data linkage problem do not scale well, since they exhaustively compare every pair of instances. In this paper, we propose an automatic approach to data linkage generation for Linked Data. Specifically, a highly-accurate training set is automatically generated based on equivalence reasoning and common prefix blocking. After extraction, the contexts of the instances in the training set are matched pairwise in order to learn discriminative property pairs that support linkage discovery. For a particular class pair and pay-level-domain pair, the discriminability of each property pair is measured, and a few property pairs with high discriminability are aggregated so that they can be reused to link instances between the same classes and domains in the future. The experimental results show that our approach achieves accuracy comparable to that of several more complex methods on two OAEI tests and the BTC2011 dataset.

Introduction

The Semantic Web (SW) is an effort led by the W3C Semantic Web Activity, aimed at realizing data integration and sharing among different applications and parties. To date, many prominent ontologies have been developed for data publishing in various domains, which provide common classes and properties that are widely used across data sources.

At the instance level, however, there is a lack of agreement among sources on the use of common URIs to denote a real-world object. Due to the distributed nature of the SW, it frequently happens that multiple instances in diverse sources denote the same object, i.e., refer to an identical thing (such instances are also known as URI aliases [1] or coreferents). Examples exist in the areas of personal profiles, academic publications, media and geographical data, among others.

Data linkage, also referred to as instance matching or object coreference resolution, aims at linking different instances of the same object. It is important to data-centric applications such as heterogeneous data integration and mining systems, SW search engines and browsers. Driven by the Linking Open Data (LOD) initiative, millions of instances have been explicitly linked with owl:sameAs [2], whose semantics defines that all the URIs linked by this property identify the same resource. But compared to the billions of URIs on the SW, a large number of instances that potentially denote the same objects remain uninterlinked. For example, at least 70 instances crawled by the Falcons search engine [3] appear to denote Tim Berners-Lee, the director of W3C, but merely six have been linked with owl:sameAs. An analysis of the LOD cloud also indicates that, out of 31 billion RDF statements, fewer than 500 million represent linkages between data sources, and most sources link to only one other source.

In the SW community, traditional work addresses the data linkage problem mainly from two directions: one is equivalence reasoning in terms of standard OWL semantics, e.g., through owl:sameAs and some “key” properties [4], [5]; the other is similarity computation, with the intuition that instances denote the same object if they have similar property–values [6], [7]. Recent work also uses machine learning and crowdsourcing to cope with complex data linkage tasks [8], [9], [10], [11]. Generally speaking, the reasoning-based methods infer explicit linkages but may miss many potential ones, while the similarity-based methods often suffer from high computational costs, as they exhaustively compare all pairs of instances [12]. Moreover, most of them are unaware of the commonalities behind the abstract types of the instances and their publishers. For example, different data publishers prefer social security number, login name, address or even combinations of these to disambiguate customers, whereas hobby or age is less likely to be used. Learning and reusing such properties would facilitate data linkage in the future.
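
To make the reasoning-based direction concrete, the following sketch computes equivalence classes from explicit owl:sameAs statements with a union-find structure. It is a minimal illustration of the general technique under simplifying assumptions (URIs as plain strings, no IFP reasoning), not the implementation of any system cited above.

```python
# Minimal sketch: the symmetric, transitive closure of owl:sameAs,
# computed with a union-find structure over URI strings.
from collections import defaultdict

class UnionFind:
    def __init__(self):
        self.parent = {}

    def find(self, x):
        self.parent.setdefault(x, x)
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, x, y):
        self.parent[self.find(x)] = self.find(y)

def same_as_closure(same_as_pairs):
    """Group URIs into equivalence classes from explicit owl:sameAs links."""
    uf = UnionFind()
    for u, v in same_as_pairs:
        uf.union(u, v)
    classes = defaultdict(set)
    for u in uf.parent:
        classes[uf.find(u)].add(u)
    return [c for c in classes.values() if len(c) > 1]

# Three statements imply one equivalence class of four URIs.
links = [("ex:timbl", "dbpedia:Tim_Berners-Lee"),
         ("dbpedia:Tim_Berners-Lee", "fb:tim_berners_lee"),
         ("ex:timbl", "w3:timbl")]
print(same_as_closure(links))
```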

In this paper, we propose an automatic approach, called ADL, which differs from current similarity-based methods in that it learns a set of important properties for disambiguating instances (referred to as discriminative properties). The methodological steps of ADL, shown in Fig. 1, are divided into an offline part and an online part:

  • For the offline learning, a highly-accurate training set is automatically established. The training set consists of two sets of instance pairs that hold the linkages or not, referred to as positive and negative examples, respectively. The contexts (i.e., a kind of integrated unit over RDF triples) of the instances in the training set are extracted in terms of RDF sentences [13] and matched pairwise with a lightweight linguistic matcher, V-Doc [14], in order to discover discriminative property pairs, where a discriminative property pair consists of two matchable properties that are discriminative for linking instances. For a specific class pair and pay-level-domain pair, the discriminability of each property pair is measured by information gain, revealing the global and implicit preference of data publishers in characterizing a type of objects.

  • For the online linking, given a new instance as input, the class that it belongs to and its pay-level-domain are first extracted, and then the counterpart classes and pay-level-domains in the training set are chosen. The instances holding the properties in the related discriminative property pairs are retrieved, and their values are matched with those of the input using V-Doc. The similarities from different discriminative property pairs are linearly aggregated with equal weights in order to determine whether to generate an instance linkage (a minimal sketch of this step follows the list).
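
The sketch below illustrates the online aggregation step: a per-property-pair similarity, averaged with equal weights, is compared against a threshold. The bag-of-words cosine similarity stands in for the V-Doc matcher, and the property names, instances and threshold are illustrative assumptions, not the paper's exact configuration.

```python
# Minimal sketch: equal-weight linear aggregation of similarities over
# discriminative property pairs, with a cosine stand-in for V-Doc.
import math
from collections import Counter

def cosine_sim(text_a: str, text_b: str) -> float:
    """Bag-of-words cosine similarity (a crude stand-in for V-Doc)."""
    a, b = Counter(text_a.lower().split()), Counter(text_b.lower().split())
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * \
           math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def link_score(inst_a: dict, inst_b: dict, prop_pairs: list) -> float:
    """Average similarity over the learnt discriminative property pairs."""
    sims = [cosine_sim(inst_a.get(p, ""), inst_b.get(q, ""))
            for p, q in prop_pairs]
    return sum(sims) / len(sims) if sims else 0.0

# Hypothetical instances and learnt pairs for one (class, domain) pair.
a = {"foaf:name": "Tim Berners-Lee", "foaf:mbox": "timbl@w3.org"}
b = {"rdfs:label": "Tim Berners-Lee", "ex:email": "timbl@w3.org"}
pairs = [("foaf:name", "rdfs:label"), ("foaf:mbox", "ex:email")]
if link_score(a, b, pairs) >= 0.8:  # the threshold is an assumption
    print("emit owl:sameAs linkage")
```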

We develop an open source tool and test its accuracy in three cases: the PR and NYT tests of the Ontology Alignment Evaluation Initiative (OAEI), as well as the Billion Triples Challenge (BTC2011) dataset. The experimental results show that, compared with several existing methods, our method achieves good precision and recall with the help of only a few discriminative property pairs. Moreover, the proposed approach is ready to be integrated with other methods; e.g., the discovered discriminative properties can be used for cost-effective candidate selection [12].

This paper is organized as follows. We define the data linkage problem in Section 2 and discuss related work in Section 3. In Sections 4 (Training set generation) and 5 (Discriminative property discovery), we present our approach to learning class-based discriminative property pairs. Evaluation is reported in Section 6. Finally, Section 7 concludes the paper with future work.


Problem statement

Let I be the set of URIs, B be the set of blank nodes and L be the set of literals. A triple ⟨s, p, o⟩ ∈ (I ∪ B) × I × (I ∪ B ∪ L) is called an RDF triple. An RDF graph G is a set of RDF triples, and can be serialized to an RDF document.

For an RDF graph G, a URI u is a class (resp. property) if G entails the RDF triple ⟨u, rdf:type, rdfs:Class⟩ (resp. ⟨u, rdf:type, rdf:Property⟩). If a URI u is neither a class nor a property, then u is treated as an instance, implying the assumption that
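
As an illustration of these definitions, the sketch below represents an RDF graph as a set of triples and applies the class/property/instance test. Entailment is simplified to explicit assertion here, which is an assumption of this sketch rather than part of the formal definition.

```python
# Minimal sketch: triples over (I ∪ B) × I × (I ∪ B ∪ L) as tuples, and the
# class/property/instance test via explicit rdf:type statements.
RDF_TYPE = "rdf:type"
RDFS_CLASS = "rdfs:Class"
RDF_PROPERTY = "rdf:Property"

def is_class(u, graph):
    return (u, RDF_TYPE, RDFS_CLASS) in graph

def is_property(u, graph):
    return (u, RDF_TYPE, RDF_PROPERTY) in graph

def is_instance(u, graph):
    """A URI that is neither a class nor a property is treated as an instance."""
    return not is_class(u, graph) and not is_property(u, graph)

g = {("foaf:Person", RDF_TYPE, RDFS_CLASS),
     ("foaf:name", RDF_TYPE, RDF_PROPERTY),
     ("ex:timbl", RDF_TYPE, "foaf:Person"),
     ("ex:timbl", "foaf:name", "Tim Berners-Lee")}

assert is_class("foaf:Person", g) and is_instance("ex:timbl", g)
```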

Related work

In the SW field, researchers have addressed the data linkage problem mainly from two directions. One is equivalence reasoning. Glaser et al. [4] implemented a Co-reference Resolution Service (CRS) based mainly on owl:sameAs. Hogan et al. [5], [7] conducted large-scale object consolidation based on the analysis of inverse functional properties (IFPs). Saïs et al. [18] designed a language, RDFS+, for reference reconciliation, which combined functional properties (FPs), IFPs and owl:disjointWith in OWL as well as SWRL rules. The KnoFuss system [19] used

Training set generation

Due to the large scale of the SW, it is time-consuming to manually build a training set with both broad coverage and good accuracy. Thanks to the LOD initiative, millions of instances have been explicitly interlinked with owl:sameAs. This inspires us to utilize these equivalence relations to automatically generate a highly-accurate, moderate-scale training set.
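
The following sketch shows one way such a training set can be assembled, under two assumptions: positive examples are instance pairs inside the owl:sameAs equivalence classes, and negative examples pair instances that fall into the same label-prefix block yet belong to different equivalence classes. The block key length and field names are illustrative, not the paper's exact choices.

```python
# Minimal sketch: positives from sameAs equivalence classes, negatives from
# common prefix blocking over instance labels.
from itertools import combinations
from collections import defaultdict

def positives(equiv_classes):
    """All unordered instance pairs inside each equivalence class."""
    return [pair for c in equiv_classes for pair in combinations(sorted(c), 2)]

def negatives(labels, equiv_classes, prefix_len=4):
    """Same-block pairs whose instances are not known to be equivalent."""
    cls_of = {u: i for i, c in enumerate(equiv_classes) for u in c}
    blocks = defaultdict(list)
    for uri, label in labels.items():
        blocks[label[:prefix_len].lower()].append(uri)
    return [(u, v)
            for block in blocks.values()
            for u, v in combinations(sorted(block), 2)
            if cls_of.get(u, u) != cls_of.get(v, v)]

equiv = [{"dbpedia:Tim_Berners-Lee", "fb:tim_berners_lee"}]
labels = {"dbpedia:Tim_Berners-Lee": "Tim Berners-Lee",
          "fb:tim_berners_lee": "Tim Berners-Lee",
          "dbpedia:Tim_Burton": "Tim Burton"}
print(positives(equiv))   # one positive pair
print(negatives(labels, equiv))  # two hard negative pairs from the "tim " block
```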

Discriminative property discovery

Discriminative property pairs are learnt by comparing the contexts of the instances in the training set. In line with many similarity-based approaches, e.g., [7], [20], [26], we assume that the instances constituting a linkage share similar property–values, and that a few properties are more important than others for characterizing the denoted real-world object.
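
To show how information gain can score a property pair, the sketch below splits the labelled training pairs on whether the pair's values match (a similarity threshold stands in for the V-Doc matcher) and measures how much the split reduces the entropy of the positive/negative labels. The threshold and the binary match test are simplifying assumptions of this illustration.

```python
# Minimal sketch: information gain of one property pair over a labelled
# training set of instance pairs (1 = positive, 0 = negative).
import math

def entropy(labels):
    n = len(labels)
    if n == 0:
        return 0.0
    p = sum(labels) / n
    return -sum(q * math.log2(q) for q in (p, 1 - p) if q > 0)

def information_gain(sims, labels, threshold=0.8):
    """IG of splitting the labels on sim >= threshold."""
    matched = [l for s, l in zip(sims, labels) if s >= threshold]
    unmatched = [l for s, l in zip(sims, labels) if s < threshold]
    n = len(labels)
    cond = (len(matched) / n) * entropy(matched) + \
           (len(unmatched) / n) * entropy(unmatched)
    return entropy(labels) - cond

# Similarities of one property pair on four training examples.
sims = [0.95, 0.90, 0.10, 0.85]
labels = [1, 1, 0, 0]
print(round(information_gain(sims, labels), 3))  # 0.311: partially discriminative
```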

Evaluation

We developed an open source tool for the proposed method ADL. In this section, we will report our experimental results on two OAEI tests (namely, PR and NYT) as well as the BTC2011 dataset. All the tests were conducted on a PC with an Intel Core 2 Duo 2.4 GHz CPU, 4 GB memory, Windows 7 and Java 6. The datasets were stored on an IBM x3850 M2 server with two Xeon Quad 2.4 GHz CPUs, 8 GB memory, Red Hat Enterprise Linux Server 5 and MySQL 5.1. The detailed results and source code are available at our

Conclusion

Data linkage is important for establishing semantic interoperability and realizing data integration on the SW. In this paper, we proposed an automatic approach for learning discriminative property pairs to link instances, which characterizes common patterns of instances based on their abstract types and domains. Our main contributions are summarized as follows:

  • We presented an automatic method to build a highly-accurate training set by performing equivalence reasoning and common prefix blocking. The

Acknowledgments

This work is supported in part by the National Natural Science Foundation of China under Grant Nos. 61370019, 61223003 and 61321491, and in part by the Natural Science Foundation of Jiangsu Province under Grant No. BK2011189. We appreciate the students participating in the evaluation. We also thank the anonymous reviewers for their valuable comments.


References (41)

  • R. Isele et al., Learning expressive linkage rules using genetic programming
  • A.-C. Ngonga Ngomo et al., EAGLE: efficient active learning of link specifications using genetic programming
  • Y. Yang et al., Distributed human computation framework for linked data co-reference resolution
  • D. Song et al., Automatically generating data linkages using a domain-independent candidate selection approach
  • Y. Qu et al., Constructing virtual documents for ontology matching
  • G. Tummarello et al., RDFSync: efficient remote synchronization of RDF models
  • J. Urbani et al., OWL reasoning with WebPIE: calculating the closure of 100 billion triples
  • L. Getoor et al., Entity resolution: theory, practice & open challenges
  • F. Saïs et al., L2R: a logical method for reference reconciliation
  • A. Nikolov et al., Refining instance coreferencing results using belief propagation


    Wei Hu is a lecturer at the Department of Computer Science and Technology, Nanjing University. He received his B.Sc. degree in Computer Science and Technology in 2005, and his Ph.D. degree in Computer Software and Theory in 2009 both from the Southeast University. His research interests include Semantic Web, ontology engineering and data integration.

    Rui Yang is a Master's student at the Department of Computer Science and Technology, Nanjing University. He received his B.Sc. degree in Computer Science and Technology from the Nanjing University in 2013. His research interests include Semantic Web, linked data and recommender systems.

    Yuzhong Qu is a full professor and Ph.D. supervisor at the Department of Computer Science and Technology, Nanjing University. He received his B.Sc and M.Sc. degrees in Mathematics from the Fudan University in 1985 and 1988 respectively, and his Ph.D. degree in Computer Software from the Nanjing University in 1995. His research interests include Semantic Web, Web science and novel software technology for the Web. His research is continuously supported by the National Natural Science Foundation of China.
