Automatically generating data linkages using class-based discriminative properties
Introduction
The Semantic Web (SW) is an effort by the W3C Semantic Web Activity, with the purpose of realizing data integration and sharing among different applications and parties. As of today, many prominent ontologies have been developed for data publishing in various domains, which suggest common classes and properties widely used across data sources.
At the instance level, however, there is a lack of agreement among sources on using common URIs to denote real-world objects. Due to the distributed nature of the SW, it frequently happens that multiple instances in diverse sources denote the same object, i.e., refer to an identical thing (such instances are also known as URI aliases [1] or coreferents). Examples exist in areas such as personal profiles, academic publications, media and geographical data.
Data linkage, also referred to as instance matching or object coreference resolution, aims at linking different instances that denote the same object. It is important to data-centric applications such as heterogeneous data integration and mining systems, SW search engines and browsers. Driven by the Linking Open Data (LOD) initiative, millions of instances have been explicitly linked with owl:sameAs [2], whose semantics states that all the URIs linked by this property identify the same resource. But compared to the billions of URIs on the SW, a large number of instances that potentially denote the same objects remain uninterlinked. For example, at least 70 instances crawled by the Falcons search engine [3] appear to denote Tim Berners-Lee, the director of W3C, but merely six have been linked with owl:sameAs. An analysis of the LOD cloud also indicates that, out of 31 billion RDF statements, fewer than 500 million represent linkages between data sources, and most sources only link to one another.
In the SW community, traditional work addresses the data linkage problem mainly from two directions: one is equivalence reasoning in terms of standard OWL semantics, e.g., through owl:sameAs and some "key" properties [4], [5]; the other is similarity computation, based on the intuition that instances denote the same object if they have similar property–values [6], [7]. Recent work also uses machine learning and crowdsourcing to cope with complex data linkage tasks [8], [9], [10], [11]. Generally speaking, the reasoning-based methods infer explicit linkages but may miss many potential ones, while the similarity-based ones often suffer from high computational costs because they exhaustively compare all pairs of instances [12]; moreover, most of them are unaware of the commonalities behind the abstract types of the instances and their publishers. For example, different data publishers prefer social security number, login name, address or even their combinations to disambiguate customers, whereas hobby or age is less likely to be used. Learning such properties so they can be reused would facilitate future data linkage.
In this paper, we propose an automatic approach, called ADL, which differs from current similarity-based methods in learning a set of important properties for disambiguating instances (referred to as discriminative properties). The methodological steps of ADL, shown in Fig. 1, can be divided into the offline part and the online part:
- For the offline learning, a highly-accurate training set is automatically established. The training set consists of two sets of instance pairs holding the linkages or not, referred to as positive examples and negative examples, respectively. The contexts (i.e., a kind of integrated units over RDF triples) for the instances in the training set are extracted in terms of RDF sentences [13], and pairwise matched with a lightweight linguistic matcher V-Doc [14], in order to discover discriminative property pairs, where a discriminative property pair consists of two matchable properties discriminative to link instances. For a specific class pair and a pay-level-domain pair, the discriminability of each property pair is measured by information gain, revealing the global and implicit preference of data publishers on characterizing a type of objects.
- For the online linking, given a new instance as input, the class that it belongs to and its pay-level-domain are first extracted, and then the counterpart classes and pay-level-domains in the training set are chosen. The instances holding the properties in the related discriminative property pairs are retrieved, and their values are matched with those of the input using V-Doc. The similarities from different discriminative property pairs are linearly aggregated with equal weighting, in order to determine whether to generate an instance linkage.
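The final linking decision described above can be sketched as an equal-weight average of the per-property-pair similarities compared against a threshold. The similarity scores and the threshold value below are placeholders standing in for V-Doc output and a tuned cutoff, not the paper's exact parameters:

```python
def link_decision(similarities, threshold=0.8):
    """similarities: one similarity score per discriminative property pair
    (e.g., from V-Doc); equal weighting means a plain arithmetic mean.
    The threshold of 0.8 is an illustrative value, not from the paper."""
    if not similarities:
        return False
    score = sum(similarities) / len(similarities)
    return score >= threshold

# Two instances compared on three discriminative property pairs:
print(link_decision([0.9, 0.85, 0.95]))  # True  (mean 0.9 >= 0.8)
print(link_decision([0.9, 0.2, 0.3]))    # False (mean ~0.47 < 0.8)
```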
We develop an open source tool and test its accuracy on three cases: the PR and NYT tests in the Ontology Alignment Evaluation Initiative (OAEI) as well as the Billion Triples Challenge (BTC2011) dataset. The experimental results show that, compared with several existing methods, our method achieves good precision and recall with the help of only a few discriminative property pairs. Moreover, the proposed approach is ready to be integrated with other methods, e.g., the found discriminative properties can be used for cost-effective candidate selection [12].
This paper is organized as follows. We define the data linkage problem in Section 2 and discuss related work in Section 3. In Sections 4 (Training set generation) and 5 (Discriminative property discovery), we present our approach to learning class-based discriminative property pairs. Evaluation is reported in Section 6. Finally, Section 7 concludes the paper with future work.
Problem statement
Let I be the set of URIs, B be the set of blank nodes and L be the set of literals. A triple 〈s, p, o〉 ∈ (I ∪ B) × I × (I ∪ B ∪ L) is called an RDF triple. An RDF graph is a set of RDF triples, and can be serialized to an RDF document.
For an RDF graph G, a URI u is a class (resp. property) if G entails the RDF triple 〈u, rdf:type, rdfs:Class〉 (resp. 〈u, rdf:type, rdf:Property〉). If a URI u is neither a class nor a property, then u is treated as an instance, implying the assumption that
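The definitions above can be illustrated with a minimal sketch, where plain Python tuples stand in for a parsed RDF graph; the `ex:` URIs are hypothetical:

```python
# An RDF graph as a set of <s, p, o> triples, and tests for whether
# a URI acts as a class, a property, or an instance (per the definitions
# above). The example URIs are illustrative only.
RDF_TYPE = "rdf:type"
RDFS_CLASS = "rdfs:Class"
RDF_PROPERTY = "rdf:Property"

graph = {
    ("ex:Person", RDF_TYPE, RDFS_CLASS),    # ex:Person is a class
    ("ex:name", RDF_TYPE, RDF_PROPERTY),    # ex:name is a property
    ("ex:tim", RDF_TYPE, "ex:Person"),      # ex:tim is an instance
    ("ex:tim", "ex:name", "Tim Berners-Lee"),
}

def is_class(g, u):
    return (u, RDF_TYPE, RDFS_CLASS) in g

def is_property(g, u):
    return (u, RDF_TYPE, RDF_PROPERTY) in g

def is_instance(g, u):
    # A URI that is neither a class nor a property is treated as an instance.
    return not is_class(g, u) and not is_property(g, u)

print(is_class(graph, "ex:Person"))  # True
print(is_instance(graph, "ex:tim"))  # True
```

Note that this sketch only checks explicitly asserted typing triples; full entailment would also consider inferred triples.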
Related work
In the SW field, researchers have addressed the data linkage problem mainly from two directions. One is equivalence reasoning: Glaser et al. [4] implemented a Co-reference Resolution Service (CRS) based mainly on owl:sameAs. Hogan et al. [5], [7] conducted large-scale object consolidation through the analysis of IFPs. Saïs et al. [18] designed a language RDFS+ for reference reconciliation, which combines FPs, IFPs and owl:disjointWith in OWL as well as SWRL rules. The KnoFuss system [19] used
Training set generation
Due to the large scale of the SW, it is time-consuming to manually build a training set with both broad coverage and good accuracy. Thanks to the LOD initiative, millions of instances have been interlinked with owl:sameAs explicitly. This enlightens us to utilize these equivalence relations to automatically generate a highly-accurate, moderate-scale training set.
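One way to derive positive examples from existing owl:sameAs statements is to compute the equivalence closure, e.g., with union-find, and pair up URIs within each equivalence class. This is a minimal sketch under that assumption (the URIs and the pairing strategy are illustrative, not the paper's exact algorithm):

```python
from collections import defaultdict
from itertools import combinations

# Group URIs linked by owl:sameAs into equivalence classes with union-find;
# any two URIs in the same class form a positive training example.
def find(parent, x):
    while parent[x] != x:
        parent[x] = parent[parent[x]]  # path compression
        x = parent[x]
    return x

def union(parent, x, y):
    parent.setdefault(x, x)
    parent.setdefault(y, y)
    rx, ry = find(parent, x), find(parent, y)
    if rx != ry:
        parent[ry] = rx

# Hypothetical owl:sameAs statements from three sources.
same_as = [("a:tbl", "b:timbl"), ("b:timbl", "c:berners-lee")]
parent = {}
for u, v in same_as:
    union(parent, u, v)

# owl:sameAs is transitive, so a:tbl and c:berners-lee are coreferent too.
groups = defaultdict(set)
for u in parent:
    groups[find(parent, u)].add(u)
positives = [pair for g in groups.values() for pair in combinations(sorted(g), 2)]
print(positives)
```

Negative examples would then be drawn from pairs known not to be coreferent, which the paper obtains separately.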
Discriminative property discovery
Discriminative property pairs are learnt by comparing the contexts of the instances in the training set. In accordance with many similarity-based approaches, e.g., [7], [20], [26], it is assumed that the instances constituting linkages share similar property–values, and that a few properties are more important than others for characterizing the denoted real-world object.
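As a rough illustration of measuring discriminability by information gain (not the paper's exact formulation), one can ask how much knowing whether two property values match reduces uncertainty about whether the instance pair is a linkage:

```python
import math

def entropy(pos, neg):
    """Shannon entropy of a positive/negative label split."""
    total = pos + neg
    if total == 0:
        return 0.0
    h = 0.0
    for c in (pos, neg):
        if c > 0:
            p = c / total
            h -= p * math.log2(p)
    return h

def information_gain(pairs):
    """pairs: (values_match: bool, is_linkage: bool) observations for one
    candidate property pair over the training set. Returns the reduction
    in label entropy obtained by splitting on the match outcome."""
    pos = sum(1 for _, link in pairs if link)
    neg = len(pairs) - pos
    gain = entropy(pos, neg)
    for m in (True, False):
        sub = [link for match, link in pairs if match == m]
        sp = sum(sub)
        gain -= (len(sub) / len(pairs)) * entropy(sp, len(sub) - sp)
    return gain

# A property pair whose values match exactly when instances are linked
# carries maximal information about the linkage decision; an unrelated
# one carries none. (Synthetic observations for illustration.)
perfect = [(True, True)] * 5 + [(False, False)] * 5
useless = [(True, True), (True, False), (False, True), (False, False)]
print(information_gain(perfect))  # 1.0
print(information_gain(useless))  # 0.0
```

The most discriminative property pairs for a given class pair and pay-level-domain pair would then be those with the highest gain.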
Evaluation
We developed an open source tool for the proposed method ADL. In this section, we will report our experimental results on two OAEI tests (namely, PR and NYT) as well as the BTC2011 dataset. All the tests were conducted on a PC with an Intel Core 2 Duo 2.4 GHz CPU, 4 GB memory, Windows 7 and Java 6. The datasets were stored on an IBM x3850 M2 server with two Xeon Quad 2.4 GHz CPUs, 8 GB memory, Red Hat Enterprise Linux Server 5 and MySQL 5.1. The detailed results and source code are available at our
Conclusion
Data linkage is important for establishing semantic interoperability and realizing data integration on the SW. In this paper, we proposed an automatic approach for learning discriminative property pairs to link instances, which characterizes common patterns of instances based on their abstract types and domains. Our main contributions are summarized as follows:
- We presented an automatic method to build a highly-accurate training set by performing equivalence reasoning and common prefix blocking. The
Acknowledgments
This work is supported in part by the National Natural Science Foundation of China under Grant Nos. 61370019, 61223003 and 61321491, and in part by the Natural Science Foundation of Jiangsu Province under Grant No. BK2011189. We appreciate the students participating in the evaluation. We also thank the anonymous reviewers for their valuable comments.
Wei Hu is a lecturer at the Department of Computer Science and Technology, Nanjing University. He received his B.Sc. degree in Computer Science and Technology in 2005, and his Ph.D. degree in Computer Software and Theory in 2009 both from the Southeast University. His research interests include Semantic Web, ontology engineering and data integration.
References (41)
- et al., Scalable and distributed methods for entity matching, consolidation and disambiguation over linked data corpora, J. Web Semant. (2012)
- et al., Matching large ontologies: a divide-and-conquer approach, Data Knowl. Eng. (2008)
- et al., Ontology matching with semantic verification, J. Web Semant. (2009)
- et al., Architecture of the World Wide Web (2004)
- et al., SameAs networks and beyond: analyzing deployment status and implications of owl:sameAs in linked data
- et al., Searching linked objects with Falcons: approach, implementation and evaluation, Int. J. Semant. Web Inf. Syst. (2009)
- et al., Managing co-reference on the semantic web
- et al., Performing object consolidation on the semantic web data
- et al., Knowledge-driven Multimedia Information Extraction and Ontology Evolution, Ch. Ontology and Instance Matching (2011)
- et al., ZenCrowd: leveraging probabilistic reasoning and crowdsourcing techniques for large-scale entity linking
- Learning expressive linkage rules using genetic programming
- EAGLE: efficient active learning of link specifications using genetic programming
- Distributed human computation framework for linked data co-reference resolution
- Automatically generating data linkages using a domain-independent candidate selection approach
- Constructing virtual documents for ontology matching
- RDFSync: efficient remote synchronization of RDF models
- OWL reasoning with WebPIE: calculating the closure of 100 billion triples
- Entity resolution: theory, practice & open challenges
- L2R: a logical method for reference reconciliation
- Refining instance coreferencing results using belief propagation
Rui Yang is a Master's student at the Department of Computer Science and Technology, Nanjing University. He received his B.Sc. degree in Computer Science and Technology from the Nanjing University in 2013. His research interests include Semantic Web, linked data and recommender systems.
Yuzhong Qu is a full professor and Ph.D. supervisor at the Department of Computer Science and Technology, Nanjing University. He received his B.Sc and M.Sc. degrees in Mathematics from the Fudan University in 1985 and 1988 respectively, and his Ph.D. degree in Computer Software from the Nanjing University in 1995. His research interests include Semantic Web, Web science and novel software technology for the Web. His research is continuously supported by the National Natural Science Foundation of China.