Exploring Linked Data with contextual tag clouds
Introduction
We present the contextual tag cloud system as an attempt to address the following questions: How can we help casual users explore the Linked Open Data (LOD) cloud? Can we provide a more detailed summary of linkages than the LOD cloud diagram offers? Can we help data providers find potential errors or missing links in a multi-source dataset of mixed quality? When a user wants to design a SPARQL query for an unfamiliar dataset, they must resolve three basic questions: (1) Syntactic Correctness: “What classes are available?” (2) Semantic Correctness: “Does this class refer to the concept I expect?” (3) Meaningful Results: “Does the dataset hold enough knowledge coded with the vocabulary I choose?” A dataset has two aspects, the ontological terms (classes and properties) and the instances, so these questions cannot be answered by viewing only the ontology axioms or by inspecting only a small sample of instances; a combined view of both aspects is necessary. Furthermore, there are two types of linkage: ontological alignment and owl:sameAs links between instances. The usability of a multi-source RDF dataset is strongly affected by erroneous or missing links of both kinds. If we can highlight unlikely facts, data providers will have a tool to help them uncover such problems in the dataset.
Our solution is to use tag clouds to display statistical information about the distribution of instances among various ontological terms. A key feature is that each tag cloud is relative to a type consisting of ontological terms that is dynamically defined by the user. By analogy with traditional Web 2.0 tag cloud systems, an instance is like a web document or photo, but it is “tagged” with formal ontological classes rather than folksonomy terms. Tags are thus another name for the categories of instances. We extend this expressiveness by treating classes, properties and inverse properties as tags, which are assigned to any instance that uses these ontological terms in its triples. The font sizes in the tag cloud reflect the number of matching instances for each tag. The user can shift their focus to a specific subset of instances in the dataset by specifying, on the fly, a combination of ontological terms as the context; the resulting contextual tag cloud then resizes each tag to indicate the size of its intersection with this context.
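The counting behind a contextual tag cloud can be illustrated with a minimal sketch. It is not the system's implementation: the toy instances, the tag naming scheme (`C:` for classes, `P:` for properties, `^-1` for inverse properties) and the logarithmic font scaling are all assumptions made for illustration.

```python
import math
from collections import Counter

# Hypothetical toy data: each instance maps to the set of tags
# (classes, properties, inverse properties) used in its triples.
INSTANCE_TAGS = {
    "ex:alice": {"C:Person", "P:name", "P:author^-1"},
    "ex:bob": {"C:Person", "P:name"},
    "ex:paper1": {"C:Publication", "P:author", "P:title"},
    "ex:paper2": {"C:Publication", "P:title"},
}


def contextual_counts(instance_tags, context):
    """For every tag outside the context, count the instances that
    carry that tag together with all tags of the context."""
    counts = Counter()
    for tags in instance_tags.values():
        if context <= tags:  # the instance matches the whole context
            counts.update(tags - context)
    return counts


def font_sizes(counts, min_pt=10, max_pt=36):
    """Map counts to font sizes on a log scale (an assumed scaling)."""
    if not counts:
        return {}
    top = math.log(max(counts.values()) + 1)
    return {
        tag: round(min_pt + (max_pt - min_pt) * math.log(c + 1) / top)
        for tag, c in counts.items()
    }


# Empty context: absolute counts. Context {C:Publication}: counts
# restricted to instances that are publications.
initial = contextual_counts(INSTANCE_TAGS, set())
focused = contextual_counts(INSTANCE_TAGS, {"C:Publication"})
sizes = font_sizes(focused)
```

With the empty context every tag is sized by its absolute instance count; adding `C:Publication` to the context leaves only the tags that co-occur with it, and the most frequent of those receives the largest font.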
With any uncurated dataset, one must maintain a healthy skepticism towards all axioms. Although materialization can lead to many interesting facts, a single erroneous axiom could generate thousands of errors. Rather than attempting to guess which axioms are worthwhile, our system supports multiple levels of inference: at any time a user can view tag clouds for the same context under different entailment regimes, which helps users understand the dataset better and helps data providers investigate possible errors in the dataset.
Starting from our initial version of the system [1], which used DBpedia data, we add features and load the entire BTC2012 dataset. This complex dataset contains 1.4 billion triples, from which we extract 198.6M unique instances and assign more than 380K tags to these instances. This multi-source, large-scale dataset poses challenges in achieving acceptable runtime performance, affordable preprocessing, and a usable interface design. The rest of the paper is organized as follows: we first formally define the concepts and computation problems, and then showcase some usage scenarios along with an introduction to the system's functionality; we then discuss the preprocessing steps, online computation and multi-level inference; after that we provide some experimental results and compare our system with related work; lastly, we conclude.
Section snippets
Basic concepts
Given an RDF dataset, an entailment regime defines what kind of entailment rules will be applied to the explicit triples. In our implementation, we have two specific sets of rules: one for sub/equivalent class/property entailment (rdfs5, rdfs7, rdfs9, rdfs11), and one for property domain/range entailment (rdfs2, rdfs3). We also support the combination of these two sets, leading to four distinct entailment regimes.
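To make the four regimes concrete, here is a minimal sketch that applies the named RDFS rules to a tiny triple set until a fixpoint is reached. The regime names, the toy triples and the naive nested-loop evaluation are assumptions for illustration; the sketch also omits equivalent class/property handling and the treatment of literals, and it is not meant to reflect how the system materializes inferences at scale.

```python
SUBCLASS = "rdfs:subClassOf"
SUBPROP = "rdfs:subPropertyOf"
DOMAIN = "rdfs:domain"
RANGE = "rdfs:range"
TYPE = "rdf:type"


def sub_rules(triples):
    """Subclass/subproperty rules: rdfs5, rdfs7, rdfs9, rdfs11."""
    new = set()
    for (s, p, o) in triples:
        for (s2, p2, o2) in triples:
            if p == SUBPROP and p2 == SUBPROP and o == s2:    # rdfs5
                new.add((s, SUBPROP, o2))
            if p == SUBCLASS and p2 == SUBCLASS and o == s2:  # rdfs11
                new.add((s, SUBCLASS, o2))
            if p == SUBPROP and p2 == s:                      # rdfs7
                new.add((s2, o, o2))
            if p == SUBCLASS and p2 == TYPE and o2 == s:      # rdfs9
                new.add((s2, TYPE, o))
    return new


def domain_range_rules(triples):
    """Domain/range rules: rdfs2 (domain) and rdfs3 (range)."""
    new = set()
    for (a, p, x) in triples:
        if p not in (DOMAIN, RANGE):
            continue
        for (u, p2, v) in triples:
            if p2 == a:
                # rdfs2 types the subject, rdfs3 types the object
                new.add((u if p == DOMAIN else v, TYPE, x))
    return new


def closure(triples, rule_sets):
    """Apply the chosen rule sets repeatedly until nothing new appears."""
    result = set(triples)
    while True:
        new = set()
        for rules in rule_sets:
            new |= rules(result)
        if new <= result:
            return result
        result |= new


# Hypothetical names for the four regimes described in the text.
REGIMES = {
    "none": [],
    "sub": [sub_rules],
    "domain_range": [domain_range_rules],
    "both": [sub_rules, domain_range_rules],
}

explicit = {
    ("ex:Professor", SUBCLASS, "ex:Person"),
    ("ex:alice", TYPE, "ex:Professor"),
    ("ex:teaches", DOMAIN, "ex:Teacher"),
    ("ex:bob", "ex:teaches", "ex:course1"),
}
entailed = {name: closure(explicit, rules) for name, rules in REGIMES.items()}
```

Under the `sub` regime `ex:alice` additionally becomes an `ex:Person`; under `domain_range` `ex:bob` becomes an `ex:Teacher`; `both` yields the two inferences together, and `none` leaves the explicit triples unchanged.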
Let be the
System features and use cases
The initial tag cloud has context or semantically , and the tags in the cloud reflect the absolute sizes of instances related to each tag. We put classes and properties into two separate views, so that users will not treat a property called “author” (which may have domain Publication) as a class name by mistake. To emphasize that difference, we also add an icon with “C” or “P” in front of each tag. If a tag is clicked, it will be added to the current context, and then a new tag
Infrastructure
Our main challenge is to compute for efficiently. There are two ways to approach this problem: (1) ensure efficient calculation of for any ; and (2) prune unnecessary calls of . Thus we need to correctly structure the repository and develop affordable preprocessing. Our previous experiments [1] showed that an RDBMS with decomposed storage model [2], [3] is not as efficient as using an Information Retrieval (IR) style index for this specific application purpose,
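The two approaches named above can be sketched as follows, assuming the quantity of interest is the number of instances that carry every tag of a context plus one candidate tag. The inverted index, the pairwise co-occurrence set used for pruning, and all function names are hypothetical illustrations, not the system's actual data structures.

```python
from collections import defaultdict
from itertools import combinations


def build_index(instance_tags):
    """IR-style inverted index: tag -> set of instance ids (posting list)."""
    index = defaultdict(set)
    for inst, tags in instance_tags.items():
        for tag in tags:
            index[tag].add(inst)
    return index


def build_cooccurrence(instance_tags):
    """Set of tag pairs that appear together on at least one instance."""
    pairs = set()
    for tags in instance_tags.values():
        pairs.update(combinations(sorted(tags), 2))
    return pairs


def co_occur(pairs, a, b):
    return a == b or (min(a, b), max(a, b)) in pairs


def count(index, pairs, context, tag):
    """Number of instances carrying every context tag and `tag`."""
    # Pruning: if `tag` never co-occurs with some context tag, the
    # intersection is empty and the posting lists need not be touched.
    if any(not co_occur(pairs, tag, c) for c in context):
        return 0
    # Intersect posting lists, smallest first, and stop early when empty.
    postings = sorted((index.get(t, set()) for t in context | {tag}), key=len)
    result = set(postings[0])
    for posting in postings[1:]:
        result &= posting
        if not result:
            break
    return len(result)


# Hypothetical toy data.
data = {
    1: {"Person", "name"},
    2: {"Person", "name", "author"},
    3: {"Publication", "title"},
}
index = build_index(data)
pairs = build_cooccurrence(data)
```

Intersecting the smallest posting list first keeps intermediate results small, and the pairwise check answers many zero-count requests without any index access.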
Experiments
Our system is implemented in Java and we conduct all experiments on a RedHat machine with a 12-core Intel 2.8 GHz processor and 40 GB memory.
We apply our preprocessing approach to all five subsets of the BTC2012 dataset, as well as the full dataset, and plot the time/space for datasets against their numbers of total triples in Fig. 7, which shows the scalability of our preprocessing approach. The disk space is for both the index and the no-inference co-occurrence matrix, and is dominated by the
Related work
Many recent systems for exploring RDF datasets, such as /facet [4], gFacet [5] and BrowseRDF [6], use or extend the idea of faceted browsing: a user constructs a selection query by adding constraints, and each newly added constraint updates the interface to display further facet options based on the results of the current selection query. Our system is similar to faceted browsing systems in the sense that each tag in a contextual tag cloud is a “boolean facet” that can be added to the query.
Conclusion
In this paper we introduce the features and use cases of the contextual tag cloud system, and describe the underlying infrastructure. The contextual tag cloud system is a novel tool that helps both casual users and data providers explore the BTC dataset: by treating classes and properties as tags, we can visualize patterns of co-occurrence and get summaries of the instance data. From the common patterns users can better understand the distribution of data in the KB; and from the rare
Acknowledgments
This project was partially sponsored by the U.S. Army Research Office (W911NF-11-C-0215). The content of the information does not necessarily reflect the position or the policy of the Government, and no official endorsement should be inferred.
References (14)
- X. Zhang, J. Heflin, Using tag clouds to quickly discover patterns in linked data sets, in: COLD Workshop,...
- Z. Pan, J. Heflin, DLDB: Extending relational databases to support Semantic Web queries, in: SSWS Workshop, 2003, pp....
- D.J. Abadi, A. Marcus, S.R. Madden, K. Hollenbach, Scalable Semantic Web data management using vertical partitioning,...
- M. Hildebrand, J. van Ossenbruggen, L. Hardman, /facet: a browser for heterogeneous Semantic Web repositories, in:...
- P. Heim, J. Ziegler, S. Lohmann, gFacet: a browser for the web of data, in: IMC-SSW Workshop, vol. 417, 2008, pp....
- E. Oren, R. Delbru, S. Decker, Extending faceted navigation for RDF data, in: ISWC, 2006, pp....
- X. Zhang, D. Song, S. Priya, J. Heflin, Infrastructure for efficient exploration of large scale linked data via...