Exploring Linked Data with contextual tag clouds

doi:10.1016/j.websem.2013.12.004

Journal of Web Semantics

Volume 24, January 2014, Pages 33-39

Abstract

In this paper we present the contextual tag cloud system: a novel application that helps users explore a large-scale RDF dataset. Unlike the folksonomy tags used in most traditional tag clouds, the tags in our system are ontological terms (classes and properties), and a user can construct a context from a set of tags that defines a subset of instances. In the contextual tag cloud, the font size of each tag then depends on the number of instances that are associated with both that tag and all tags in the context. Each contextual tag cloud serves as a summary of the distribution of relevant data, and by changing the context, the user can quickly gain an understanding of patterns in the data. Furthermore, the user can choose to include RDFS taxonomic and/or domain/range entailment in the calculation of tag sizes, thereby understanding the impact of semantics on the data. We describe how the system can be used as a query building assistant, a data explorer for casual users, or a diagnosis tool for data providers. To resolve the key challenge of how to scale to Linked Data, we combine a scalable preprocessing approach with a specially constructed inverted index, use three approaches to prune unnecessary counts for faster online computation, and design a paging and streaming interface. Together, these techniques enable a responsive system that, in particular, holds a dataset with more than 1.4 billion triples and over 380,000 tags. Via experimentation, we show how much our design choices benefit the responsiveness of our system.

Introduction

We present the contextual tag cloud system as an attempt to address the following questions: How can we help casual users explore the Linked Open Data (LOD) cloud? Can we provide a more detailed summary of linkages beyond the LOD cloud diagram? Can we help data providers find potential errors or missing links in a multi-source dataset of mixed quality? When a user wants to design a SPARQL query for an unfamiliar dataset, they must resolve three basic questions: (1) Syntactic Correctness: “What classes are available?” (2) Semantic Correctness: “Does this class refer to the concept I expect?” (3) Meaningful Results: “Does the dataset hold enough knowledge coded with the vocabulary I choose?” Because a dataset has two aspects, the ontological terms (classes and properties) and the instances, these questions cannot be answered by only viewing the ontology axioms or only inspecting a small sample of instances; a combined view of both aspects is necessary. Furthermore, there are two types of linkages: ontological alignment and owl:sameAs links between instances. The usability of a multi-source RDF dataset is largely affected by erroneous or missing links of both kinds. If we can emphasize the unlikely facts, then data providers will have a tool to help them uncover such problems in the dataset.

Our solution is to use tag clouds to display statistical information about the distribution of instances among various ontological terms. A key feature is that each tag cloud is relative to a type consisting of ontological terms that is dynamically defined by the user. In analogy to traditional Web 2.0 tag cloud systems, an instance is like a web document or photo, but is “tagged” with formal ontological classes, as opposed to folksonomies. Tags are then another name for the categories of instances. We extend the expressiveness and treat classes, properties and inverse properties as tags that are assigned to any instances that use these ontological terms in their triples. The font sizes in the tag cloud reflect the number of matching instances for each tag. We allow the user to change their focus on a specific subset of instances in the dataset by specifying a combination of ontological terms as the context on the fly, and then the resulting contextual tag cloud will resize tags to indicate intersection with this context.
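The counting behind a contextual tag cloud can be illustrated with a minimal Python sketch. It is not the system's implementation: the toy instances, the names INSTANCE_TAGS, build_inverted_index, contextual_counts and font_size, and the logarithmic font scaling are all assumptions made for illustration. Each tag is mapped to the set of instances that carry it, the context is resolved by intersecting those sets, and every remaining tag is sized by how many of its instances fall inside the context.

```python
from math import log

# Hypothetical toy data: each instance is "tagged" with the classes and
# properties that appear in its triples.
INSTANCE_TAGS = {
    "ex:alice": {"foaf:Person", "foaf:name", "foaf:knows"},
    "ex:bob": {"foaf:Person", "foaf:name"},
    "ex:carol": {"foaf:Person", "foaf:knows"},
    "ex:paper1": {"bibo:Document", "dc:title"},
}


def build_inverted_index(instance_tags):
    """Map every tag to the set of instances that carry it."""
    index = {}
    for instance, tags in instance_tags.items():
        for tag in tags:
            index.setdefault(tag, set()).add(instance)
    return index


def contextual_counts(index, context):
    """Count, for each tag outside the context, the instances that carry
    that tag together with every tag in the context."""
    if context:
        # Instances matching the context: intersection of the posting sets.
        matching = set.intersection(*(index.get(t, set()) for t in context))
    else:
        # Empty context: every instance in the dataset matches.
        matching = set().union(*index.values())
    counts = {}
    for tag, instances in index.items():
        if tag in context:
            continue
        n = len(instances & matching)
        if n:  # tags with no matching instance are left out of the cloud
            counts[tag] = n
    return counts


def font_size(count, max_count, smallest=10, largest=32):
    """Assumed logarithmic scaling from a count to a font size in points."""
    if max_count <= 1:
        return largest
    return round(smallest + (largest - smallest) * log(count) / log(max_count))


index = build_inverted_index(INSTANCE_TAGS)
person_cloud = contextual_counts(index, {"foaf:Person"})
```

With the context {foaf:Person}, only foaf:name and foaf:knows remain in the cloud, each with a count of 2; adding foaf:knows to the context shrinks foaf:name to a count of 1.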

With any uncurated dataset, one must maintain a healthy skepticism towards all axioms. Although materialization can lead to many interesting facts, a single erroneous axiom could generate thousands of errors. Rather than attempting to guess which axioms are worthwhile, our system supports multiple levels of inference: at any time, a user can view tag clouds with the same context under different entailment regimes, which helps users understand the dataset better and helps data providers investigate possible errors in the dataset.

Starting from our initial version of the system [1], which used DBpedia data, we add features and load the entire BTC2012 dataset. This complex dataset contains 1.4 billion triples, from which we extract 198.6M unique instances and assign more than 380K tags to these instances. This multi-source, large-scale dataset poses challenges in achieving acceptable runtime performance, affordable preprocessing, and a usable interface design. The rest of the paper is organized as follows: we first formally define the concepts and computation problems, and then showcase some use scenarios along with an introduction to the system's functionality; we then discuss the preprocessing steps, online computation and multi-level inference; after that we provide some experimental results and compare with related work; lastly we conclude.

Basic concepts

Given an RDF dataset, an entailment regime $R$ defines what kind of entailment rules will be applied to the explicit triples. In our implementation, we have two specific sets of rules: $R_{Sub}$ for sub/equivalent class/property entailment (rdfs5, rdfs7, rdfs9, rdfs11); and $R_{DR}$ for property domain/range entailment (rdfs2, rdfs3). We also support the combination of these two sets, leading to four distinct entailment regimes $R \in \{\emptyset, R_{Sub}, R_{DR}, R_{Sub} \cup R_{DR}\}$.
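As an illustration of how the four regimes can differ, the following Python sketch applies the named RDFS rules to a handful of triples until no new triple is derived. It is a naive fixpoint loop under stated assumptions, not the preprocessing used by the system: the triples, the ex: names and the functions rules_sub, rules_dr, closure and types_of are hypothetical, equivalent class/property axioms are not handled, and literals are ignored.

```python
TYPE = "rdf:type"
SUBCLASS = "rdfs:subClassOf"
SUBPROP = "rdfs:subPropertyOf"
DOMAIN = "rdfs:domain"
RANGE = "rdfs:range"


def rules_sub(triples):
    """Sub-class and sub-property rules: rdfs5, rdfs7, rdfs9, rdfs11."""
    new = set()
    for s, p, o in triples:
        for s2, p2, o2 in triples:
            if p == SUBPROP and p2 == SUBPROP and o == s2:    # rdfs5
                new.add((s, SUBPROP, o2))
            if p == SUBPROP and p2 == s:                      # rdfs7
                new.add((s2, o, o2))
            if p == SUBCLASS and p2 == TYPE and o2 == s:      # rdfs9
                new.add((s2, TYPE, o))
            if p == SUBCLASS and p2 == SUBCLASS and o == s2:  # rdfs11
                new.add((s, SUBCLASS, o2))
    return new


def rules_dr(triples):
    """Property domain and range rules: rdfs2, rdfs3."""
    new = set()
    for s, p, o in triples:
        for s2, p2, o2 in triples:
            if p == DOMAIN and p2 == s:  # rdfs2: subject gets the domain class
                new.add((s2, TYPE, o))
            if p == RANGE and p2 == s:   # rdfs3: object gets the range class
                new.add((o2, TYPE, o))
    return new


def closure(triples, rule_sets):
    """Apply the chosen rule sets repeatedly until a fixpoint is reached."""
    closed = set(triples)
    while True:
        derived = set()
        for rules in rule_sets:
            derived |= rules(closed)
        if derived <= closed:
            return closed
        closed |= derived


def types_of(triples, instance):
    """Classes assigned to an instance, i.e. its class tags."""
    return {o for s, p, o in triples if s == instance and p == TYPE}


# Hypothetical explicit triples; note that no rdf:type triple is asserted.
TRIPLES = {
    ("ex:Student", SUBCLASS, "ex:Person"),
    ("ex:Professor", SUBCLASS, "ex:Person"),
    ("ex:advisor", DOMAIN, "ex:Student"),
    ("ex:advisor", RANGE, "ex:Professor"),
    ("ex:alice", "ex:advisor", "ex:bob"),
}

REGIMES = {
    "none": [],
    "sub": [rules_sub],
    "dr": [rules_dr],
    "sub+dr": [rules_sub, rules_dr],
}
alice_types = {name: types_of(closure(TRIPLES, rules), "ex:alice")
               for name, rules in REGIMES.items()}
```

In this toy dataset ex:alice has no class tag without inference or with the sub-class rules alone, gains ex:Student under the domain/range rules, and gains both ex:Student and ex:Person only when the two rule sets are combined, which is why the combined regime is not simply the union of the tags produced by each regime on its own.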

Let $I$ be the

System features and use cases

The initial tag cloud has context $T = \emptyset$ or semantically $T = \text{owl:Thing}$, and the tags in the cloud reflect the absolute sizes of instances related to each tag. We put classes and properties into two separate views, so that users will not treat a property called “author” (which may have domain Publication) as a class name by mistake. To emphasize that difference, we also add an icon with “C” or “P” in front of each tag. If a tag is clicked, it will be added to the current context, and then a new tag

Infrastructure

Our main challenge is to compute $f_{R}(\{t\} \cup T)$ for every tag $t \in \mathcal{T}$ efficiently. There are two ways to approach this problem: (1) ensure efficient calculation of $f_{R}(T)$ for any $T$; and (2) prune unnecessary calls of $f_{R}(\{t\} \cup T)$. Thus we need to correctly structure the repository and develop affordable preprocessing. Our previous experiments [1] showed that an RDBMS with a decomposed storage model [2], [3] is not as efficient as using an Information Retrieval (IR) style index for this specific application purpose,

Experiments

Our system is implemented in Java and we conduct all experiments on a RedHat machine with a 12-core Intel 2.8 GHz processor and 40 GB memory.

We apply our preprocessing approach to all five subsets of the BTC2012 dataset, as well as the full dataset, and plot the time/space for datasets against their numbers of total triples in Fig. 7, which shows the scalability of our preprocessing approach. The disk space is for both the index and the no-inference co-occurrence matrix, and is dominated by the

Related work

Many recent systems for exploring RDF datasets, such as /facet [4], gFacet [5] and BrowseRDF [6], use or extend the faceted browsing idea: a user can construct a selection query by adding constraints, and each newly added constraint updates the interface to display further facet options based on the current selection query results. Our system is similar to faceted browsing systems in the sense that each tag in a contextual tag cloud is a “boolean facet” that can be added to the query.

Conclusion

In this paper we introduce the features and use cases of the contextual tag cloud system, and describe the underlying infrastructure. The contextual tag cloud system is a novel tool that helps both casual users and data providers explore the BTC dataset: by treating classes and properties as tags, we can visualize patterns of co-occurrence and get summaries of the instance data. From the common patterns users can better understand the distribution of data in the KB; and from the rare

Acknowledgments

This project was partially sponsored by the U.S. Army Research Office (W911NF-11-C-0215). The content of the information does not necessarily reflect the position or the policy of the Government, and no official endorsement should be inferred.

References

X. Zhang, J. Heflin, Using tag clouds to quickly discover patterns in linked data sets, in: COLD Workshop,...
Z. Pan, J. Heflin, DLDB: Extending relational databases to support Semantic Web queries, in: SSWS Workshop, 2003, pp....
D.J. Abadi, A. Marcus, S.R. Madden, K. Hollenbach, Scalable Semantic Web data management using vertical partitioning,...
M. Hildebrand, J. van Ossenbruggen, L. Hardman, /facet: a browser for heterogeneous Semantic Web repositories, in:...
P. Heim, J. Ziegler, S. Lohmann, gFacet: a browser for the web of data, in: IMC-SSW Workshop, vol. 417, 2008, pp....
E. Oren, R. Delbru, S. Decker, Extending faceted navigation for RDF data, in: ISWC, 2006, pp....
X. Zhang, D. Song, S. Priya, J. Heflin, Infrastructure for efficient exploration of large scale linked data via...

There are more references available in the full text version of this article.
