Data & Knowledge Engineering

Volume 87, September 2013, Pages 405-424

Automated discovery of multi-faceted ontologies for accurate query answering and future semantic reasoning

https://doi.org/10.1016/j.datak.2013.05.005

Abstract

There has been a surge of interest in the development of probabilistic techniques to discover meaningful data facts across multiple datasets provided by different organizations. The key aim is to approximate the structure and content of the induced data into a concise synopsis in order to extract meaningful data facts. Performing sensible queries across unrelated datasets is a complex task that requires a complete understanding of each contributing database's schema to define the structure of its information. Alternative approaches that use data modeling enterprise tools have been proposed, in order to give users without complex schema knowledge the ability to query databases. Unfortunately, data modeling-based matching is a content-based technique and incurs significant query evaluation costs, due to attribute level pairwise comparisons. We propose a multi-faceted classification technique for performing structural analysis on knowledge domain clusters, using a novel Ontology Guided Data Linkage (OGDL) framework. This framework supports self-organization of contributing databases through the discovery of structural dependencies, by performing multi-level exploitation of ontological domain knowledge relating to tables, attributes and tuples. The framework thus automates the discovery of schema structures across unrelated databases, based on the use of direct and weighted correlations between different ontological concepts, using an h-gram (hash gram) record matching technique for concept clustering and cluster mapping. We demonstrate the feasibility of our OGDL algorithms through a set of accuracy, performance and scalability experimental tests run on real-world datasets, and show that our system runs in polynomial time and performs well in practice. To the best of our knowledge, this is the first attempt to solve data linkage problems using a multi-faceted cluster mapping strategy, and we believe that our approach presents a significant advancement towards accurate query answering and future real-time online semantic reasoning capacity.

Introduction

The process of mining the structures of databases is a significant task, with the aim of acquiring crucial fact-finding information that is not otherwise available, or that would require time-consuming and expensive manual procedures. Accurate integration of internet-based information can provide valuable insights that are useful for evidence-based decision making, especially for crucial events. However, the real-world scenario of internet data integration is hampered by data inconsistencies, large quantities of data of variable quality, and its scattering among a multitude of disparate heterogeneous databases without common schemas. Traditional approaches use similarity scores that compare tuple values from different attributes, and declare a pair a match if the score is above a certain threshold. These approaches perform quite well when comparing similar databases with clean data. However, when dealing with a large amount of variable data, comparison of tuple values alone is not enough [1], [2]. It is necessary to apply domain knowledge when attempting to perform data linkage where there are inconsistencies in the data. The same problem applies to database migrations, and to other data intensive tasks that involve disparate databases without common schemas. Furthermore, the creation of data linkages between heterogeneous databases requires the discovery of all possible primary and foreign key relationships that may exist between different attribute pairs, on a global spectrum.
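
As a concrete illustration of the threshold-based matching described above, the following Python sketch declares two tuple values a match when a normalized similarity score clears a fixed cut-off. The function names and the 0.8 threshold are our own assumptions for illustration, not the implementation of any of the surveyed approaches.

```python
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    """Normalized edit-based similarity score in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def is_match(a: str, b: str, threshold: float = 0.8) -> bool:
    """Declare a match when the similarity score clears the threshold."""
    return similarity(a, b) >= threshold

print(is_match("Australian Dollar", "Australian Dollars"))  # True
print(is_match("Australian Dollar", "Euro"))                # False
```

As the paragraph above notes, such a fixed cut-off works well on clean, similar databases but breaks down on variable data, which motivates the domain-knowledge-driven approach of this paper.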

While conducting our research, we analyzed real-world data collected from a variety of sources, on the assumption that the lack of a common domain indicates a shortage of multi-domain experimentation in this area of research. Findings from this analysis indicate that relational database schemas that are invariant in time hold valuable information in their tables, fields and tuples that aids the identification of semantically similar objects. This supports the derivation of hierarchical relationships up to the instance level in a data space approach [3]. Unfortunately, schema-based techniques lack the capacity for computational linkage using domain information, and are very expensive when applied to unrelated and/or noisy databases.

Ontologies are expected to play a significant role in various application domains in the emerging Semantic Web, linking databases semantically. Furthermore, the ability to efficiently and effectively perform ontology reuse is commonly acknowledged to play a crucial role in the large-scale dissemination of ontologies and ontology-driven technologies [6]. A key step in the integration of databases is the identification of semantic correspondences among ontology attribute pairs [2], [4], [5]. An ontology is defined as a (meta) data schema, providing a controlled vocabulary of concepts, with explicitly defined and machine processable semantics. Ontology matching is the process of finding such correspondences between semantically related entities of different ontologies [2]. Therefore, we primarily focused our research on the identification of semantic data coordination, using ontology matching principles. However, ontology matching at an attribute level can be very expensive and of varying relevance. For instance, a table or an attribute can have multiple ontologies, as shown in Fig. 1(A), which demonstrates ontology correspondences as references between the table attributes of two input schemas. The diagram depicts two input schemas with similar ontologies: on the left is a representation of an ‘online transaction processing’ database with data on the provision of discounts and special offers, in multiple countries and in multiple currencies. On the right is a representation of its ‘data warehouse equivalent’, used to develop various business intelligence (BI) reports and to perform data modeling, as well as a number of other data mining tasks. The dotted arrows in Fig. 1(A) indicate table and attribute matching instances between multiple schemas and multiple ontologies. For instance, the ‘title’ attribute from the ‘dbo.CurrencyInfo’ table is referenced to the ‘name’ attribute in the ‘sales.Currency’ table. Fig. 1(A) further illustrates that the schemas generally overlap each other, and that each schema can also have unique information, not present in any other schema (for example, ‘currencies’ and ‘exchange rates’).

To address these challenges, we consider the problem of discovering ontological instances in domain specific clusters that reveal how different tables, attributes and tuples are organized between and within databases to support information flow. The objective is to develop a schema structure abstraction to provide a more logical information flow view. Euzenat et al. [2] have shown that such ontology based methods can be highly effective when combined with other methodologies. In this paper we consider a new type of data linkage approach, namely, the exploitation of hidden relationships between tables, attributes and tuples towards knowledge discovery at different levels of data abstraction, including the ontology, schema and instance levels. Pairwise matching of attributes between different data source tables is a suitable approach for small databases. However, real-world data collected from enterprise organizations can have hundreds of tables and thousands of columns (see Table 4). Hence, performing pairwise attribute matching can be highly expensive in terms of associated computational costs, which is perhaps the main drawback found in existing data linkage methods, and this is also what restricts their performance in terms of accuracy. In order to reduce the number of pairwise comparisons, we employ a ‘multi-layer’ ontology-based clustering technique, modeling large amounts of input information into high-density clusters at different levels. As some chains-of-relationships have stronger correlation weights than others, we focused our research on the identification of such correspondences between crucial attributes, together with their semantic information flow.
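
A small back-of-the-envelope Python example illustrates why clustering before matching pays off: with n attributes, naive matching costs n(n-1)/2 pairwise comparisons, whereas comparing only within clusters costs the sum of the per-cluster pair counts. The attribute and cluster counts below are made-up numbers for illustration only.

```python
def pairs(n: int) -> int:
    """Number of unordered pairwise comparisons among n items."""
    return n * (n - 1) // 2

n_attributes = 5000            # e.g. thousands of columns across an enterprise schema
cluster_sizes = [50] * 100     # hypothetical: 100 clusters of 50 attributes each

naive = pairs(n_attributes)                       # 12,497,500 comparisons
clustered = sum(pairs(c) for c in cluster_sizes)  # 122,500 comparisons
print(f"naive: {naive:,}  clustered: {clustered:,}")
```

In this contrived setting, within-cluster matching performs roughly two orders of magnitude fewer comparisons than exhaustive pairwise matching.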

Semantic information is used as a data abstraction principle to perform data linkage. The development of a novel system that embodies this approach faces a number of challenges. Our solution to handle these challenges integrates a variety of approaches, by extending existing methods and proposing a new multi-faceted strategy, the Ontology Guided Data Linkage (OGDL) framework.

Small inconsistencies in records can prevent matching between two otherwise identical sets of records. To deal with this problem, the authors have previously presented a novel h-gram (hash gram) [21] technique for probabilistic record matching. The h-gram technique is aimed at reducing the runtime costs of comparing records, and at obtaining probabilistic results in a timely manner. The h-gram matching process extends traditional n-grams by transforming the grams into equivalent numerical representations, thus overcoming the disadvantages of random-assignation hashing systems. It also provides more options for gram scaling and for error threshold tolerance. This is similar to the approach taken by [8], although we do not store hash codes of all the sample data. We reduce the cost associated with record matching by utilizing scale based hashing; increasing matching probability through fine tuning; and by reducing the cost associated with the storage of the most frequent hash codes of matching records. We employed the h-gram technique within our OGDL framework to create and correlate clusters at different levels, thus significantly improving the OGDL framework's performance.
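
The exact scale-based hashing scheme is defined in [21]; the Python sketch below only approximates its spirit, by splitting strings into n-grams, mapping each gram to a bounded numeric code, and comparing the resulting code sets. The gram size, the scale of 1000, and the CRC32 stand-in hash are all our assumptions.

```python
from zlib import crc32

def grams(s: str, n: int = 2) -> list[str]:
    """Split a string into overlapping n-grams."""
    s = s.lower()
    return [s[i:i + n] for i in range(max(len(s) - n + 1, 0))]

def h_grams(s: str, n: int = 2, scale: int = 1000) -> set[int]:
    """Map each gram to a bounded numeric code (a stand-in for scale based hashing)."""
    return {crc32(g.encode()) % scale for g in grams(s, n)}

def h_gram_dissimilarity(a: str, b: str) -> float:
    """Dissimilarity as 1 minus the Jaccard overlap of the hash-code sets."""
    ha, hb = h_grams(a), h_grams(b)
    if not (ha | hb):
        return 0.0
    return 1.0 - len(ha & hb) / len(ha | hb)

print(h_gram_dissimilarity("currency", "currencies"))  # small distance (~0.4)
print(h_gram_dissimilarity("currency", "discount"))    # maximal distance (1.0)
```

Comparing small sets of numeric codes rather than raw strings is what allows record comparison costs to be reduced, since the codes can be bucketed, cached, and reused across comparisons.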

Fig. 1(B) shows the general architecture of the OGDL framework. The framework uses different datasets as input and performs data uncertainty analysis for data cleaning and to organize data into homogeneous strata groups. The strata samples are used to form different cluster levels. The framework then performs cluster stem-and-leaf joins, using a multi-faceted cluster mapping technique. These results are further analyzed to construct hierarchical cluster mapping trees. The ontological structures are summarized as candidate, primary, partial, and foreign key relational data (linkage) relationships. The final results have the potential to be integrated into knowledge based data analysis tools to support sensible queries that discover meaningful and accurate data facts.

Our research findings suggest that single order classification of data does not provide the necessary flexibility to accurately define semantic mappings of variables. For instance, different organizations typically maintain different rules and standards for storage of their business data, and there are instances of such databases being poorly designed, and/or without data models. Platform independent databases that target the global marketplace have also emerged in recent years. The variability of the quality of such data sources leads to the risk that the semantic flow of the data (as per their relationships) is not in a fixed direction. In order to increase the probability of discovering correlated clusters, we applied a ‘multi-faceted ontology-based cluster mapping’ strategy. The overarching objective is to develop the ontological domain information as represented in its tables, attributes and tuples, in multiple facets (arrangements), instead of by a predetermined order. The aim is to capture the flow of meaningful semantic data and to concurrently construct self-expanding hierarchical semantic tree structures, which is crucial for high quality data linkage. We describe methods for constructing three different kinds of representations: sequential; parallel; and mixed facets. A sequential facet aims to classify data based on the ontological findings of table level clusters, followed by attribute level clusters and then tuple level clusters. A parallel facet does not prioritize any sequence order, and equally classifies data based on the chance of finding pairs within table level clusters; within attribute level clusters or within tuple level clusters. A mixed facet classifies data through combined cross referencing at the table, attribute and tuple cluster levels. The obtained results can be easily narrowed down in order to discover candidate keys, primary keys, foreign keys, and partially related keys. These results can potentially be integrated with IBM or Microsoft's Query-by-Example (QBE) tools with the aim to make sensible queries that discover meaningful and accurate (data) facts.
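
To make the three facets concrete, the schematic Python sketch below enumerates candidate cluster pairs between two databases under each arrangement. The cluster representation, level names and pair enumeration are invented placeholders for illustration; the paper's actual facet definitions operate on the ontological clusters built in Sections 6 and 7.

```python
from itertools import product

LEVELS = ["table", "attribute", "tuple"]

def candidate_pairs(a, b, level):
    """All cross-database cluster pairs at a single level."""
    return list(product(a[level], b[level]))

def sequential_facet(a, b):
    """Table-level pairs first, then attribute-level, then tuple-level."""
    return [pair for level in LEVELS for pair in candidate_pairs(a, b, level)]

def parallel_facet(a, b):
    """No priority order: every level is considered with equal standing."""
    return {level: candidate_pairs(a, b, level) for level in LEVELS}

def mixed_facet(a, b):
    """Cross-reference clusters across levels (e.g. tables against attributes)."""
    return [(x, y) for la, lb in product(LEVELS, repeat=2)
            for x, y in product(a[la], b[lb])]

a = {"table": ["T1"], "attribute": ["A1", "A2"], "tuple": ["U1"]}
b = {"table": ["T1'"], "attribute": ["A1'"], "tuple": ["U1'"]}
print(len(sequential_facet(a, b)))  # 4 same-level pairs, in level order
print(len(mixed_facet(a, b)))       # 12 pairs across all level combinations
```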

The first contribution of this paper addresses our approach to data uncertainty. In order to perform a successful data linkage between disparate noisy datasets, the data needs to be organized in a format that supports user-friendly access to different sets and subsets of data. Prior to the data linkage process, the data uncertainty process organizes variable datasets into a uniform representation. The second contribution is the introduction of our OGDL framework. The OGDL framework creates multi-layer clusters within each sample set, based on its ontological essence. The clusters self-expand through the application of a multi-faceted cluster mapping strategy, applied on a global spectrum. The framework results are further drilled down to create schema structures. The resulting schema structures can easily be integrated into existing data mining tools to enhance knowledge discovery. The third contribution presents an extensive evaluation of the OGDL framework as applied to real-world databases in experimental tests for accuracy, performance and scalability analysis.

This paper is organized as follows. In Section 2 we review previous work related to our research. In Section 3 we describe our approach to resolving the research problem by introducing the OGDL framework. We discuss data uncertainty in Section 4, where we present the classifying and sampling techniques that we have applied to research this problem. In Section 5 we introduce our recently proposed h-gram record matching technique. In Sections 6 and 7 we introduce the multi-layer clustering and multi-faceted cluster mapping processes. In Section 8 we show how our approach can be used to find schema structures. In Section 9 we discuss our experimental results and in Section 10 we present our conclusions and recommendations for future work.

Section snippets

Related work

The problem of extracting semantic structures from variable databases can be addressed at different levels of complexity. Pure semantic based extraction, using thesauri based dictionaries, presents one extreme [2], [9]. Problem formulation based on syntactic approaches presents the other extreme. In general, many sophisticated data linkage techniques have been applied, which can be broadly classified into deterministic, probabilistic and modern approaches [2]. In the past, iterative techniques

Ontology Guided Data Linkage (OGDL) framework

We propose a ‘knowledge based’ multi-faceted cluster mapping technique, which aims at extracting probable relationships between correlated data clusters at a structural level. We formally introduce the linkage problem through our proposed OGDL framework and show how our algorithms can be applied to variable databases. Our framework intends to create a feasible method for discovering related information, as part of a bottom-up, system-managed process that allows top-down information extraction

Data uncertainty analysis

When dealing with large volumes of data (numeric; categorical; string based; etc.) obtained from different sources, we are vulnerable to different types of ‘data uncertainties’ such as different formats; Null values; length constraints; typographical errors; and shorthand notations, which may well be one of the biggest obstacles to performing successful data linkage. An important initial step for successful linkage is data cleaning and standardization, as noisy, incomplete and incorrect
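
The snippet above is truncated in this preview. As a minimal illustration of the cleaning and standardization step it describes, the Python sketch below unifies the varied representations of NULL, collapses whitespace, and expands shorthand notations; the null tokens and the shorthand map are illustrative assumptions, not the paper's rule set.

```python
import re

NULL_TOKENS = {"", "null", "n/a", "na", "none", "-"}
SHORTHAND = {"st": "street", "rd": "road", "qld": "queensland"}  # hypothetical map

def standardize(value):
    """Return a cleaned, lower-cased string, or None for any null-like token."""
    if value is None:
        return None
    v = re.sub(r"\s+", " ", str(value)).strip().lower()
    if v in NULL_TOKENS:
        return None                        # unify the many representations of NULL
    return " ".join(SHORTHAND.get(t, t) for t in v.split(" "))

print(standardize("  123  Main St "))  # '123 main street'
print(standardize("N/A"))              # None
```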

Hash grams

In [21], the authors proposed a new probabilistic h-gram (hash gram) approach to improve the performance of record matching, by extending traditional n-grams and by utilizing scale based hashing for string similarity testing. An important aspect of the h-gram technique is that it significantly reduces (when compared to similar approaches) the number of comparisons required for performing duplicate record detection, on a variety of data types and data sizes, and that it allows detailed analysis

Multi-layer ontological cluster formation

Clustering is the task of organizing data into groups (clusters) such that similar or close data objects are put in the same cluster [23]. We define ontological clustering as the process of globally and significantly reducing the number of semantic entities at multiple levels, with the aim of reducing the computational expense of data linkage. We build clusters from the hash grams of related taxonomy definitions with small h-gram dissimilarity distances. This approach provides a good balance of
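
The snippet above is also truncated in this preview. One simple way to realize the grouping it describes is a greedy pass that attaches each term to the first cluster whose leader lies within a small h-gram dissimilarity distance; the sketch below assumes the h_gram_dissimilarity function sketched after Section 5, and both the greedy leader strategy and the 0.6 threshold are our assumptions rather than the paper's algorithm.

```python
def cluster_terms(terms, dist, threshold=0.6):
    """Greedy leader clustering: join the first cluster within the threshold."""
    clusters = []                          # each cluster is [leader, members...]
    for term in terms:
        for cluster in clusters:
            if dist(term, cluster[0]) < threshold:
                cluster.append(term)
                break
        else:
            clusters.append([term])        # no close leader: start a new cluster
    return clusters

terms = ["currency", "currencyinfo", "curr_code", "discount", "specialoffer"]
# cluster_terms(terms, h_gram_dissimilarity) groups 'currency' with 'currencyinfo'
```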

Multi-faceted cluster mapping

The extraction of meaningful data facts relies not only on the discovery of different sets of ontological clusters: hierarchical relationships must also be established between clusters. We introduce the ‘multi-faceted’ cluster mapping strategy in order to capture structural relationships between different ontological clusters in different arrangements, which gives us an advantage in the discovery of hierarchical relationships when compared to existing approaches. This approach

Transforming cluster correlations into schema mapping

Creating schema structures requires the identification of candidate, primary, and foreign key relationships. In the previous section, we performed inter-cluster mapping, using the OGDL framework to perform incremental pairwise comparison between attributes in each table: semantic cluster mappings determine possible candidate keys, but this does not give us information regarding their relationships.

In order to achieve this, we take our approach a step further and expand our framework by
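
The sentence above is truncated in this preview. As one plausible illustration of moving from candidate keys to relationships, the Python sketch below tests inclusion dependencies: an attribute is a foreign-key candidate referencing another when its values are (almost) contained in the other's. This containment test and the 0.95 cut-off are common heuristics we supply for illustration, not necessarily the expansion the authors describe.

```python
def inclusion_ratio(fk_values, pk_values) -> float:
    """Fraction of distinct candidate foreign-key values found in the key column."""
    fk, pk = set(fk_values), set(pk_values)
    return len(fk & pk) / len(fk) if fk else 0.0

def is_fk_candidate(fk_values, pk_values, min_ratio=0.95) -> bool:
    """Tolerate a little noise rather than demanding perfect containment."""
    return inclusion_ratio(fk_values, pk_values) >= min_ratio

orders_currency = ["AUD", "USD", "AUD", "EUR"]           # values in a referencing column
currency_codes = ["AUD", "USD", "EUR", "GBP", "JPY"]     # values in a candidate key column
print(is_fk_candidate(orders_currency, currency_codes))  # True
```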

Experimental evaluation

Many organizations freely share their real-world data (as data files, datasets, data models, etc.) on the internet. These data are mostly in third normal form, but the lack of a gold standard for data integration represents one of the major challenges in evaluating such real-world data collected from multiple domains. To this end, we quantify the benefits of our proposed framework and measure the sensitivity of our framework results using a 10-fold cross validation approach. It is
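
For readers unfamiliar with the evaluation protocol, a generic 10-fold cross-validation loop of the kind referred to above is sketched in Python below; the shuffling, fold construction and evaluate() callback are placeholders, not the paper's experimental harness.

```python
import random

def ten_fold(records, evaluate, k=10, seed=42):
    """Mean score of evaluate(training, held_out) over k disjoint folds."""
    records = list(records)
    random.Random(seed).shuffle(records)
    folds = [records[i::k] for i in range(k)]
    scores = []
    for i in range(k):
        held_out = folds[i]
        training = [r for j, fold in enumerate(folds) if j != i for r in fold]
        scores.append(evaluate(training, held_out))
    return sum(scores) / k
```

Each record is held out exactly once, so the averaged score reflects sensitivity to which portion of the data the framework is tuned on.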

Conclusion and future work

In this paper, we first introduced the ‘data uncertainty’ concept, which necessitates robust cleaning and automatic data categorization prior to running the bulk of the data classification processes. By performing this step first in the OGDL approach, we provided the means for tightly integrating attributes based on ontological domain information, and we introduced a simple unified learning model that can tag frequently occurring clusters. We then presented a practical method for discovering


References (31)

  • E. Simperl, Reusing ontologies on the Semantic Web: a feasibility study, Data & Knowledge Engineering (DKE), Oct 2009.
  • C. Lee, Automated ontology construction for unstructured text documents, Data & Knowledge Engineering (DKE), March 2007.
  • M. Gollapalli et al., Ontology Guided Data Linkage framework for discovering meaningful data facts.
  • J. Euzenat et al., Ontology Matching, 2007.
  • M. Franklin et al., From databases to dataspaces: a new abstraction for information management.
  • S. Fenz, An ontology-based approach for constructing Bayesian networks.
  • W. Wu et al., Discovering topical structures of databases.
  • P. Christen, Automatic record linkage using seeded nearest neighbour and support vector machine classification.
  • H. Koehler et al., Sampling dirty data for matching attributes.
  • I. Bhattacharya et al., Iterative record linkage for cleaning and integration.
  • H. Kim et al., Parallel linkage.
  • Y. Hong et al., Record linkage as DNA sequence alignment problem.
  • M. Gagnon, Ontology-based integration of data sources.
  • A. Bonifati et al., Schema mapping verification: the spicy way.
  • A. Radwan et al., Top-K generation of integrated schemas based on directed and weighted correspondences.


Mohammed A. S. Gollapalli is a Microsoft Certified Technical Specialist currently pursuing his PhD at the University of Queensland. He obtained his Bachelors in IT from Canberra University in 2004 and his Masters in IT from Griffith University in 2005. He started his career in the IT industry in 2005 and has been working for different Government Departments and Multinational Corporations. His research interests are Data Mining, Cloud Computing, Query Processing, eHealth, Data Extraction and Knowledge Visualization. He is a member of MCP and ACM.

    Xue Li is an Associate Professor in the School of Information Technology and Electrical Engineering at the University of Queensland. He graduated in Computer Software from Chongqing University, Chongqing, China in 1982 and obtained the MSc degree in Computer Science from the University of Queensland in 1990 and the PhD degree in Information Systems from Queensland University of Technology in 1997. His major areas of research interests and expertise include Data Mining and Intelligent Web Information Systems. He is a member of ACM, IEEE, and SIGKDD.

    Ian A. Wood received the BSc, BE, and PhD degrees from the University of Queensland, Australia, in 1993, 1994, and 2004, respectively. He is currently a lecturer in statistics at the School of Mathematics and Physics at the University of Queensland, where he conducts research on high-dimensional statistical analysis, stochastic optimization and bioinformatics.
