
Information Systems

Volume 39, January 2014, Pages 80-107

Integrating domain heterogeneous data sources using decomposition aggregation queries

https://doi.org/10.1016/j.is.2013.06.003

Highlights

  • We introduce the decomposition aggregation query (DAQ) to handle domain heterogeneity.

  • We develop query rewriting algorithms for DAQs.

  • We process DAQs in two phases and optimize the query processing.

Abstract

The decomposition aggregation query (DAQ) introduced in this paper extends semantic integration queries by allowing query translation to produce aggregate queries, based on the DAQ's novel three-role structure. We describe the application of DAQs in integrating domain heterogeneous data sources, the new semantics of DAQ answers, and the query translation algorithm called “aggregation rewriting”.

A central problem in optimizing DAQ processing is determining the data sources to which the DAQ is translated. Our source selection algorithm has cover-finding and partitioning steps, which are optimized to (1) lower the processing overhead while speeding up query answering and (2) eliminate duplicates with minimal overhead. We establish connections between source selection optimizations and classic NP-hard optimization problems and resolve the optimization problems with efficient solvers. We empirically study both the DAQ query translation and the source selection algorithms using real-world and synthetic data sets; the results show that the query translation algorithms scale well in both the size of the aggregations and the number of data sources, and that the source selection algorithms save a substantial amount of computational resources.

Introduction

Many applications require querying multiple databases with heterogeneous schemas. We refer to all systems that translate queries over databases and thus support querying multiple, heterogeneous, independently maintained databases as semantic integration systems. Many semantic integration architectures exist, including data integration (e.g., [26], [27]), peer data management systems (PDMSs, e.g., [4], [28]), and dataspaces [20].

Existing semantic integration approaches generally focus on cases where the schemas are heterogeneous, but the entities in the sources are from the same domain — the data sources are domain homogeneous. For example, querying independently maintained bibliography databases is a widely used example for integrating domain homogeneous data sources. Although bibliographic records may have different schemas in different data sources, the schemas represent the same kind of objects. Semantic integration must also manage domain heterogeneity. For example, the 50+ Amazon public data sets [3] are categorized into 8 domains and the 7144 databases on Freebase [21] belong to 86 domains. Integrating domain heterogeneous schemas is more challenging than integrating domain homogeneous schemas since it requires transforming attributes of objects in one domain to those in another domain. One major difficulty is that entities in one domain may not have direct correspondences in another domain. Although existing semantic integration systems support simple transformations such as concatenating first name and last name in one schema to form a full name in another schema, integrating domain heterogeneous data requires additional support. As shown in the following example, motivated by a real-world disaster management project (JIIRP [36]), the new demands require developing new techniques to manage associations between domain heterogeneous entities and answer cross-domain queries.

Example 1.1 Cell–building heterogeneity

Consider planning responses to disasters. The planners' domain abstracts infrastructure elements as “cells,” which are logical units that each perform a single function (e.g., a hospital complex), while the engineers model seismic damage on “buildings.”

Traditional semantic integration systems motivate their work with both domain homogeneous and domain heterogeneous examples. However, the mappings used by existing semantic integration systems typically do not handle domain heterogeneous sources. For example, traditional mappings (e.g., [49], [58]) cannot translate queries between the cells and buildings in Example 1.1 since there are no common objects relating them. The mappings in [39], [40] consider objects from different domains; however, the system still requires common attributes to relate data records (e.g., a gene record and a research paper about the gene are linked by the gene's id). Deeper semantic relationships between objects, e.g., those that require aggregation, as in the following example, are not managed or used for integration.

Example 1.2 shows a simplified scenario of how the relationships between heterogeneous objects in Example 1.1 can be expressed using aggregation:

Example 1.2 Aggregation between cells and buildings

Table 1 shows a snapshot of the cells (in the planners' domain) and building damage assessment (in the engineers' domain). A cell's damage (Cell.damage) is estimated by averaging the damage to the constituent buildings (avg (BdnDmg.Damage)). Monetary loss (Cell.loss) is estimated by summing the buildings' losses (sum(BdnDmg.Loss)).

Lacking systematic integration, the Cell table is manually populated by aggregating records in the BdnDmg table for each cell (e.g., C1 and C2). This is challenging because those calculating the losses are unfamiliar with seismic damage assessment. Additionally, the laborious process of populating a cell discourages users from defining new cells or updating cell attributes. Our goal is to systematically transform queries on cells into aggregate queries over building seismic assessments (i.e., to automatically compute the “unknowns” in the Cell table in Table 1). 
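
To make the intended computation concrete, here is a minimal Python sketch of the aggregation a DAQ would automate. The in-memory tables standing in for the cell membership data and the BdnDmg assessments are hypothetical, and the sketch only illustrates the semantics of Example 1.2, not the paper's rewriting algorithm.

    # Hypothetical stand-ins for the data in Example 1.2.
    # cell_members: which buildings constitute each cell (held by the decomposition source).
    cell_members = {"C1": ["B1", "B2"], "C2": ["B3"]}
    # bdn_dmg: per-building assessment, building id -> (damage, loss).
    bdn_dmg = {"B1": (0.2, 1.5e6), "B2": (0.4, 0.5e6), "B3": (0.1, 2.0e6)}

    def fill_cell(cell_id):
        """Cell.damage = avg(BdnDmg.Damage), Cell.loss = sum(BdnDmg.Loss) over the cell's buildings."""
        rows = [bdn_dmg[b] for b in cell_members[cell_id] if b in bdn_dmg]
        damage = sum(d for d, _ in rows) / len(rows)  # average damage of constituent buildings
        loss = sum(l for _, l in rows)                # total loss of constituent buildings
        return damage, loss

    print(fill_cell("C1"))  # approximately (0.3, 2000000.0)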

Computing the damage and loss attributes in Example 1.2 requires translating queries on Cells into aggregate queries on BdnDmg data. One difficulty is that if the users do not know the data sources beforehand, it is impossible to pre-determine the aggregate queries associated with the Cells.

This problem motivates the focus of this work: how can we bridge the gap between domain heterogeneous data sources by automatically transforming queries over complex compounds into queries over the components that form them?

Answering domain heterogeneous aggregate queries is challenging for three reasons: (1) domain heterogeneity prevents users from manually translating queries due to their limited domain knowledge; (2) users may not know (or even desire to know) that their query requires aggregating data from multiple sources; and (3) the relevant data, with potential duplicates and varying answers, is spread across multiple databases, each holding a different subset, and must be collected from all of them. These challenges require supporting fully automatic query translation at the system level.

In addition to our running example, there are many other cases requiring solution to domain heterogeneous aggregate queries. For example:

  • Estimating the cost to build a room. The cost of a room is estimated by decomposing the room into its constituent parts (e.g., windows and beams) and then aggregating their costs from providers' databases. Lawrence et al. [42] focused on creating and maintaining the mappings necessary to coordinate updates, not on how to choose which part of the aggregation comes from which source, or what to do with conflicting values.

  • Quickly aggregating data in order to detect fraud or to improve performance analysis. This was studied in the context of streams [23], but that work did not focus on the fact that the data may come from multiple sources (e.g., identity theft can be better traced by combining information from bank accounts, credit cards, medical records, and criminal records). In this case, the exact figures do not matter; what matters is quickly finding abnormalities in the overall trend.

  • In market analysis, data is pulled from a number of domain heterogeneous sources in order to find potential candidates to target with advertising. For example, one might look for the total purchases of customers who are between 18 and 30 and have a household income of between $30,000 and $60,000 [43]. As in our running scenario, the goal is to look for answers that are viable, but since the different sources may have different values, it may not always be possible to determine one “true” answer.

  • Wireless sensor networks often need to quickly make security decisions and provide environmental information [18] based on aggregating the data involved, even though the sensors are likely to return conflicting values.

We abstract such integration needs using what we term a “Decomposition Aggregate Query (DAQ).” While we formally define DAQs in Section 3, we informally define and motivate the key features now. The main feature of a DAQ that differentiates it from queries asked over a semantic integration system is that a DAQ is asked over a different level of granularity than a standard semantic integration query. For example, calculating the unknown values in the Cells in Domain A of Table 1 requires aggregating values from the buildings in Domain B in Table 1. Hence a DAQ considers two main types of objects: (1) compounds, which are the high-level entities that the aggregate query is asked over (e.g., “Cell”), and (2) components, which are the sub-elements needed for the aggregation (e.g., “Buildings”).

Breaking DAQs down into the components that the queried compounds consist of and aggregating them requires a change in the roles of the nodes in a semantic integration system. Traditional semantic integration systems consider that nodes have one of two roles: a Query Node (q-node) asks the queries and Answer Nodes (a-nodes) provide the answers. In both traditional semantic integration systems and in our case, different nodes may fill different roles for different queries; one node may fill several roles, even in the same query (e.g., a q-node may provide some answers for its own query, thus also filling the role of one of the a-nodes). In a DAQ, however, having only two roles is insufficient, since a DAQ cannot be answered without breaking down the compounds into components. Hence we introduce a novel 3-role structure, which adds a third type of node: a Decomposition Node (d-node), which breaks the compounds into components, i.e., the sub-elements needed for the aggregation (e.g., “BdnGIS”). Thus, the 3-role structure bridges the gaps between domains.

Fig. 1 shows a simple 3-role structure for our running example. I is the q-node, which issues the query over a set of compounds, e.g., “Cell.” J is the d-node, which decomposes the initial query into a set of queries over components, and K1 and K2 are the a-nodes which process the decomposed queries over components. As noted above, in a real semantic integration system, the roles may well be intertwined, with one physical node filling one or more of the different roles described here.

The 3-role structure is identified by examining the mappings between data sources. No additional work is needed to “set up” the 3-role structures. The cell, building, and damage assessment databases in Fig. 1 are automatically identified to form a 3-role structure, as we show later in the planning step (Section 4.2) of DAQ answering.

This paper defines decomposition aggregation queries (DAQs) over this 3-role structure and shows how to process DAQs using aggregation rewritings. Processing DAQs has several features that distinguish it from existing semantic integration query answering: (1) instead of only querying objects of equivalent granularity, queries over compounds are transformed into queries over their constituent components; (2) instead of processing open queries for “tuples satisfying some condition,” DAQs are closed: they solve for a pre-determined set of compounds, so in addition to sending translated queries during query answering, data sets are also sent among data sources; (3) decomposing compounds on the 3-role structure relies on data sources that do not directly answer the query, whereas previously only data sources that directly answered the query were useful; and (4) when multiple data sources for the components are present, choosing the right sources to use becomes important, and the distribution of answers from different combinations of data sources is also of interest.

DAQs are answered as aggregate queries. When multiple data sources are available to compute an aggregation, the resulting answer depends on which sources are used to provide inputs to the aggregation. Hence, there are a whole range of viable answers to DAQs depending on which values come from which sources. This causes many challenges to query optimization, which we briefly describe here and discuss in further detail in Section 3.2.

We design DAQ query optimization as a two-phase operation consisting of a source selection operation (phase 1) and a distribution estimation operation (phase 2). In phase 1, the goal of query processing is to return one scalar value per compound, computed by aggregating the compound's components. As there are multiple ways to combine data sources to obtain such scalar answers, the phase 1 optimization aims to find source combinations that incur the least system overhead. Therefore, phase 1 optimization is also known as “source selection optimization.” Details of phase 1 optimizations are covered in Section 6, where the following challenges are addressed (an illustrative sketch of the cover-finding step follows the list):

  • 1.

    The number of data source combinations can be exponentially large.

  • 2.

    The optimization problems are intrinsically NP-hard.

  • 3.

    The exact set of components that can be aggregated on a data source is unknown before DAQ processing.
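
One way to read the cover-finding step is as a weighted set cover over the components: each a-node holds some subset of the required components and has an access cost, and we want a cheap combination of a-nodes that covers all the components. The sketch below applies the standard greedy heuristic for weighted set cover (cf. the greedy heuristic of Chvatal in the reference list) under that reading; the function name, cost model, and data are hypothetical, and the paper's actual source selection algorithm may differ.

    def greedy_source_cover(needed, sources):
        """Greedy weighted set cover: pick sources until all needed components are covered.

        needed  -- set of component ids (e.g., building ids) that must be aggregated.
        sources -- dict: source name -> (set of components it holds, access cost).
        Returns a list of chosen sources, or raises if coverage is impossible.
        """
        uncovered, chosen = set(needed), []
        while uncovered:
            # Pick the source with the best cost per newly covered component.
            best = min(
                (s for s in sources if s not in chosen and sources[s][0] & uncovered),
                key=lambda s: sources[s][1] / len(sources[s][0] & uncovered),
                default=None,
            )
            if best is None:
                raise ValueError("remaining components cannot be covered: %s" % uncovered)
            chosen.append(best)
            uncovered -= sources[best][0]
        return chosen

    # Hypothetical example: two a-nodes overlapping on building B2.
    sources = {"K1": ({"B1", "B2"}, 1.0), "K2": ({"B2", "B3"}, 1.0)}
    print(greedy_source_cover({"B1", "B2", "B3"}, sources))  # ['K1', 'K2']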

As a complement to phase 1 optimization, which finds a scalar answer for an incoming DAQ, phase 2 optimization provides a comprehensive view of the distribution of scalar answers computable from source combinations. The large number of viable answers is again the biggest challenge to the scalability of query processing. Section 7 briefly describes phase 2 optimization; a more comprehensive solution, including a full theoretical analysis, can be found in [66].
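
To see why the space of viable answers grows so quickly, consider the brute-force enumeration below (hypothetical data and function; phase 2 in the paper and in [66] relies on estimation rather than enumeration). For each component reported by several sources, every choice of providing source yields a different viable aggregate, so the number of viable answers is exponential in the number of conflicting components.

    from itertools import product

    def viable_sums(component_values):
        """component_values: component id -> list of values reported by different sources.
        Returns every viable sum, one per way of choosing a providing source per component."""
        return sorted(sum(choice) for choice in product(*component_values.values()))

    # Two sources disagree on B1's loss; B2 has a single value.
    print(viable_sums({"B1": [1.5e6, 1.6e6], "B2": [0.5e6]}))  # [2000000.0, 2100000.0]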

Our specific contributions are as follows:

  • 1.

    We identify the problem and challenges of semantically integrating domain heterogeneous schemas using aggregations. We propose an architecture-independent solution to semantically integrate domain heterogeneous schemas using DAQs over 3-role structures (Section 1).

  • 2.

    We formalize DAQs and propose to answer them with a two-phase query answering scheme that respects both the efficiency and the quality of query answering (Section 3).

  • 3.

    We propose “aggregation rewriting” to translate DAQs into aggregate queries using 3-role structures. The query rewriting algorithm, working together with existing query rewriting techniques, supports generic queries and is provably scalable (Sections 4 and 5).

  • 4.

    We describe query optimization for phase 1 query answering, which efficiently computes a viable aggregate answer compatible with the relational model. We study the complexity of the optimization (Section 6).

  • 5.

    We sketch Value Estimation, our solution for phase 2 answering, which helps users who have additional time better understand the space of possible viable answers discovered above, thus completing the whole DAQ processing framework. Our phase 2 solution is fully described in [66]. Both query rewriting and query optimizations are empirically verified (Section 7).

In addition to the above sections, Section 2 describes preliminaries, Section 8 empirically evaluates our work, and Section 9 summarizes related work. Section 10 concludes.

Section snippets

Preliminaries

Since different data sources have different schemas, processing queries over multiple data sources requires translating queries from one schema to another. Semantic mappings relate elements in one schema to elements in another. Some semantic integration architectures use schema mappings [41], [45], while others use data mappings [4]. To be as general as possible, we allow both. The concepts are formally defined as follows:

Definition 2.1 Schema mapping

A schema mapping from a source schema σ to a target schema τ, denoted as
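
The snippet is cut off here. For orientation only, and not necessarily in the exact notation of this paper, a schema mapping from a source schema σ to a target schema τ is commonly written in the data-exchange literature as a set of source-to-target tuple-generating dependencies of the form

    \forall \bar{x}\, \big( \varphi_{\sigma}(\bar{x}) \rightarrow \exists \bar{y}\, \psi_{\tau}(\bar{x}, \bar{y}) \big)

where \varphi_{\sigma} is a conjunction of atoms over the source schema σ and \psi_{\tau} is a conjunction of atoms over the target schema τ.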

Decomposition aggregation queries

As motivated in the Introduction, answering a query over high-level compounds in a domain heterogeneous semantic integration system requires decomposing the original queries into queries over the components that make up the compounds. In addition to the more standard query nodes (q-nodes) and answer nodes (a-nodes), we reify the role of decomposing the original query as the job of a decomposition node (d-node), as shown in Fig. 1. Formally:

Definition 3.1 3-Role structure

A 3-role structure is a directed graph T(V(q,d,A),E), where data

Overview of DAQ answering

Example 4.1 DAQ answering

As depicted in Fig. 1, data sources I, J, K1 and K2 form the 3-role structure to process the DAQ Q in Example 1.2. First, the cells in Q are decomposed to buildings on the d-node J. Then, aggregation rewriting translates Q into aggregate queries over buildings for a-nodes K1 and K2; each covers a fraction of the buildings decomposed from the cells. The results of the aggregates are then merged at J and the final answer is returned to the q-node I, thus completing DAQ answering. 
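
The sketch below mirrors the flow of Example 4.1 with hypothetical data (it is an illustration, not the paper's algorithm): each a-node aggregates only the buildings it holds and returns partial sums and counts, which the d-node merges into the final per-cell average and sum before the answer is handed back to the q-node.

    def a_node_partial(buildings, local_data):
        """Run at an a-node: aggregate only the buildings this source holds."""
        rows = [local_data[b] for b in buildings if b in local_data]
        return {"dmg_sum": sum(d for d, _ in rows),
                "loss_sum": sum(l for _, l in rows),
                "count": len(rows)}

    def merge_at_d_node(partials):
        """Run at the d-node: combine the a-nodes' partial results into the cell's answer."""
        dmg = sum(p["dmg_sum"] for p in partials)
        loss = sum(p["loss_sum"] for p in partials)
        n = sum(p["count"] for p in partials)
        return {"Cell.damage": dmg / n, "Cell.loss": loss}

    # Cell C1 decomposes into B1..B3; K1 and K2 each hold a fraction of the buildings.
    k1_data = {"B1": (0.2, 1.5e6), "B2": (0.4, 0.5e6)}
    k2_data = {"B3": (0.3, 1.0e6)}
    parts = [a_node_partial(["B1", "B2", "B3"], k1_data),
             a_node_partial(["B1", "B2", "B3"], k2_data)]
    print(merge_at_d_node(parts))  # approximately {'Cell.damage': 0.3, 'Cell.loss': 3000000.0}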

As shown in Fig.

Generalization

We have thus far presented a basic aggregation rewriting algorithm in which the inputs are conjunctive queries stripped down to only the elements required to be transformed into aggregations. In practice, queries are usually more complicated. There can be more joins; not all queries fit directly with mappings (e.g., keys are not explicit in the query body). Also, an a-node does not necessarily contain all the components in a compound as is demonstrated in the algorithm. We describe

DAQ query optimization for phase 1

This and the following section are devoted to DAQ query optimization. We perform a two-phase query optimization for DAQ. In phase 1, the optimization focuses on finding a combination of data sources so that a scalar viable answer to the DAQ can be returned while keeping the system overhead of DAQ processing as low as possible. In phase 2 optimization, we extract statistical information to better describe the viable answer distributions. Phase 2 generally requires more processing power but

DAQ query optimization for phase 2

Phase 1 answering returns a scalar viable answer that is compatible with the aggregation defined in the relational model. In phase 2 query processing, we address the fact that the viable answers to a DAQ form a distribution of values rather than one unique value. While a full discussion of phase 2 query answering is beyond the scope of this paper, we include enough information here to convey the full solution to the two-phase DAQ processing; see [66] for phase 2 algorithmic details.

The goal of phase 2

Empirical study

In this section, we present our empirical study of DAQ query processing. We first evaluate the aggregation rewriting algorithms and then test the phase 1 optimizations. Some of the phase 2 results were presented inline in Section 7 and are omitted here for space considerations.

Related work

Cohen et al. [13] show that rewriting aggregation queries is NP-complete for complete rewritings and that polynomial algorithms for partial rewritings exist only for linear queries. Cohen et al. [13] also provide a theoretical framework that unifies previous work on rewriting aggregation queries. Our work differs from these approaches in that we discover aggregations and the grouping functions needed to transform a query into aggregations instead of using aggregation views for query rewriting.

In [2], the

Conclusion and future work

In this paper we introduced a new query type to integrate domain heterogeneous data sources. This new query type, named the decomposition aggregate query (DAQ), is processed over the 3-role structure, which bridges the gap of domain heterogeneity between data sources in semantic integration. A 3-role structure has data sources, called d-nodes, that accomplish the task of decomposing compounds into components and help translate non-aggregate queries over compounds into aggregate queries

Acknowledgments

We thank Michael Lawrence, Chao Yan, Lin Xu, Yanling Cai (from the Department of Statistics), Melsa Smith, Laks Lakshmanan and Raymond Ng for helpful discussions. The building damage assessment database is provided by Kate Thibert and Hugon Juarez in the Civil Engineering Department. The GIS database is maintained by Alejandro Cervantes in the Geography Department. This work is partially funded by NSERC and Public Safety Canada.

References (67)

  • J. Beasley et al.

    A genetic algorithm for the set covering problem

    European Journal of Operational Research

    (1996)
  • P. Beraldi et al.

    The probabilistic set-covering problem

    Operations Research

    (2002)
  • Canada climate data,...
  • V. Chvatal

    A greedy heuristic for the set-covering problem

    Mathematics of Operations Research

    (1979)
  • S. Cohen, W. Nutt, A. Serebrenik, Rewriting aggregate queries using views, in: Symposium on Principles of Database...
  • S. Cohen et al.

    Rewriting queries with arbitrary aggregation functions using views

    Transactions on Database Systems

    (2006)
  • J. Considine, M. Hadjieleftheriou, F. Li, J. Byers, G. Kollios, Robust approximate aggregation in sensor data management systems,...
  • J. Considine, F. Li, G. Kollios, J. Byers, Approximate aggregation techniques for sensor databases, in: ICDE, 2004, pp....
  • M. Cygan et al.

    Exponential-time approximation of weighted set cover

    Information Processing Letters

    (2009)
  • E. Davis, Inference in datalog,...
  • M.J. Franklin et al.

    From databases to dataspaces: a new abstraction for information management

    SIGMOD Record

    (2005)
  • Freebase datasets,...
  • S. Ganguly et al.

    Bifocal sampling for skew-resistant join size estimation

    SIGMOD Record

    (1996)
  • A.C. Gilbert, Y. Kotidis, S. Muthukrishnan, M.J. Strauss, Surfing wavelets on streams: one-pass summaries for...
  • G. Gottlob, A. Nash, Data exchange: computing cores in polynomial time, in: PODS, 2006, pp....
  • P.J. Haas, J.F. Naughton, S. Seshadri, A.N. Swami, Fixed-precision estimation of join selectivity, in: PODS, 1993, pp....
  • A. Halevy

    Answering queries using views: a survey

    VLDB Journal

    (2001)
  • A. Halevy, A. Rajaraman, J. Ordille, Data integration: the teenage years, in: VLDB,...
  • A.Y. Halevy, Z.G. Ives, D. Suciu, I. Tatarinov, Piazza: data management infrastructure for semantic web applications,...
  • H. Hoos et al.

    Stochastic Local Search: Foundations & Applications

    (2004)
  • Y. Ioannidis, The history of histograms (abridged), in: VLDB,...
  • Y.E. Ioannidis et al.

    Optimal histograms for limiting worst-case error propagation in the size of join results

    ACM Transactions on Database Systems

    (1993)
  • Z. Ives et al.

    The orchestra collaborative data sharing system

    SIGMOD Record

    (2008)