Integrating domain heterogeneous data sources using decomposition aggregation queries
Introduction
Many applications require querying multiple databases with heterogeneous schemas. We refer to all systems that translate queries over databases and thus support querying multiple, heterogeneous, independently maintained databases as semantic integration systems. Many semantic integration architectures exist, including data integration (e.g., [26], [27]), peer data management systems (PDMSs, e.g., [4], [28]), and dataspaces [20].
Existing semantic integration approaches generally focus on cases where the schemas are heterogeneous, but the entities in the sources are from the same domain — the data sources are domain homogeneous. For example, querying independently maintained bibliography databases is a widely used example for integrating domain homogeneous data sources. Although bibliographic records may have different schemas in different data sources, the schemas represent the same kind of objects. Semantic integration must also manage domain heterogeneity. For example, the 50+ Amazon public data sets [3] are categorized into 8 domains and the 7144 databases on Freebase [21] belong to 86 domains. Integrating domain heterogeneous schemas is more challenging than integrating domain homogeneous schemas since it requires transforming attributes of objects in one domain to those in another domain. One major difficulty is that entities in one domain may not have direct correspondences in another domain. Although existing semantic integration systems support simple transformations, such as concatenating the first name and last name in one schema to form a full name in another schema, integrating domain heterogeneous data requires additional support. As shown in the following example, motivated by a real-world disaster management project (JIIRP [36]), these new demands require developing new techniques to manage associations between domain heterogeneous entities and answer cross-domain queries. Example 1.1 Cell–building heterogeneity Consider planning responses to disasters. The planners' domain abstracts infrastructure elements as “cells,” which are logical units that perform a single function (e.g., a hospital complex), while the engineers' domain models seismic damage on “buildings.” □
Traditional semantic integration systems motivate their work with both domain homogeneous and domain heterogeneous examples. However, the mappings used by existing semantic integration systems typically do not handle domain heterogeneous sources. For example, traditional mappings (e.g., [49], [58]) cannot translate queries between the cells and buildings in Example 1.1 since there are no common objects to relate them. The mappings in [39], [40] consider objects from different domains; however, the system still requires common attributes to relate data records (e.g., a gene record and a research paper about the gene are linked by the gene's id). Deeper semantic relationships between objects, e.g., those that require aggregation, as in the following example, are not managed or used for integration.
Example 1.2 shows a simplified scenario of how the relationships between heterogeneous objects in Example 1.1 can be expressed using aggregation: Example 1.2 Aggregation between cells and buildings Table 1 shows a snapshot of the cells (in the planners' domain) and building damage assessment (in the engineers' domain). A cell's damage (Cell.damage) is estimated by averaging the damage to the constituent buildings (avg (BdnDmg.Damage)). Monetary loss (Cell.loss) is estimated by summing the buildings' losses (sum(BdnDmg.Loss)). Lacking systematic integration, the Cell table is manually populated by aggregating records in the BdnDmg table for each cell (e.g., C1 and C2). This is challenging because those calculating the losses are unfamiliar with seismic damage assessment. Additionally, the laborious process of populating a cell discourages users from defining new cells or updating cell attributes. Our goal is to systematically transform queries on cells into aggregate queries over building seismic assessments (i.e., to automatically compute the “unknowns” in the Cell table in Table 1). □
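The cell-level aggregation in Example 1.2 can be sketched in a few lines. The building records and cell memberships below are illustrative assumptions, not the actual values from Table 1:

```python
# Toy sketch of Example 1.2 (data values are assumptions, not Table 1's):
# Cell.damage = avg(BdnDmg.Damage) and Cell.loss = sum(BdnDmg.Loss),
# computed over each cell's constituent buildings.

# Hypothetical BdnDmg records: building id -> (damage, loss)
bdn_dmg = {
    "B1": (30.0, 120_000.0),
    "B2": (50.0, 200_000.0),
    "B3": (10.0, 40_000.0),
}

# Hypothetical decomposition: cell -> constituent buildings.
cell_buildings = {"C1": ["B1", "B2"], "C2": ["B3"]}

def populate_cell(cell_id):
    """Fill a cell's unknown attributes by aggregating its buildings."""
    rows = [bdn_dmg[b] for b in cell_buildings[cell_id]]
    damage = sum(d for d, _ in rows) / len(rows)   # avg(BdnDmg.Damage)
    loss = sum(l for _, l in rows)                 # sum(BdnDmg.Loss)
    return damage, loss

print(populate_cell("C1"))  # (40.0, 320000.0)
```

This is exactly the manual computation the example describes; the point of DAQ processing is to derive such aggregations automatically from mappings rather than hand-coding them per cell.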
Computing the damage and loss attributes in Example 1.2 requires translating queries on Cells into aggregate queries on BdnDmg data. One difficulty is that if the users do not know the data sources beforehand, it is impossible to pre-determine the aggregate queries associated with the Cells.
This problem motivates the focus of this work: how can we bridge the gap between domain heterogeneous data sources by automatically transforming queries over complex compounds into queries over the components that form them?
Answering domain heterogeneous aggregate queries is challenging for three reasons: (1) domain heterogeneity prevents users from manually translating queries due to their limited domain knowledge; (2) users may not know (or even desire to know) that their query requires aggregating data from multiple sources; and (3) relevant data, with potential duplicates, varying answers, and differing coverage, must be collected from multiple databases. These challenges require supporting fully automatic query translation at the system level.
In addition to our running example, there are many other cases requiring solutions to domain heterogeneous aggregate queries. For example:
- •
Estimating the cost to build a room. The cost of a room is estimated by decomposing the room into its constituent parts (e.g., windows and beams) and then aggregating their costs from providers' databases. Lawrence et al. [42] focused on creating and maintaining the mappings necessary to coordinate updates, not on how to choose which part of the aggregation comes from which source, or what to do with conflicting values.
- •
Quickly aggregating data in order to detect fraud or to improve performance analysis. This was studied in the context of streams [23], but without focusing on the fact that the data may come from multiple sources (e.g., identity theft can be better traced by combining information from bank accounts, credit cards, medical records, and criminal records). In this case, the exact figures do not matter; what matters is quickly finding abnormalities in the overall trend.
- •
In market analysis, data is pulled from a number of domain heterogeneous sources in order to find potential candidates to target with advertising. For example, one might look for the total purchases of customers who are between 18 and 30 and have a household income of between $30,000 and $60,000 [43]. As in our running scenario, the goal is to look for answers that are viable, but since the different sources may have different values, it may not always be possible to determine one “true” answer.
- •
Wireless sensor networks often need to quickly make security decisions and provide environmental information [18] based on aggregating the data involved, even though the sensors are likely to return conflicting values.
We abstract such integration needs using what we term a “Decomposition Aggregate Query (DAQ).” While we formally define DAQs in Section 3, we informally define and motivate the key features now. The main feature of a DAQ that differentiates it from queries asked over a semantic integration system is that a DAQ is asked over a different level of granularity than a standard semantic integration query. For example, calculating the unknown values in the Cells in Domain A of Table 1 requires aggregating values from the buildings in Domain B in Table 1. Hence a DAQ considers two main types of objects: (1) compounds, which are the high-level entities that the aggregate query is asked over (e.g., “Cell”), and (2) components, which are the sub-elements needed for the aggregation (e.g., “Buildings”).
Breaking DAQs down into the components that the queried compounds consist of, and aggregating over those components, requires a change in the roles of the nodes in a semantic integration system. Traditional semantic integration systems consider that nodes have one of two roles: a Query Node (q-node) asks the queries and Answer Nodes (a-nodes) provide the answers. In both traditional semantic integration systems and in our case, different nodes may fill different roles for different queries; one node may fill several roles, even in the same query (e.g., a q-node may provide some answers for its own query, thus also filling the role of one of the a-nodes). In a DAQ, however, having only two roles is insufficient, since a DAQ cannot be answered without breaking down the compounds into components. Hence we provide a novel 3-role structure, which introduces a third type of node: a Decomposition Node (d-node), which breaks the compounds into components—the sub-elements needed for the aggregation (e.g., “BdnGIS”). Thus, the 3-role structure bridges the gaps between domains.
Fig. 1 shows a simple 3-role structure for our running example. I is the q-node, which issues the query over a set of compounds, e.g., “Cell.” J is the d-node, which decomposes the initial query into a set of queries over components, and K1 and K2 are the a-nodes which process the decomposed queries over components. As noted above, in a real semantic integration system, the roles may well be intertwined, with one physical node filling one or more of the different roles described here.
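Under assumed toy node contents (and averaging as the aggregate), the flow among I, J, K1, and K2 can be sketched as follows; the node names match Fig. 1, but the data and merge logic are illustrative assumptions:

```python
# Illustrative sketch of the 3-role structure in Fig. 1 (data assumed):
# q-node I issues a query over cells, d-node J decomposes cells into
# buildings, a-nodes K1/K2 each aggregate over the buildings they hold,
# and J merges the partial results before returning them to I.

# d-node J: decomposition mapping from compounds to components.
decomposition = {"C1": {"B1", "B2"}, "C2": {"B3", "B4"}}

# a-nodes: each holds damage values for a fraction of the buildings.
a_nodes = {
    "K1": {"B1": 30.0, "B3": 10.0},
    "K2": {"B2": 50.0, "B4": 20.0},
}

def answer_daq(cells):
    """Translate a query over cells into per-a-node aggregate subqueries
    and merge the partial (sum, count) pairs into an average per cell."""
    result = {}
    for cell in cells:
        components = decomposition[cell]          # decomposition at J
        total, count = 0.0, 0
        for node_data in a_nodes.values():        # subqueries at K1, K2
            hits = [node_data[b] for b in components if b in node_data]
            total += sum(hits)                    # partial aggregates
            count += len(hits)
        result[cell] = total / count              # merge at J, return to I
    return result

print(answer_daq(["C1", "C2"]))  # {'C1': 40.0, 'C2': 15.0}
```

Note that the average is merged from partial sums and counts rather than averaging per-node averages, which would be incorrect when a-nodes hold different numbers of components.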
The 3-role structure is identified by examining the mappings between data sources. No additional work is needed to “set up” the 3-role structures. The cell, building, and damage assessment databases in Fig. 1 are automatically identified to form a 3-role structure, as we show later in the planning step (Section 4.2) of DAQ answering.
This paper defines decomposition aggregation queries (DAQs) over this 3-role structure and shows how to process DAQs using aggregation rewritings. Processing DAQs has some distinct features that are different from existing semantic integration query answering: 1. Instead of only querying objects of equivalent granularity, queries over compounds are transformed into queries over the constituent components. 2. Instead of processing open queries for “tuples satisfying some condition,” DAQs are closed: they solve for a pre-determined set of compounds. Hence, in addition to sending translated queries during query answering, data sets are also sent among data sources. 3. Decomposition of compounds on the 3-role structure relies on data sources that do not directly answer the query; previously only data sources that directly answered the query were useful. 4. When multiple data sources for the components are present, choosing the right sources to use becomes important. Moreover, the distribution of answers from different combinations of data sources is also of interest.
DAQs are answered as aggregate queries. When multiple data sources are available to compute an aggregation, the resulting answer depends on which sources are used to provide inputs to the aggregation. Hence, there are a whole range of viable answers to DAQs depending on which values come from which sources. This causes many challenges to query optimization, which we briefly describe here and discuss in further detail in Section 3.2.
We design DAQ query optimization as a two-phase optimization operation that consists of a source selection operation (phase 1) and distribution estimation operation (phase 2). In phase 1, the goal of query processing is to return one scalar value per compound computed by aggregating the compound's components. As there exist multiple choices to combine data sources to obtain such scalar answers, the phase 1 optimization aims to find source combinations that result in the least system overhead. Therefore, phase 1 optimization is also known as “source selection optimization.” Details of phase 1 optimizations are covered in Section 6 where the following challenges are addressed:
- 1.
The size of data source combinations can be exponentially large.
- 2.
The optimization problems are intrinsically NP-hard.
- 3.
The exact set of components that can be aggregated on a data source is unknown before DAQ processing.
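Challenges 1 and 2 together suggest why heuristics matter: choosing a cheap combination of a-nodes that covers all needed components resembles weighted set cover. The greedy heuristic below is a standard approximation sketched under assumed costs and coverage sets; it illustrates the flavor of the problem, not the paper's actual phase 1 algorithm:

```python
# Hedged sketch: phase 1 source selection viewed as weighted set cover
# (each a-node covers a subset of the needed components at some cost).
# Costs and coverage sets are assumptions for illustration only.

def greedy_source_selection(needed, sources, costs):
    """Pick sources until every needed component is covered, always
    choosing the source with the lowest cost per newly covered component."""
    uncovered = set(needed)
    chosen = []
    while uncovered:
        best = min(
            (s for s in sources if sources[s] & uncovered),
            key=lambda s: costs[s] / len(sources[s] & uncovered),
            default=None,
        )
        if best is None:  # some component has no covering source
            raise ValueError(f"uncoverable components: {uncovered}")
        chosen.append(best)
        uncovered -= sources[best]
    return chosen

# Hypothetical a-nodes covering buildings B1..B4 at different costs.
sources = {"K1": {"B1", "B3"}, "K2": {"B2", "B4"},
           "K3": {"B1", "B2", "B3", "B4"}}
costs = {"K1": 2.0, "K2": 2.0, "K3": 3.0}
print(greedy_source_selection({"B1", "B2", "B3", "B4"}, sources, costs))
# ['K3']
```

The greedy heuristic gives the classic logarithmic approximation guarantee for set cover, which is one standard response to the NP-hardness noted in challenge 2.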
As a complement to phase 1 optimization, which finds a scalar answer for an incoming DAQ, phase 2 optimization provides a comprehensive view of the distribution of scalar answers computable from source combinations. The large number of viable answers is again the biggest challenge to the scalability of query processing. Section 7 briefly describes phase 2 optimization; a more comprehensive solution, including a full theoretical analysis, can be found in [66].
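To make the notion of an answer distribution concrete: when sources report conflicting values, each choice of source per component yields a different viable answer. One simple way to explore the space without enumerating every combination is to sample it, as sketched below; the sampling scheme and data are assumptions for illustration and are not the estimation method of [66]:

```python
# Hedged sketch of phase 2's goal: different source combinations yield
# different viable answers, so sampling combinations approximates the
# answer distribution without enumerating the exponential space.
import random

# Two hypothetical sources report (possibly conflicting) losses per building.
reports = {
    "B1": [120_000.0, 130_000.0],
    "B2": [200_000.0, 190_000.0],
}

def sample_viable_answers(n, seed=0):
    """Each sample picks, per building, one source's value and sums them,
    producing one viable answer per sample."""
    rng = random.Random(seed)
    return [sum(rng.choice(vals) for vals in reports.values())
            for _ in range(n)]

answers = sample_viable_answers(1000)
print(min(answers), max(answers))  # extremes of the sampled viable answers
```

With only two conflicting sources per building the viable answers here range between 310,000 and 330,000; with many sources and components, summarizing such a distribution efficiently is precisely the phase 2 problem.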
Our specific contributions are as follows:
- 1.
We identify the problem and challenges of semantically integrating domain heterogeneous schemas using aggregations. We propose an architecture-independent solution to semantically integrate domain heterogeneous schemas using DAQs over 3-role structures (Section 1).
- 2.
We formalize DAQs and propose to answer DAQs with a two-phase query answering scheme that addresses both the efficiency and the quality of query answering (Section 3).
- 3.
We propose “aggregation rewriting” to translate DAQs into aggregate queries using 3-role structures. The query rewriting algorithm, working together with existing query rewriting techniques, supports generic queries and is provably scalable (Sections 4 and 5).
- 4.
We describe query optimization for phase 1 query answering, which efficiently computes a viable aggregate answer compatible with the relational model. We study the complexity of the optimization (Section 6).
- 5.
We sketch Value Estimation, our solution for phase 2 answering, which helps users who have additional time to better understand the space of possible viable answers discovered in phase 1, thus completing the whole DAQ processing framework. Our phase 2 solution is fully described in [66]. Both the query rewriting and the query optimizations are empirically verified (Section 7).
In addition to the above sections, Section 2 describes preliminaries, Section 8 empirically evaluates our work, and Section 9 summarizes related work. Section 10 concludes.
Section snippets
Preliminaries
Since different data sources have different schemas, processing queries over multiple data sources requires translating queries from one schema to another. Semantic mappings relate elements in one schema to elements in another. Some semantic integration architectures use schema mappings [41], [45], while others use data mappings [4]. To be as general as possible, we allow both. The concepts are formally defined as follows: Definition 2.1 Schema mapping A schema mapping from a source schema to a target schema , denoted as
Decomposition aggregation queries
As motivated in the Introduction, answering a query over high-level compounds in a domain heterogeneous semantic integration system requires decomposing the original queries into the components of which the compounds are composed. In addition to the more standard query nodes (q-nodes) and answer nodes (a-nodes), we reify the role of decomposing the original query as the job of a decomposition node (d-node), as shown in Fig. 1. Formally: Definition 3.1 3-Role structure A 3-role structure is a directed graph , where data
Overview of DAQ answering
Example 4.1 DAQ answering As depicted in Fig. 1, data sources I, J, K1 and K2 form the 3-role structure to process the DAQ Q in Example 1.2. First, the cells in Q are decomposed to buildings on the d-node J. Then, aggregation rewriting translates Q into aggregate queries over buildings for a-nodes K1 and K2; each covers a fraction of the buildings decomposed from the cells. The results of the aggregates are then merged at J and the final answer is returned to the q-node I, thus completing DAQ answering. □
As shown in Fig.
Generalization
We have thus far presented a basic aggregation rewriting algorithm in which the inputs to the algorithm are conjunctive queries stripped down to only elements required to be transformed into aggregations. In practice, queries are usually more complicated. There can be more joins; not all queries fit directly with mappings (e.g., keys are not explicit in the query body). Also, an a-node does not necessarily contain all the components in a compound as is demonstrated in the algorithm. We describe
DAQ query optimization for phase 1
This and the following section are devoted to DAQ query optimization. We perform a two-phase query optimization for DAQ. In phase 1, the optimization focuses on finding a combination of data sources so that a scalar viable answer to the DAQ can be returned while keeping the system overhead of DAQ processing as low as possible. In phase 2 optimization, we extract statistical information to better describe the viable answer distributions. Phase 2 generally requires more processing power but
DAQ query optimization for phase 2
Phase 1 answering returns a scalar viable answer which is compatible with the aggregation defined in the relational model. In phase 2 query processing, we address the fact that the viable answers to a DAQ form a distribution of values rather than one unique value. While a full discussion of phase 2 query answering is beyond the scope of this report, we include enough information here to see the full solution to the two-phase DAQ processing; see [66] for phase 2 algorithmic details.
The goal of phase 2
Empirical study
In this section, we present our empirical study on DAQ query processing. We first evaluate the aggregate rewriting algorithms and then test the phase 1 optimizations. Some of the phase 2 results were presented inline in Section 7, and we omit them here for space considerations.
Related work
Cohen et al. [13] show that the complexity of rewriting aggregation queries is NP-complete for complete rewritings and that polynomial algorithms for partial rewritings exist only for linear queries. Cohen et al. [13] also provide a theoretical framework to unify previous aggregation rewriting works. Our work differs from these approaches since we discover the aggregations and the grouping functions needed to transform a query into aggregations instead of using aggregation views for query rewriting.
In [2], the
Conclusion and future work
In this paper we introduced a new query type to integrate domain heterogeneous data sources. This new query type, named the decomposition aggregate query (DAQ), is processed over the 3-role structure, which bridges the gap of domain heterogeneity between data sources in semantic integration. A 3-role structure has data sources called d-nodes that accomplish the task of decomposing compounds into components and help translate non-aggregate queries over compounds into aggregate queries
Acknowledgments
We thank Michael Lawrence, Chao Yan, Lin Xu, Yanling Cai (from the Department of Statistics), Melsa Smith, Laks Lakshmanan and Raymond Ng for helpful discussions. The Building damage assessment database is provided by Kate Thibert and Hugon Juarez in the Civil Engineering Department. The GIS database is maintained by Alejandro Cervantes in the Geography Department. This work is partially funded by NSERC and Public Safety Canada.
References (67)

- et al., Efficient approximation of min set cover by moderately exponential algorithms, Theoretical Computer Science (2009)
- et al., Processing approximate aggregate queries in wireless sensor networks, Information Systems (2006)
- et al., Data exchange: semantics and query answering, Theoretical Computer Science (2005)
- et al., SEMINT: a tool for identifying attribute correspondences in heterogeneous databases using neural networks, Data & Knowledge Engineering (2000)
- S. Abiteboul, O. Duschka, Complexity of answering queries using materialized views, in: Symposium on Principles of...
- F.N. Afrati, P.G. Kolaitis, Answering aggregate queries in data exchange, in: PODS,...
- Amazon public datasets,...
- et al., The Hyperion project: from data integration to data coordination, SIGMOD Record (2003)
- et al., A dynamic subgradient-based branch-and-bound procedure for set covering, Operations Research (1996)
- C. Baquero, P.S. Almeida, R. Menezes, Fast estimation of aggregates in unstructured networks, in: International...
- A genetic algorithm for the set covering problem, European Journal of Operational Research
- The probabilistic set-covering problem, Operations Research
- A greedy heuristic for the set-covering problem, Mathematics of Operations Research
- Rewriting queries with arbitrary aggregation functions using views, Transactions on Database Systems
- Exponential-time approximation of weighted set cover, Information Processing Letters
- From databases to dataspaces: a new abstraction for information management, SIGMOD Record
- Bifocal sampling for skew-resistant join size estimation, SIGMOD Record
- Answering queries using views: a survey, VLDB Journal
- Stochastic Local Search: Foundations & Applications
- Optimal histograms for limiting worst-case error propagation in the size of join results, ACM Transactions on Database Systems
- The ORCHESTRA collaborative data sharing system, SIGMOD Record