Optimization of sub-query processing in distributed data integration systems
Introduction
As cloud and grid computing become increasingly popular, a growing number of applications need to access and process data from multiple distributed sources. For example, a bioinformatics application may need to query autonomous databases across the world to access different types of protein and protein–protein interaction information held in different storage clouds.
Data integration in Clouds/Grids is a promising solution for combining and analyzing data from different stores. Several projects, including OGSA-DQP (Lynden et al., 2009), CoDIMS-G (Fontes et al., 2004), and GridDB-Lite (Narayanan et al., 2003), have been developed to study data integration in distributed environments. For example, OGSA-DQP (Lynden et al., 2009) is a service-oriented, distributed query processor that provides effective declarative support for service orchestration. It is based on an infrastructure of distributed services for efficient evaluation of distributed queries over OGSA-DAI wrapped data sources and analysis resources available as services.
Queries to data integration systems are generally formulated over virtual schemas. Given a user query, a data integration system typically processes it by translating it into a query plan and evaluating that plan. A query plan consists of a set of sub-queries formulated over the data sources, together with operators specifying how to combine the results of the sub-queries to answer the user query. As Clouds/Grids are generally built over wide-area networks, high communication cost is the main cause of slow query response time. Query performance can therefore be improved by minimizing communication cost. In this paper, our objective is to reduce communication overhead, and thereby improve query performance, by optimizing sub-query processing.
We optimize sub-query processing by exploiting data sharing opportunities among sub-queries. IGNITE, a method proposed in Lee et al. (2007), detects data sharing opportunities across concurrent distributed queries. By combining multiple similar data requests issued to the same data source into a common data request, IGNITE can reduce communication overhead and thereby increase system throughput. However, IGNITE does not utilize parallel data transmission, so it does not always improve query performance. The approach proposed here enhances IGNITE by addressing its drawbacks so that query performance in distributed systems can be further improved.
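The idea of folding similar requests into a common request can be illustrated with a small sketch. It is not IGNITE's implementation. It assumes, purely for illustration, that each sub-query is a closed integer range over a single key of one source, and the names `Request`, `common_request`, and `answer_from_common` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    """Hypothetical sub-query: rows of one source with lo <= key <= hi."""
    source: str
    lo: int
    hi: int


def common_request(requests):
    """Return a single request whose range covers every given request."""
    sources = {r.source for r in requests}
    if len(sources) != 1:
        raise ValueError("requests must target the same data source")
    return Request(sources.pop(),
                   min(r.lo for r in requests),
                   max(r.hi for r in requests))


def answer_from_common(request, common_rows):
    """Derive one sub-query answer by filtering the shared result locally."""
    return [row for row in common_rows if request.lo <= row <= request.hi]


# Toy source: the keys 0..99 stand in for remote rows (an assumption).
source_rows = list(range(100))
reqs = [Request("proteins", 10, 40), Request("proteins", 30, 60)]

common = common_request(reqs)
# One remote transfer for the common request instead of one per sub-query.
fetched = [k for k in source_rows if common.lo <= k <= common.hi]
answers = [answer_from_common(r, fetched) for r in reqs]

separate_cost = sum(r.hi - r.lo + 1 for r in reqs)  # rows shipped without sharing
shared_cost = len(fetched)                          # rows shipped with sharing
```

In this toy case the overlapping keys 30..40 are shipped once rather than twice. A covering range is only one possible notion of a common request: if the original ranges were far apart, it would also fetch rows that no sub-query needs.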
Our data integration system employs an operator-centric data flow execution model of the kind proposed in Harizopoulos et al. (2005). Each operator corresponds to a μEngine, which has local threads for data processing and data dispatching. Queries are processed by routing data through μEngines. All the μEngines work in parallel and can therefore fully exploit intra-query parallelism. Under this execution model, all similar query plans are allocated to the same group of μEngines, so sub-queries from different queries are gathered in a common place for processing, which enables data sharing across the sub-queries.
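A minimal sketch of such an operator-centric pipeline is given below. It is an illustration only, not the engine described in this paper: it assumes one worker thread per operator (the text mentions separate processing and dispatching threads), and the names `MicroEngine`, `run_pipeline`, `select`, and `project` are hypothetical.

```python
import queue
import threading

_DONE = object()  # sentinel marking the end of a data stream


class MicroEngine(threading.Thread):
    """Hypothetical operator stage: take a tuple from inbox, apply fn, emit results."""

    def __init__(self, fn, inbox, outbox):
        super().__init__(daemon=True)
        self.fn, self.inbox, self.outbox = fn, inbox, outbox

    def run(self):
        while True:
            item = self.inbox.get()
            if item is _DONE:
                self.outbox.put(_DONE)  # propagate end-of-stream downstream
                return
            for out in self.fn(item):
                self.outbox.put(out)


def run_pipeline(rows, stages):
    """Route rows through one engine per operator; all engines run concurrently."""
    queues = [queue.Queue() for _ in range(len(stages) + 1)]
    engines = [MicroEngine(fn, queues[i], queues[i + 1])
               for i, fn in enumerate(stages)]
    for engine in engines:
        engine.start()
    for row in rows:
        queues[0].put(row)
    queues[0].put(_DONE)

    results = []
    while True:
        item = queues[-1].get()
        if item is _DONE:
            break
        results.append(item)
    for engine in engines:
        engine.join()
    return results


def select(row):
    """Selection operator: keep rows whose score exceeds 0.5."""
    return [row] if row["score"] > 0.5 else []


def project(row):
    """Projection operator: keep only the id column."""
    return [row["id"]]


rows = [{"id": 1, "score": 0.9}, {"id": 2, "score": 0.2}, {"id": 3, "score": 0.7}]
result = run_pipeline(rows, [select, project])
```

Because every operator owns its queue and thread, a downstream operator can start working on early tuples while upstream operators are still producing later ones, which is the intra-query parallelism the model relies on.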
In the μEngine for processing sub-queries, a query reconstruction mechanism with a Merge-Partition (MP) reconstruction algorithm is developed. The query reconstruction mechanism can construct a set of new queries to eliminate data redundancy among the sub-queries being processed by the μEngine. All the sub-query answers can be obtained by evaluating the new queries and therefore the required communication overhead can be reduced.
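The goal stated above, a set of new queries without redundant data from which every sub-query answer can be rebuilt, can be illustrated as follows. This sketch is not the MP algorithm itself, which is presented in Section 4. It assumes that each sub-query is a half-open integer range `[lo, hi)` with `lo < hi` over a single key, and `partition_ranges` and `fetch` are hypothetical names.

```python
def partition_ranges(ranges):
    """Split possibly overlapping half-open ranges [lo, hi) into disjoint pieces.

    Returns (pieces, cover), where cover[i] lists the indices of the pieces
    whose union is exactly ranges[i].
    """
    cuts = sorted({bound for lo, hi in ranges for bound in (lo, hi)})
    pieces = []
    for lo, hi in zip(cuts, cuts[1:]):
        # Keep a piece only if at least one original range needs it.
        if any(r_lo <= lo and hi <= r_hi for r_lo, r_hi in ranges):
            pieces.append((lo, hi))
    cover = [
        [i for i, (lo, hi) in enumerate(pieces) if r_lo <= lo and hi <= r_hi]
        for r_lo, r_hi in ranges
    ]
    return pieces, cover


def fetch(piece):
    """Stand-in for evaluating one new query at a remote source (an assumption)."""
    lo, hi = piece
    return list(range(lo, hi))


sub_queries = [(10, 40), (30, 60), (80, 90)]
pieces, cover = partition_ranges(sub_queries)

# Each disjoint piece is transferred once; answers are assembled locally.
piece_rows = [fetch(p) for p in pieces]
answers = [[row for i in idxs for row in piece_rows[i]] for idxs in cover]

rows_without_sharing = sum(hi - lo for lo, hi in sub_queries)
rows_with_sharing = sum(len(rows) for rows in piece_rows)
```

Since the pieces are disjoint, they could in principle be requested concurrently, and each row crosses the network at most once.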
The rest of the paper is organized as follows. Section 2 presents related work. Section 3 describes the execution model of our data integration system (DIS). Section 4 proposes the Merge-Partition (MP) query reconstruction algorithm used in our DIS. Section 5 discusses the experiments that we conducted to evaluate our solution. Section 6 concludes the paper.
Section snippets
Related work
The IGNITE system proposed in Lee et al. (2007), which was built on the PostgreSQL database, is the work most closely related to the work presented in this paper. IGNITE decouples the source wrappers from the execution engine (adopted from the PostgreSQL database) and enables the execution engine to send sub-queries to the same source, which makes data sharing across sub-queries possible. Meanwhile, IGNITE employs the iterator model proposed in Graefe (1993) so that sub-queries may have delay
Query engine
In this section, we discuss the execution engine of our DIS. The engine employs a data flow style execution model (Section 3.1), on the basis of which sub-queries can be gathered in a common place for evaluation through source wrappers (Section 3.2). In Section 3.3, we discuss in detail why a delay is required for each request in order to better exploit data sharing.
Query reconstruction algorithm
In this section, we introduce the Merge-Partition (MP) reconstruction algorithm applied in our DIS. First, we model the problem of query reconstruction in Section 4.1. Then, in Section 4.2, we present the algorithm and show how it reconstructs a set of queries and computes the answers to those queries.
Evaluation
In this section, we present our evaluation method and results. The overall experimental setup is discussed in Section 5.1, followed by the detailed discussion of each experiment and its results in Section 5.2.
Conclusion
Distributed data sources can be heterogeneous, and managing, analyzing, and processing data from different sources in an integrated way is becoming increasingly important. Distributed data integration applications are processed on distributed infrastructures, where communication cost becomes the main factor in determining query response time. We can therefore expect query performance to improve when communication cost is minimized. The objective of this paper is to propose an
Acknowledgement
This work is supported by the Natural Science Foundation of China (60803121, 60773145, 60911130371, 90812001, 60963005), the National High-Tech R&D (863) Program of China (2009AA01A130, 2006AA01A101, 2006AA01A108, 2006AA01A111, 2006AA01A117), the MOE-Intel Foundation, and the Tsinghua National Laboratory for Information Science and Technology (TNLIST) Cross-discipline Foundation.
References (20)
- et al. The design and implementation of OGSA-DQP: A service-based distributed query processor. Future Generation Computer Systems (2009)
- et al. On estimating the cardinality of the projection of a database relation. ACM Transactions on Databases (1989)
- et al. Pipelining in Multi-Query Optimization. In PODS (2001)
- et al. Adaptive query processing. Foundations and Trends in Databases (2007)
- Fontes V, Schulze B, Dutra M, et al. CoDIMS-G: a data and program integration service for the grid. In: Proceedings of...
- Query evaluation techniques for large databases. ACM Comput Surv (1993)
- et al. Selectivity estimation using probabilistic models. SIGMOD (2001)
- et al. Optimizing queries using materialized views: a practical, scalable solution. SIGMOD (2001)
- Gounaris Anastasios. Resource aware query processing on the grid. PhD thesis, School of Computer Science of the...
- Harizopoulos S, Shkapenyuk V, Ailamaki A. QPipe: a simultaneously pipelined relational query engine. In: Proceedings of...