Optimization of sub-query processing in distributed data integration systems

https://doi.org/10.1016/j.jnca.2010.06.007

Abstract

Data integration systems (DISs) are becoming essential as Cloud/Grid applications need to integrate and analyze data from geographically distributed data sources. A DIS gathers data from multiple remote sources, then integrates and analyzes the data to obtain a query result. As Clouds/Grids are distributed over wide-area networks, communication cost usually dominates overall query response time. Query performance can therefore be improved by minimizing communication cost.

In our method, the DIS uses a data flow style query execution model. Each query plan is mapped to a group of μEngines, each of which is a program corresponding to a particular operator. Thus, multiple sub-queries from concurrent queries are able to share μEngines. We reconstruct these sub-queries to exploit overlapping data among them. As a result, all the sub-queries still obtain their results while the overall communication overhead is reduced. Experimental results show that, when the DIS runs a group of parameterized queries, our reconstruction algorithm reduces the average query completion time by 32–48%; when the DIS runs a group of non-parameterized queries, the average query completion time is reduced by 25–35%.

Introduction

As cloud and grid computing becomes more popular, an increasing number of applications need to access and process data from multiple distributed sources. For example, a bioinformatics application may need to query autonomous databases across the world to access different types of protein and protein–protein interaction information located in different storage clouds.

Data integration in Clouds/Grids is a promising solution for combining and analyzing data from different stores. Several projects (e.g., OGSA-DQP (Lynden et al., 2009), CoDIMS-G (Fontes et al., 2004), and GridDB-Lite (Narayanan et al., 2003)) have been developed to study data integration in distributed environments. For example, OGSA-DQP (Lynden et al., 2009) is a service-oriented, distributed query processor, which provides effective declarative support for service orchestration. It is based on an infrastructure consisting of distributed services for efficient evaluation of distributed queries over OGSA-DAI wrapped data sources and analysis resources available as services.

Queries to data integration systems are generally formulated over virtual schemas. Given a user query, a data integration system typically processes the query by translating it into a query plan and evaluating that plan. A query plan consists of a set of sub-queries formulated over the data sources, together with operators specifying how to combine the results of the sub-queries to answer the user query. As Clouds/Grids are generally built over wide-area networks, high communication cost is the main cause of slow query response time. In this paper, our objective is therefore to reduce communication overhead, and thus improve query performance, by optimizing sub-query processing.
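To make the notion of a query plan concrete, the following is a minimal Python sketch, not the representation used in the paper. The names `SubQuery`, `QueryPlan`, `evaluate`, the in-memory `SOURCES` table, and the example protein data are all hypothetical; `fetch` stands in for a wrapper call that would cross the wide-area network in a real system.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Row = Dict[str, object]


@dataclass(frozen=True)
class SubQuery:
    source: str  # hypothetical name of the remote data source
    text: str    # query text formulated over that source


@dataclass
class QueryPlan:
    sub_queries: List[SubQuery]
    # Operator that combines the sub-query results into the final answer.
    combine: Callable[[List[List[Row]]], List[Row]]


def evaluate(plan: QueryPlan, fetch: Callable[[SubQuery], List[Row]]) -> List[Row]:
    """Evaluate every sub-query via `fetch`, then apply the combining operator."""
    partial_results = [fetch(sq) for sq in plan.sub_queries]
    return plan.combine(partial_results)


# Hypothetical in-memory stand-ins for two remote sources.
SOURCES: Dict[str, List[Row]] = {
    "proteins_db": [{"pid": 1, "name": "P53"}, {"pid": 2, "name": "BRCA1"}],
    "interactions_db": [{"pid": 1, "partner": "MDM2"}],
}


def fetch(sq: SubQuery) -> List[Row]:
    # In a real DIS this call would be the costly remote round trip.
    return list(SOURCES[sq.source])


def join_on_pid(parts: List[List[Row]]) -> List[Row]:
    # Nested-loop equi-join of the two sub-query results on "pid".
    left, right = parts
    return [{**l, **r} for l in left for r in right if l["pid"] == r["pid"]]


plan = QueryPlan(
    sub_queries=[
        SubQuery("proteins_db", "SELECT pid, name FROM proteins"),
        SubQuery("interactions_db", "SELECT pid, partner FROM interactions"),
    ],
    combine=join_on_pid,
)
result = evaluate(plan, fetch)
```

In this sketch every sub-query costs one call to `fetch`, which is why reducing the number or size of such calls is the natural target for optimization.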

We optimize sub-query processing by exploiting data sharing opportunities among sub-queries. IGNITE is a method proposed in Lee et al. (2007) to detect data sharing opportunities across concurrent distributed queries. By combining multiple similar data requests issued to the same data source into a common data request, IGNITE can reduce communication overhead and thereby increase system throughput. However, IGNITE does not utilize parallel data transmission, so it does not always improve query performance. The approach proposed here enhances IGNITE by addressing its drawbacks so that query performance in distributed systems can be further improved.

Our data integration system employs an operator-centric data flow execution model, also proposed in Harizopoulos et al. (2005). Each operator corresponds to a μEngine, which has local threads for data processing and data dispatching. Queries are processed by routing data through μEngines. All the μEngines work in parallel, so they can fully utilize intra-query parallelism. Based on this operator-centric data flow execution model, all similar query plans are allocated to the same group of μEngines. Sub-queries from different queries are therefore gathered in a common place for processing, which enables data sharing across the sub-queries.
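The operator-centric data flow idea can be illustrated with a small Python sketch. It is a simplification under stated assumptions, not the system described in the paper: the class name `MicroEngine`, its methods, and the toy scan and select operators are hypothetical, and each engine here uses a single worker thread for both processing and dispatching, whereas the paper describes local threads for each task.

```python
import queue
import threading

_STOP = object()  # sentinel telling an engine's worker thread to exit


class MicroEngine:
    """One engine per operator; work items from any query pass through its inbox."""

    def __init__(self, name, operator, downstream=None):
        self.name = name
        self.operator = operator      # function: payload -> payload
        self.downstream = downstream  # next engine in the data flow, if any
        self.inbox = queue.Queue()
        self.results = {}             # query_id -> output (last engine only)
        self._thread = threading.Thread(target=self._run, daemon=True)

    def start(self):
        self._thread.start()

    def submit(self, query_id, payload):
        self.inbox.put((query_id, payload))

    def stop(self):
        # The sentinel is queued behind pending work, so that work is drained first.
        self.inbox.put(_STOP)
        self._thread.join()

    def _run(self):
        while True:
            item = self.inbox.get()
            if item is _STOP:
                break
            query_id, payload = item
            output = self.operator(payload)
            if self.downstream is not None:
                self.downstream.submit(query_id, output)  # dispatch onward
            else:
                self.results[query_id] = output


# Two concurrent queries share the same scan and select engines.
select_engine = MicroEngine("select", lambda rows: [r for r in rows if r % 2 == 0])
scan_engine = MicroEngine("scan", lambda n: list(range(n)), downstream=select_engine)
for engine in (scan_engine, select_engine):
    engine.start()

scan_engine.submit("q1", 5)
scan_engine.submit("q2", 8)

# Stop upstream first so everything it dispatched reaches the downstream inbox.
scan_engine.stop()
select_engine.stop()
results = select_engine.results
```

Because both queries flow through the same engine instances, the engine that handles sub-queries sees requests from different queries side by side, which is the precondition for sharing data among them.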

In the μEngine for processing sub-queries, a query reconstruction mechanism with a Merge-Partition (MP) reconstruction algorithm is developed. The query reconstruction mechanism can construct a set of new queries to eliminate data redundancy among the sub-queries being processed by the μEngine. All the sub-query answers can be obtained by evaluating the new queries and therefore the required communication overhead can be reduced.
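The details of the Merge-Partition algorithm are given in Section 4 and are not reproduced here. As a hedged illustration of the general idea of eliminating redundancy, the Python sketch below handles only a simple assumed case: every sub-query is a half-open integer range `[lo, hi)` on one attribute of the same source. The functions `partition` and `answer_all` and the toy `fetch` are hypothetical and should not be read as the paper's algorithm.

```python
from typing import Callable, Dict, List, Tuple

Range = Tuple[int, int]  # half-open interval [lo, hi) on a single attribute


def partition(ranges: List[Range]) -> List[Range]:
    """Split the union of the requested ranges into disjoint segments."""
    points = sorted({p for r in ranges for p in r})
    segments = []
    for lo, hi in zip(points, points[1:]):
        # Keep a segment only if at least one requested range covers it.
        if any(r_lo <= lo and hi <= r_hi for r_lo, r_hi in ranges):
            segments.append((lo, hi))
    return segments


def answer_all(
    ranges: List[Range], fetch: Callable[[Range], List[int]]
) -> Tuple[List[List[int]], int]:
    """Fetch each disjoint segment once, then rebuild every sub-query answer.

    Returns the answers (in the order of `ranges`) and the number of rows
    transferred from the source.
    """
    segments = partition(ranges)
    fetched: Dict[Range, List[int]] = {seg: fetch(seg) for seg in segments}
    answers = []
    for r_lo, r_hi in ranges:
        rows: List[int] = []
        for lo, hi in segments:
            if r_lo <= lo and hi <= r_hi:
                rows.extend(fetched[(lo, hi)])
        answers.append(rows)
    transferred = sum(len(rows) for rows in fetched.values())
    return answers, transferred


def fetch(seg: Range) -> List[int]:
    # Toy source: one row per integer key in the requested segment.
    return list(range(seg[0], seg[1]))


sub_queries = [(0, 10), (5, 15), (20, 25)]
answers, transferred = answer_all(sub_queries, fetch)
naive_transferred = sum(hi - lo for lo, hi in sub_queries)
```

In this example the overlapping keys 5 to 9 are requested by two sub-queries but transferred only once, so `transferred` is smaller than `naive_transferred`, while each sub-query still receives its full answer.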

The rest of the paper is organized as follows. Section 2 presents related work. Section 3 describes the execution model of our DIS. Section 4 proposes the Merge-Partition (MP) query reconstruction algorithm used in our DIS. Section 5 discusses the experiments that we conducted to evaluate our solution. Section 6 concludes the paper.

Related work

The IGNITE system proposed in Lee et al. (2007) was developed on top of the PostgreSQL database, and is the work most closely related to the work presented in this paper. IGNITE decouples the source wrappers from the execution engine (adopted from the PostgreSQL database), and enables the execution engine to send sub-queries to the same source, which makes data sharing across sub-queries possible. Meanwhile, IGNITE employs the iterator model proposed in Graefe (1993) so that sub-queries may have delay

Query engine

In this section, we discuss the execution engine of our DIS. The engine employs a data flow style execution model (Section 3.1). Based on this model, sub-queries can be gathered in a common place for evaluation through source wrappers (Section 3.2). In Section 3.3, we also discuss in detail why a delay is required for each request in order to better utilize data sharing.

Query reconstruction algorithm

In this section, we introduce the Merge-Partition (MP) reconstruction algorithm applied in our DIS. First, we model the problem of query reconstruction in Section 4.1. Then, in Section 4.2, we present the algorithm and show how it reconstructs a set of queries and computes their answers.

Evaluation

In this section, we present our evaluation method and results. The overall experimental setup is discussed in Section 5.1, followed by the detailed discussion of each experiment and its results in Section 5.2.

Conclusion

Distributed data sources can be heterogeneous, and managing, analyzing, and processing data from different sources in an integrated way is becoming increasingly important. Distributed data integration applications are typically processed on distributed infrastructures, where communication cost becomes the main factor in determining query response time. Query performance can therefore be improved by minimizing communication cost. The objective of this paper is to propose an

Acknowledgement

This work is supported by the Natural Science Foundation of China (60803121, 60773145, 60911130371, 90812001, 60963005), the National High-Tech R&D (863) Program of China (2009AA01A130, 2006AA01A101, 2006AA01A108, 2006AA01A111, 2006AA01A117), the MOE-Intel Foundation, and the Tsinghua National Laboratory for Information Science and Technology (TNLIST) Cross-discipline Foundation.

References (20)

  • Steven Lynden et al. The design and implementation of OGSA-DQP: A service-based distributed query processor. Future Generation Computer Systems (2009)
  • Rafiul Ahad et al. On estimating the cardinality of the projection of a database relation. ACM Transactions on Databases (1989)
  • Nilesh N. Dalvi et al. Pipelining in Multi-Query Optimization. In: PODS (2001)
  • Amol Deshpande et al. Adaptive query processing. Foundations and Trends in Databases (2007)
  • Fontes V, Schulze B, Dutra M, et al. CoDIMS-G: a data and program integration service for the grid. In: Proceedings of...
  • Goetz Graefe. Query evaluation techniques for large databases. ACM Comput Surv (1993)
  • Lise Getoor et al. Selectivity estimation using probabilistic models. SIGMOD (2001)
  • J. Goldstein et al. Optimizing queries using materialized views: a practical, scalable solution. SIGMOD (2001)
  • Gounaris Anastasios. Resource aware query processing on the grid. PhD thesis, School of Computer Science of the...
  • Harizopoulos S, Shkapenyuk V, Ailamaki A. QPipe: a simultaneously pipelined relational query engine. In: Proceedings of...
There are more references available in the full text version of this article.
