Elsevier

Information Systems

Volume 82, May 2019, Pages 17-32
MIND: An approach to optimize communication time via middleware tuning

https://doi.org/10.1016/j.is.2018.12.005

Abstract

Minimizing the communication time due to the transfer over a network of the intermediary results produced during the execution of a distributed query is a fundamental problem in distributed database management systems. We take a new look at this problem by investigating the relationship between the communication time and a remote data access middleware. We focus on two middleware parameters that are manually tuned by database administrators or programmers: the fetch size (i.e., the number of tuples that are communicated at once) and the message size (i.e., the size of the buffer at the middleware level). We present an experimental study which shows that these parameters have a crucial impact on the communication time. Then, we propose the MIND framework, which tunes the aforementioned middleware parameters, while adapting to different queries (that may vary in terms of selectivity) and networks (that may vary in terms of bandwidth). The main technical contributions of MIND are (i) a communication time estimation function that takes into account the middleware parameters, the size of the query result and the network environment, and (ii) an iterative optimization algorithm to find the fetch size and the message size that allow a good trade-off between low resource consumption and low communication time. We conclude with an experimental study that emphasizes the effectiveness of the MIND framework.

Introduction

Data transfer over a network is an inherent task of distributed query processing in the various existing distributed data management architectures [1], [2]. Despite the tremendous advances made in networking and telecommunication technology on the one hand, and in distributed computing and data management techniques on the other, the cost of data transfer (hereafter called communication time) is still an important source of performance problems. This is mainly due to the size of the intermediate results that are produced during the execution of a distributed query plan and need to be transmitted over the network to be processed by subsequent operations of the plan.

As a consequence, minimizing the communication time has long been recognized as one of the major research challenges in the distributed data management area [1], [2]. A long-standing research effort has been devoted to this problem, leading to the development of various distributed query optimization techniques, such as distributed and parallel join algorithms [1], row blocking [2], query batching [3], [4], and prefetching and caching techniques [5], to mention just a few.

In this paper, we take a complementary look at the problem of optimizing the time for communicating query results in a distributed environment, by focusing on how data is transferred over a network. To achieve this goal, we investigate the relationship between the communication time and the middleware configuration. Indeed, today, most programs (including application programs, DBMSs, and modern massively parallel frameworks like Apache Hive and Apache Spark) interact with data management systems using a remote data access middleware such as ODBC [3], JDBC [4], or a proprietary middleware [6]. A remote data access middleware (or simply, a middleware in the sequel) is a layer on top of a network protocol that is in charge of managing the connectivity and data transfer between a client application and a data server in distributed and heterogeneous environments. Of particular interest to us, a middleware determines how data is divided into batches and messages before being communicated over the network. As we demonstrate in the sequel, this drastically impacts the communication time. We analyze the middleware-based communication model and we empirically identify two middleware parameters that have a crucial impact on the communication time:

  • the fetch size, denoted F, which defines the number of tuples in a batch that is communicated at once to an application consuming the data, and

  • the message size, denoted M, which defines the size in bytes of the middleware buffer and corresponds to the amount of data that can be communicated at once from the middleware to the network.

The parameters F and M can be tuned in almost all standard or DBMS-specific middleware [3], [4], [7], where they are usually set manually by database administrators or programmers. The main thesis of this work is that tuning the middleware parameters F and M is (i) an important problem, because these parameters have a great impact on the communication time of a query result and on resource consumption (in particular, memory consumption), and (ii) a non-trivial task, because the optimal values of the parameters are query-dependent and network-dependent.

We briefly illustrate our thesis via Example 1.

Example 1

We consider the following three queries having different results and tuple sizes:

  • Q1: result of 32GB=165M tuples × 205B/tuple;

  • Q3: result of 4.5GB=165M tuples × 27B/tuple;

  • Q6: result of 1.5KB=55 tuples × 27B/tuple.

Moreover, we take two networks: high-bandwidth (10Gbit/s) and low-bandwidth (50Mbit/s). Finally, we consider the following three different middleware configurations:

  • Configuration C1: F=110K tuples and M=4KB;

  • Configuration C2: F=22K tuples and M=32KB;

  • Configuration C3: F=110 tuples and M=1.5KB.

We make the following observations:

  • (i) The communication time is sensitive to the middleware configuration. We report in Table 1 the communication times (in seconds) for Q1, Q3 and Q6, in the high-bandwidth network. For each query, we observe that different middleware configurations yield dramatically different communication times. For example, the communication time needed to transfer the result of query Q1 varies from 20.48 s in configuration C2 to 833.15 s in configuration C3.

  • (ii) The best middleware configuration is query-dependent. We consider again Table 1, which reports the communication times (in seconds) for Q1, Q3 and Q6, in the high-bandwidth network. We observe that C1 is the best configuration for Q3, whereas C2 is the best configuration for Q1 and Q6.

  • (iii) The best middleware configuration is network-dependent. We report in Table 2 the communication times (in seconds) for Q3, in both high- and low-bandwidth networks. We observe that C1 is the best configuration for Q3 in the high-bandwidth network, whereas C2 is the best for Q3 in the low-bandwidth one.

Moreover, to illustrate an additional dimension of the optimization problem besides the communication time, we report in Table 1 the memory consumption, which corresponds to the amount of memory used by the middleware at the destination site to store a batch of the data being transferred. It is worth noting that memory is a critical resource at the middleware level: in practical situations, several queries are executed simultaneously, and hence the use of inappropriate configurations could lead the destination site to run out of memory. To avoid such a situation, most current technical documentation is rather conservative and tends to recommend small values for the F parameter at the expense of the communication time [4]. Table 1 shows that the amount of memory used by the middleware varies depending on both the query and the considered configuration. Moreover, we observe that, for the three considered queries, the configuration C3 gives the worst communication times while being optimal in terms of memory consumption. In the case of the queries Q1 and Q3, the configurations C1 and C2 require more memory than C3 but drastically improve the communication times compared to C3. This is not true in the case of query Q6, where the configuration C2 uses much more memory than C3 (an increase by a factor of nearly 300) but does not significantly improve the communication time. This shows that increasing memory consumption does not necessarily lead to a proportional improvement in the communication time.
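To give a sense of the scale involved in Example 1, the following minimal Python sketch recomputes the result sizes and counts how many batches and messages each configuration would produce. It assumes a simple split model in which a result is cut into batches of F tuples and each batch into messages of at most M bytes; this model, the helper names, and the reading of "K" as 1000 for tuples and 1024 for bytes are illustrative assumptions, not details taken from the paper. The sketch does not attempt to reproduce the times or memory figures of Tables 1 and 2.

```python
# Figures quoted in Example 1: (number of tuples, bytes per tuple).
QUERIES = {"Q1": (165_000_000, 205), "Q3": (165_000_000, 27), "Q6": (55, 27)}

# (F in tuples, M in bytes). Assumption: "K" means 1000 for tuples, 1024 for bytes.
CONFIGS = {"C1": (110_000, 4 * 1024), "C2": (22_000, 32 * 1024), "C3": (110, 1536)}


def ceil_div(a, b):
    """Integer ceiling division, avoiding floating-point rounding."""
    return -(-a // b)


def result_bytes(query):
    """Total result size in bytes: tuples times bytes per tuple."""
    n_tuples, tuple_bytes = QUERIES[query]
    return n_tuples * tuple_bytes


def batches_and_messages(query, config):
    """Return (number of batches, messages in one full batch).

    Hypothetical split model: batches of F tuples, each cut into
    messages of at most M bytes.
    """
    n_tuples, tuple_bytes = QUERIES[query]
    fetch_size, message_size = CONFIGS[config]
    n_batches = ceil_div(n_tuples, fetch_size)
    tuples_per_batch = min(fetch_size, n_tuples)
    messages_per_batch = ceil_div(tuples_per_batch * tuple_bytes, message_size)
    return n_batches, messages_per_batch


if __name__ == "__main__":
    for q in QUERIES:
        print(q, "result size:", result_bytes(q), "bytes")
        for c in CONFIGS:
            print("  ", c, batches_and_messages(q, c))
```

Under these assumptions, Q1 needs 1500 batches with C1 but 1.5 million batches with C3, whereas Q6 fits in a single batch and a single message under C2, which is consistent with the observation that a larger configuration brings little benefit for a very small result.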

In this paper, we present MIND (MIddleware tuNing by the DBms), a framework for tuning the fetch size F and the message size M while preserving a trade-off between communication time and resource consumption. Our approach is (i) automatic (to alleviate the effort of database administrators and programmers), (ii) query-adaptive (since every query has its own optimal middleware parameters), and (iii) network-adaptive (since every network has its own optimal middleware parameters).

Our main contributions are as follows:

  • We present an experimental study (Section 3) whose goal is to show that the middleware configuration has a crucial impact on the time of communicating query results, and that research efforts are needed to integrate the middleware parameters F and M into the DBMS optimizer. Our study is extensive in the sense that we present a total of 43K tests, spread over 7K distinct scenarios (two networks of different bandwidth × six queries of different selectivity × up to 629 different middleware configurations, depending on the result tuple size of each query). In particular, we show that the values of the middleware parameters F and M that minimize the communication time are query- and network-dependent. Moreover, we point out that none of the current recommendations found in technical documentation (e.g., [3], [4], [7]) for tuning the middleware parameters is able to find the optimal values, since such strategies do not take into account the query- and network-dependency.

  • We propose a middleware-aware communication time estimation function that differentiates between messages depending on their position in a batch (Section 4). Moreover, to take into account the network environment, we present an effective strategy for calibrating the network-dependent parameters of the communication time estimation function.

  • We consider an optimization problem that consists in computing the values of the parameters F and M that give a trade-off between resource consumption, expressed in terms of F and M, and communication time. We rely on an iterative approach that starts with initial (small) values of the two middleware parameters F and M, and iterates to improve the estimation by updating these values. This allows us to quickly find (always in less than a second) values of the middleware parameters for which the improvement in communication time estimation between two consecutive iterations is not significant compared to the price to pay in terms of resource consumption. In practice, this translates to a good trade-off between low resource consumption and low communication time. The optimization algorithm is presented in Section 5.

  • We present an evaluation of the MIND framework using both real world and synthetic queries (Section 6). In particular, we point out the improvement that we obtain over the current strategies for middleware tuning in terms of trade-off between communication time and resource consumption. We also discuss the query- and network-adaptivity of MIND.
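The iterative approach mentioned in the contributions above can be pictured with the following Python sketch. It is only an illustration of the general idea (start from small values, grow F or M, and stop once the estimated gain becomes small); it is not the algorithm of Section 5, which is not reproduced here. The function name `tune`, the doubling step, the stopping threshold `min_gain`, and the estimator passed as `estimate` are all hypothetical choices made for this example.

```python
def tune(estimate, f0=100, m0=1024, growth=2, min_gain=0.05, max_iter=64):
    """Greedy sketch: grow F or M while the estimated time keeps improving enough.

    estimate(F, M) is any function returning an estimated communication time.
    The loop stops when the best relative gain of one more growth step falls
    below min_gain, so that F and M (and hence resource use) stay small.
    """
    f, m = f0, m0
    current = estimate(f, m)
    for _ in range(max_iter):
        # Candidate moves: enlarge the fetch size or the message size.
        candidates = [(f * growth, m), (f, m * growth)]
        best = min(candidates, key=lambda fm: estimate(*fm))
        best_time = estimate(*best)
        gain = (current - best_time) / current if current > 0 else 0.0
        if gain < min_gain:
            break  # further growth is not worth the extra resources
        (f, m), current = best, best_time
    return f, m, current


if __name__ == "__main__":
    # Toy estimator (assumption): a per-batch overhead that shrinks as F grows,
    # plus a fixed transfer time that no tuning can remove.
    print(tune(lambda f, m: 1000 / f + 1))
```

With the toy estimator in the usage example, the loop keeps doubling F until one more doubling would improve the estimate by less than 5%, and then stops, which mirrors the trade-off between low resource consumption and low communication time described above.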

Related work

Existing state-of-the-art DBMSs do not support automatic tuning of the middleware parameters. Moreover, to the best of our knowledge, the database research community does not have well-established strategies for middleware tuning. It is currently the task of database administrators and programmers to manually tune the middleware to improve the system performance. Although existing technical documentation (e.g., [4], [7]) puts forward some recommendations, none of which being query- and

Impact of the middleware

In this section, we present an experimental study emphasizing that the middleware configuration has a crucial impact on the time of communicating query results. We present the considered distributed architecture in Section 3.1, the experimental setup in Section 3.2, and we discuss our empirical observations in Section 3.3.

Communication cost model

This section is devoted to the presentation and the evaluation of the communication cost function of the MIND framework.
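As a rough illustration of what a middleware-aware estimate might look like, the Python sketch below charges a different cost to the first message of a batch than to the following ones, reflecting the idea that messages are differentiated by their position in a batch. The function name, the two cost parameters `first_msg_cost` and `next_msg_cost`, and the split of a batch into messages of at most M bytes are assumptions made for this example; the actual MIND function and its calibration are not shown in this excerpt.

```python
def estimate_time(n_tuples, tuple_bytes, fetch_size, message_size,
                  first_msg_cost, next_msg_cost):
    """Hypothetical estimate: sum of per-batch costs over the whole result.

    first_msg_cost: cost of the first message of a batch (e.g. it may
                    include the request round trip), in seconds.
    next_msg_cost:  cost of every other message of the batch, in seconds.
    Both would have to be calibrated for a given network (assumption).
    """
    def batch_cost(tuples_in_batch):
        # Number of messages needed for this batch (integer ceiling).
        n_messages = -(-tuples_in_batch * tuple_bytes // message_size)
        return first_msg_cost + (n_messages - 1) * next_msg_cost

    full_batches, last_batch_tuples = divmod(n_tuples, fetch_size)
    total = full_batches * batch_cost(fetch_size)
    if last_batch_tuples:
        total += batch_cost(last_batch_tuples)  # final, partially filled batch
    return total


if __name__ == "__main__":
    # 120 tuples of 10 bytes, F = 50 tuples, M = 100 bytes (toy values).
    print(estimate_time(120, 10, 50, 100, first_msg_cost=2.0, next_msg_cost=1.0))
```

In this toy model, a larger F reduces the number of costly first messages, while a larger M reduces the number of messages per batch, which is one simple way both parameters could enter an estimate.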

Middleware optimization

We consider the problem of computing the values of F and M that minimize the estimated communication cost of a query Q, while preserving a trade-off w.r.t. resource consumption. Indeed, when F and/or M increase, more resources are consumed, and hence focusing only on the minimization of the estimation function is not the most interesting solution from a practical point of view. In the sequel, we explicitly quantify the trade-off between communication time and resource consumption and then we

Evaluation of MIND

This section presents an evaluation of the MIND framework using the astronomical dataset of 34GB introduced in Section 3.2 and the 17 queries of Tables 3 and 5. Note that the real queries of Table 3 are evaluated over the two network configurations (high- and low-bandwidth) while, for practical reasons mainly due to the response time, the synthetic queries of Table 5 are evaluated only in the context of a high-bandwidth network. In the sequel, we point out three main issues: (i) improvement

Concluding remarks and future work

In this paper, we showed that the middleware configuration has a major impact on the communication time of a query result in a distributed environment. Then, we presented the MIND framework, which tunes two middleware parameters (the fetch size F and the message size M), while adapting to different queries (that vary in terms of selectivity) and network environments (that vary in terms of bandwidth). The main technical contributions of MIND are a communication time estimation function (that

Acknowledgments

Part of this work is funded by the CNRS, France MASTODONS PetaSky project and the LabEx, France Imobs3. The experiments were performed on the Galactica platform funded by the CNRS, France PlaSciDo program, the European Commission (Feder funds) and the Région Auvergne, France.

We are also grateful to the referees for their comments, which helped us considerably improve the presentation of the paper.

References (23)

  • Sattler K. et al. QUIET: continuous query-driven index tuning.
  • Özsu M.T. et al. Principles of Distributed Database Systems (2011).
  • Kossmann D. The state of the art in distributed query processing. ACM Comput. Surv. (2000).
  • Geiger K. Inside ODBC (1995).
  • Shirazi J. Java Performance Tuning (2003).
  • Ramachandra K. et al. Holistic optimization by prefetching query results.
  • Bulumulle G. Oracle middleware layer Net8 performance tuning utilizing underlying network protocol,...
  • Burleson D.K. Oracle Tuning.
  • Haas L.M. et al. Optimizing queries across diverse data sources.
  • Mackert L.F. et al. R* optimizer validation and performance evaluation for distributed queries.
  • Beame P. et al. Communication steps for parallel query processing.
