1 Introduction

Partitioning design is an important topic for shared-nothing, parallel database systems. In such environments, data is partitioned across multiple nodes, each of which is essentially an independent computer. In general, communication and data movement between nodes are much more expensive than relational operations within a single node, so a good partitioning plan should minimize them. The problem can be formally defined as [3]:

$$\begin{aligned} P^{*}=\mathop {\arg \min }\limits _{P}\sum _{Q\in W}f_Q\cdot Cost(Q,P) \end{aligned}$$

Here, Q is a statement in the workload W, and \({f_Q}\) is the weight of Q, which can be the multiplicity of Q in the workload; Cost(Q,P) is the cost of statement Q if the database is partitioned by P. Researchers have introduced a great many algorithms to search for the optimal partitioning plan [12,13,14], such as genetic algorithms, simulated annealing, hill climbing, and branch and bound. Even so, the problem is far from resolved.
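As a concrete (if simplistic) illustration of this objective, the sketch below picks the plan minimizing the weighted cost over a toy workload. The plan names, statements, and cost function are hypothetical stand-ins; a real system would use its optimizer's cost model for Cost(Q, P).

```python
def best_plan(candidate_plans, workload, cost):
    """Return the plan P minimizing sum_Q f_Q * Cost(Q, P).

    workload: list of (statement, weight) pairs, weight = f_Q.
    cost: function cost(statement, plan) -> estimated cost.
    """
    def total_cost(plan):
        return sum(f_q * cost(q, plan) for q, f_q in workload)
    return min(candidate_plans, key=total_cost)

# Toy usage: two candidate plans, two weighted statements (all hypothetical).
plans = ["partition_by_ID", "partition_by_NAME"]
workload = [("q_by_id", 3), ("q_by_name", 1)]

def toy_cost(q, p):
    # A statement is cheap on its matching plan, expensive otherwise.
    matched = {("q_by_id", "partition_by_ID"), ("q_by_name", "partition_by_NAME")}
    return 1 if (q, p) in matched else 10

print(best_plan(plans, workload, toy_cost))  # the ID plan wins for this mix
```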

On one hand, different statements may prefer different partitioning plans, and those plans may be mutually contradictory, so a one-size-fits-all partitioning plan cannot achieve the overall optimal performance for the workload.

On the other hand, in many real-world database systems, data is replicated to increase robustness and availability: Microsoft SQL Azure [5] keeps 3 full database replicas, Amazon's Relational Database Service [4] supports up to 5 read replicas of a MySQL database, and Facebook's TAO also keeps 3 replicas of its MySQL database [6]. In the current practice of partitioning design, all replicas are organized uniformly. These two contradictory aspects motivate our research effort.

In this paper, we address this problem with a multi-plan coexistence approach: replicas are organized heterogeneously by different partitioning plans, so different subsets of the workload can be routed to different replicas for better performance.

The contributions of our work can be summarized as follows.

  1.

    We analyze the causes of contradiction between statements. Our analysis reveals that access requirements on different attributes of the same table result in contradictory partitioning plans.

  2.

    We propose an approach of replica-aware partitioning design. We leverage the access graph of SQL statements and the database schema to measure the distance between statements, use k-medoids to cluster the workload, generate a partitioning plan for each cluster, and organize replicas with these plans.

  3.

    We evaluate our approach of Replica-Aware Partitioning Design (abbr. RAPD) and compare it with Divergent Design [7], Schism [8], and the Current Practice of Partitioning Design (abbr. CPPD). The evaluation results demonstrate that our approach can increase system performance by up to a factor of 4 compared with CPPD, and is more efficient in the process of searching for partitioning plans.

2 Related Work

Researchers have done much outstanding work on replica-aware partitioning design, such as CliqueSquare [9], Divergent Design [7], and Trojan Data Layouts [11].

Fig. 1. Comparison of process flows for generating partitioning plans

First, our process flow for generating partitioning plans differs (Fig. 1). Prior approaches use the DB2 Design Advisor [14] as a black box to generate partitioning plans, iteratively evaluating the plans with an objective function and repartitioning the workload until the plans satisfy some condition. In contrast, we first use a data mining algorithm to cluster the workload, and then use the branch-and-bound algorithm to search for the optimal partitioning plan for each subset of the workload. Second, we handle update statements differently: they assign the update statements to every cluster when generating partitioning plans, while we use Query Completion to address them.

Jindal et al. proposed the Trojan Layout [11]; our work differs in two aspects: (i) they focus on improving the performance of the Map phase of Map-Reduce jobs, while we focus on data partitioning in distributed database systems; (ii) the Trojan Layout is designed for organizing data blocks of distributed file systems, while our approach is designed for organizing replicas.

In addition to the works mentioned above, there is a large body of related work [8, 12,13,14,15]. The biggest difference is that most of it treats the workload as a whole and does not take replicas into consideration when searching for partitioning plans.

3 Motivation and Overview of Our Approach

In this section, we analyze the contradictions that motivate our research effort and provide an overview of our approach.

3.1 Analysis of Contradiction

Different SQL statements may prefer different partitioning plans. Assume a table table1 has two attributes, ID and Name, and consider two query statements:

  • SELECT * FROM table1 WHERE ID = ‘123456’

  • SELECT * FROM table1 WHERE NAME = ‘Michael’

Obviously, the optimal partitioning plans for these two query statements are contradictory: the first statement prefers table1 to be partitioned by ID, while the second prefers it to be partitioned by NAME.

These contradictions can be partially addressed by creating more indices, but indices need extra storage and are time-consuming to build and maintain, which limits administrators from creating too many of them. Since we already have full replicas of the tables, the contradictions can instead be addressed by organizing the replicas with different partitioning plans; more importantly, this does not increase the overall storage overhead.

In many real-world systems, tables are connected by foreign keys; for example, TPC-E has 33 tables, 27 of which have a total of 50 foreign keys, which makes the contradictions more prominent. In this paper, we make the assumption that two statements are contradictory if they access different attributes of the same table.

3.2 Overview of Our Approach

Compared with [3], we take replication into account; the problem of replica-aware partitioning design can then be defined as:

$$\begin{aligned} \{P_1^{*},\ldots ,P_n^{*}\}=\mathop {\arg \min }\limits _{P_1,\ldots ,P_n}\sum _{i=1}^{n}Cost(W_i,P_i) \end{aligned}$$

The core idea of our approach is to cluster contradictory statements into different clusters (\(W_1 \ldots W_n\)), generate a partitioning plan (\(P_1 \ldots P_n\)) for each of them, and then organize the replicas with these plans. \(Cost(W_i,P_i)\) is the cost of cluster \(W_i\) when the replica is partitioned by \(P_i\).

Fig. 2. Overview of our approach

Figure 2 shows an overview of our approach. First, we traverse the workload and use the database schema to compute the distance matrix of statements. Second, we use the k-medoids algorithm to cluster the workload. Third, we use the branch-and-bound algorithm to search for the optimal partitioning plan for each statement cluster. Finally, we use these partitioning plans to organize the replicas and route each statement to the appropriate replica.

4 Clustering the Workload

Our clustering algorithm is inspired by the well-known k-medoids, which raises two important issues: the distance function and the number of clusters.

4.1 Distance Function

Many types of data can be used to measure the distance between statements [2, 8, 10, 16]. Because the access graph captures the relationship between statements, we use it to measure their distance.

An access graph is derived from the workload; it describes the relationship between statements and attributes (Example 1). If a statement Q accesses an attribute A, then the corresponding cell has a value of 1, otherwise 0.

Example 1

A workload with 10 statements

figure c

According to the analysis described in Sect. 3.1, we consider two types of access graphs.

Attributes Belong to Different Tables. In Example 1, attributes A1–A4 belong to different tables. For workload-driven algorithms that automatically generate partitioning plans [3, 12,13,14], the smaller the difference in the attributes accessed by a set of statements, the more likely they are to produce a better partitioning plan. So if the number of replicas is 3, the clustering result is {{Q1}, {Q2, Q3, Q4}, {Q5, Q6, Q7, Q8, Q9, Q10}}, because in each cluster all statements access the same attributes. Inspired by CG-COST [3], we use the co-accessed attributes to measure the distance.

Let \(C_i\) denote the number of attributes accessed by \(Q_i\), and \(C_{i,j}\) the number of attributes co-accessed by \(Q_i\) and \(Q_j\); the distance is then defined as:

$$\begin{aligned} dist1(Q_i,Q_j)=1-\frac{2\cdot C_{i,j}}{C_i+C_j} \end{aligned}$$
(1)

We can easily verify that the absolute-error criterion (Eq. 2) of the clustering result of Example 1 is minimal, where k is the number of clusters and W is the workload.

$$\begin{aligned} E=\sum _{n=1}^{k}\sum _{Q_i,Q_j\in W_n}{}dist(Q_i,Q_j) \end{aligned}$$
(2)

Attributes Belong to the Same Table. In the left table of Example 2, A2 and A3 belong to the same table, while each of the other attributes belongs to its own table.

Example 2

A workload with 3 statements

figure d

If the number of clusters is 2, then the clustering result is \(\{\{Q1\},\{Q2, Q3\}\}\): by Eq. 1, the distance between Q1 and Q3 is 0.6, while that between Q2 and Q3 is 0.33. However, A2 and A3 are accessed by Q2 and Q3 respectively, and because A2 and A3 belong to the same table, this clustering result leads to contradictory partitioning plans for the table containing A2 and A3. To handle this case, we introduce \(D_{ij}\), the number of attributes that belong to the same table and are accessed, but not co-accessed, by \(Q_i\) and \(Q_j\). The distance is defined as:

$$\begin{aligned} dist2(Q_i,Q_j)=\frac{D_{ij}}{C_i+C_j} \end{aligned}$$
(3)

Adding dist2 to dist1, the new distances of Q1, Q3 and Q2, Q3 are 0.6 and 0.67 respectively, so the statement Q3 should be assigned to the cluster of Q1.

Based on the above analysis, we define the distance between \(Q_i\) and \(Q_j\) as:

$$\begin{aligned} dist(Q_i,Q_j)=dist1(Q_i,Q_j)+dist2(Q_i,Q_j). \end{aligned}$$
(4)
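Equations 1, 3, and 4 can be sketched as below. The exact counting rule for \(D_{ij}\) is not spelled out formally, so this sketch assumes one plausible reading: for each table, attributes accessed by only one of the two statements count toward \(D_{ij}\) only when the other statement also accesses some other attribute of that same table (a genuine contradiction). The access sets at the bottom are hypothetical but consistent with the distances quoted for Example 2.

```python
def dist1(a_i, a_j):
    """Eq. 1: a_i, a_j are the sets of attributes accessed by Q_i, Q_j."""
    c_ij = len(a_i & a_j)                       # co-accessed attributes
    return 1 - 2 * c_ij / (len(a_i) + len(a_j))

def dist2(a_i, a_j, tables):
    """Eq. 3: tables maps each attribute to its table name."""
    d_ij = 0
    for t in set(tables.values()):
        only_i = {a for a in a_i - a_j if tables[a] == t}
        only_j = {a for a in a_j - a_i if tables[a] == t}
        if only_i and only_j:   # both statements touch table t, on different attributes
            d_ij += len(only_i) + len(only_j)
    return d_ij / (len(a_i) + len(a_j))

def dist(a_i, a_j, tables):
    """Eq. 4: combined distance."""
    return dist1(a_i, a_j) + dist2(a_i, a_j, tables)

# Hypothetical access sets (A2 and A3 share a table, the rest have their own).
tables = {"A1": "T1", "A2": "T2", "A3": "T2", "A4": "T3", "A5": "T4"}
Q1, Q2, Q3 = {"A1", "A4"}, {"A2", "A4", "A5"}, {"A3", "A4", "A5"}
print(round(dist1(Q1, Q3), 2))          # 0.6
print(round(dist1(Q2, Q3), 2))          # 0.33
print(round(dist(Q2, Q3, tables), 2))   # 0.67 -> Q3 now clusters with Q1
```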

Workload Compression. We can use Eq. 4 to compute the distance between any two statements, which gives the distance matrix:

$$\begin{aligned} \mathcal {D}= \left[ \begin{array}{ccccc} 0&{} &{} &{} &{} \\ dist(Q_2, Q_1)&{}0&{} &{} &{} \\ dist(Q_3, Q_1)&{}dist(Q_3, Q_2)&{}0&{} &{}\\ \vdots &{}\vdots &{}\vdots &{}\ddots &{} \\ dist(Q_n, Q_1)&{}dist(Q_n, Q_2)&{}\cdots &{}\cdots &{}0 \end{array} \right] \end{aligned}$$
(5)

However, in many cases the workload is too large to fit into memory. In these cases, we can compress all statements that access the same attributes into one. For example, in Example 2, if 500 statements access the same attributes as Q1, 200 the same as Q2, and 700 the same as Q3, we can record this as in the right table of Example 2. Correspondingly, when computing the distance matrix, the factors \(C_{i,j}\) and \(D_{ij}\) must be multiplied by the corresponding counts.
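The compression step can be sketched as follows: statements touching identical attribute sets collapse into one representative carrying a multiplicity. The attribute sets reuse the hypothetical Example 2 sets from above and are illustrative only.

```python
from collections import Counter

def compress(workload):
    """workload: iterable of attribute sets accessed by statements.
    Returns a map {attribute set: number of statements accessing it}."""
    return Counter(frozenset(attrs) for attrs in workload)

# 500 statements like Q1, 200 like Q2, 700 like Q3 (hypothetical access sets).
raw = ([{"A1", "A4"}] * 500
       + [{"A2", "A4", "A5"}] * 200
       + [{"A3", "A4", "A5"}] * 700)
compressed = compress(raw)
print(len(compressed))                            # 3 representatives
print(compressed[frozenset({"A3", "A4", "A5"})])  # 700
```

The distance computation then runs over 3 representatives instead of 1400 statements, with the counts scaling the co-access terms.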

4.2 The Number of Clusters

The most thorough way to resolve the contradictions is to generate a partitioning plan for each statement of the workload [12], but this is infeasible and unnecessary, because it raises a great many problems, e.g., storage constraints, data consistency, and hardware cost.

An appropriate value is the number of replicas: we can use the existing replicas without increasing the overall storage overhead or design complexity, which is also consistent with our original idea. Therefore, we set the number of clusters equal to the number of replicas.

4.3 Clustering Algorithm

The input of the algorithm (Algorithm 1) is the workload and the number of clusters k. First, k statements are chosen randomly from the workload as the representative statements (initial medoids) of the k clusters (o[k]). The remaining statements are then assigned to the cluster whose medoid is nearest. Afterwards, the algorithm iteratively replaces a representative statement \(c_i\) with a non-representative statement a and computes the SwapCost, until the representative set o[k] no longer changes. In each iteration, if \(SwapCost<0\), then a becomes the new representative statement of the cluster and the other statements are reassigned; otherwise, the representative set is unchanged. The SwapCost is the difference between the new and the old absolute-error criterion (Eq. 2).

figure e
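A compact sketch in the spirit of Algorithm 1's medoid-swap loop is given below. It accepts any swap with negative SwapCost and repeats until no swap improves the criterion. Note one simplification: the criterion here is the standard distance-to-medoid form of k-medoids, whereas Eq. 2 sums pairwise distances within clusters; the swap logic is the same either way. The numeric "statements" in the usage are purely illustrative.

```python
import random

def k_medoids(statements, k, dist, seed=0):
    """Cluster statements around k medoids chosen by swap-based hill climbing."""
    rng = random.Random(seed)
    medoids = rng.sample(statements, k)        # random initial medoids

    def error(meds):
        # Absolute-error criterion: each statement charged to its nearest medoid.
        return sum(min(dist(s, m) for m in meds) for s in statements)

    current = error(medoids)
    improved = True
    while improved:                            # stop when no swap helps
        improved = False
        for i in range(k):
            for cand in statements:
                if cand in medoids:
                    continue
                trial = medoids[:i] + [cand] + medoids[i + 1:]
                if error(trial) < current:     # SwapCost < 0: accept the swap
                    medoids, current, improved = trial, error(trial), True
                    break

    clusters = {m: [] for m in medoids}        # final assignment
    for s in statements:
        clusters[min(medoids, key=lambda m: dist(s, m))].append(s)
    return clusters

# Toy usage with a numeric distance; real statements would use Eq. 4.
groups = k_medoids([1, 2, 3, 10, 11, 12], 2, lambda a, b: abs(a - b))
print(sorted(sorted(c) for c in groups.values()))  # [[1, 2, 3], [10, 11, 12]]
```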

5 Generating Partitioning Plan

We use the branch-and-bound algorithm, which is designed for discrete and combinatorial optimization problems, to search for the optimal partitioning plan. The solution space is organized as a tree, and the algorithm expands only the most promising node at each search stage.

The input of the algorithm (Algorithm 2) is the initial plan (RN) of a cluster, the storage bound (B), and the cluster (W). We use a DFS policy to choose the next tree node for expansion (NextNode). To improve pruning efficiency and reduce search time, we sort attributes and tables in descending order of access frequency. If the storage requirement of a new plan exceeds the constraint, the plan is pruned. Otherwise, we compute the cost of the new plan with the cost model of [13], and if the cost of the temporary optimal plan (OptimalPlan) is greater than that of the new plan, we replace the temporary optimal plan with the new one. The iteration ends when no partial plan can be expanded further, or when a specified length of time has elapsed.

figure f
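A schematic search in the spirit of Algorithm 2 is sketched below: tables, pre-sorted by access frequency, are assigned partitioning attributes one at a time by DFS, and a partial plan is pruned when its storage estimate exceeds the bound B. The `cost` and `storage` callables stand in for the paper's cost model [13] and are assumptions of this sketch, as is the toy usage.

```python
def branch_and_bound(tables, candidates, cost, storage, bound):
    """tables: list ordered by descending access frequency.
    candidates: {table: candidate partitioning attributes}.
    Returns the cheapest complete plan {table: attribute} within the bound."""
    best = {"plan": None, "cost": float("inf")}

    def expand(i, partial):                    # DFS over the plan tree
        if storage(partial) > bound:
            return                             # prune: storage constraint violated
        if i == len(tables):                   # complete plan: evaluate its cost
            c = cost(partial)
            if c < best["cost"]:
                best["plan"], best["cost"] = dict(partial), c
            return
        t = tables[i]
        for attr in candidates[t]:
            partial[t] = attr
            expand(i + 1, partial)
            del partial[t]                     # backtrack

    expand(0, {})
    return best["plan"]

# Toy usage: per-attribute costs, storage measured as plan size (illustrative).
attr_cost = {"a": 1, "b": 5, "c": 2, "d": 1}
toy_cost = lambda plan: sum(attr_cost[v] for v in plan.values())
plan = branch_and_bound(["T1", "T2"], {"T1": ["a", "b"], "T2": ["c", "d"]},
                        toy_cost, storage=len, bound=10)
print(plan)  # {'T1': 'a', 'T2': 'd'}
```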

6 Routing Statements

In our approach, different replicas are organized with different plans, so the response time of a statement may vary greatly across replicas. To forward each statement to its preferred replica, we use a triplet to maintain the mapping between statement, workload cluster, and replica.

$$\langle \text {Statement ID},\ \text {Cluster ID},\ \text {Replica ID}\rangle $$

On receiving a statement, the server can forward it directly by looking it up in the triplets; however, two types of statements need to be handled carefully.

6.1 New Statement

A New Statement is a statement that never occurred during clustering or partitioning-plan generation, so the server cannot find its information in the triplets.

To forward this type of statement, we first calculate the sum of the distances from this statement to each cluster, that is, \(\sum _{s\in C_i}^{}dist(n, s)\), where n denotes the New Statement. We then assign the New Statement to the cluster with the minimal sum, and route it to the corresponding replica.
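The routing decision can be sketched as below: known statements go through the triplet lookup, and a New Statement goes to the cluster minimizing its summed distance. The data structures and the usage data are illustrative assumptions, not the paper's implementation.

```python
def route(stmt_id, attrs, triplets, clusters, cluster_replica, dist):
    """triplets: {statement_id: (cluster_id, replica_id)} (the paper's triplet).
    clusters: {cluster_id: [attribute sets of member statements]}."""
    if stmt_id in triplets:                    # known statement: direct lookup
        return triplets[stmt_id][1]
    # New Statement: pick the cluster with the minimal summed distance.
    best_cluster = min(clusters,
                       key=lambda c: sum(dist(attrs, s) for s in clusters[c]))
    return cluster_replica[best_cluster]

# Toy usage with dist1 of Eq. 1 as the distance (hypothetical data).
jdist = lambda a, b: 1 - 2 * len(a & b) / (len(a) + len(b))
triplets = {"q7": (0, "replica-0")}
clusters = {0: [{"ID"}], 1: [{"NAME"}]}
cluster_replica = {0: "replica-0", 1: "replica-1"}
print(route("q7", {"ID"}, triplets, clusters, cluster_replica, jdist))   # lookup hit
print(route("q99", {"ID"}, triplets, clusters, cluster_replica, jdist))  # new statement
```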

6.2 Update Statement

Unlike a query statement, an update statement has to be executed on all replicas, and it can be time-consuming to execute it on a replica whose partitioning plan was generated without that statement. For example, suppose three replicas are organized with P1, P2, and P3, which partition table T on attributes C1, C2, and C3 respectively. If the update statement only references C1, then the replica organized by P1 can complete the statement faster than the others.

We present a method called Query Completion to address this issue. Figure 3 shows the process flow of an update statement. We first update the replica partitioned by P1 (line 1), then construct a read-only query to retrieve the tuples affected by the former update (line 2), and finally complete the update query with C2 and C3 and send the completed statement to the other replicas (line 3).

Fig. 3. Process flow of update statement
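The three steps of Query Completion can be sketched as string construction for the scenario above (T partitioned on C1, C2, C3 across the three replicas). All SQL text and helper names here are illustrative placeholders, not the paper's implementation or a real client API.

```python
def complete_update(new_c1, key):
    """Build the three statements of Query Completion for the C1-only update."""
    # Step 1: run the original update on the replica partitioned by C1,
    # where the predicate on C1 makes it a single-site statement.
    update_p1 = f"UPDATE T SET C1 = '{new_c1}' WHERE C1 = '{key}'"

    # Step 2: read back the affected tuples to learn their C2 and C3 values
    # (here we look them up via the new C1 value, since step 1 already ran).
    readback = f"SELECT C2, C3 FROM T WHERE C1 = '{new_c1}'"

    # Step 3: complete the update with C2/C3 predicates so the replicas
    # partitioned on C2 and C3 can also execute it as single-site statements.
    def completed(c2, c3):
        return (f"UPDATE T SET C1 = '{new_c1}' "
                f"WHERE C2 = '{c2}' AND C3 = '{c3}'")

    return update_p1, readback, completed

u, r, comp = complete_update("X", "Y")
print(u)                 # statement for the P1 replica
print(comp("a", "b"))    # completed statement for the P2/P3 replicas
```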

7 Experimental Evaluation

The main objective of our evaluation is to show that organizing replicas with different partitioning plans can improve system performance. We therefore compare our approach (RAPD) with the current practice of single-plan partitioning design (CPPD), Schism [8], and Divergent Design [7].

7.1 Data Sets and Experimental Platform

We choose TPC-E, TPC-H and a real-world system NACUES (National College and University Enrollment System) as the experimental datasets. The basic statistics are summarized in Table 1.

Table 1. Summary statistics of datasets

TPC-E and TPC-H are widely used in data management experiments. The real-world dataset NACUES comes from a system used by nearly 3000 colleges and universities in China to enroll students from all over the country. It contains students' scores, profiles, physical examinations, and applications, basic information about colleges and universities, enrollment plans, etc.

We use MyCat [1] as the distributed DBMS, with MySQL as the underlying database. It supports distributed transactions and database partitioning, which fully meets our experimental needs. We set up a cluster of 16 commodity servers, each with 32 GB RAM, 2 Intel Xeon E5 processors, and a 2 TB hard disk; all servers are connected through a 1000 Mbps LAN.

7.2 Proportion of Single-Site Statements

In general, a single-site statement consumes fewer resources than a multiple-site statement, so the proportion of single-site statements is an important measure of partitioning design. Figure 4 shows this proportion for different numbers of deployed plans.

Fig. 4. Proportion of single-site statements

The proportions increase rapidly with the number of partitioning plans; the underlying reason is that the contradictions in the workload are gradually resolved as more plans are deployed. This result shows that our approach is effective in resolving the contradictions in the workload.

7.3 Performance of Our Approach

To evaluate the impact on system performance, we compare RAPD with CPPD, Schism, and Divergent Design for different numbers of replicas. We execute all statements of the different datasets and compare the execution times of the four approaches, shown in Fig. 5.

Fig. 5. Performance of our approach

On the whole, the execution time of every approach except CPPD becomes shorter as the number of replicas increases. In CPPD, all replicas have the same partitioning plan, so the execution time is not very sensitive to the number of replicas. The execution times are reduced substantially in our approach (RAPD). Take NACUES in Fig. 5(d) for example: CPPD takes more than 80 s to finish the 130 queries of NACUES, while our approach takes only 9 s. The reason is that the more partitioning plans are deployed, the more contradictions between statements can be resolved, which raises the proportion of single-site statements and reduces the execution time. Furthermore, RAPD is 1.5–2 times faster than Divergent Design across different numbers of replicas, which means the clustering result in RAPD generates better partitioning plans and resolves the contradictions more efficiently.

We also notice that the performance of Schism on NACUES is not as good as on TPC-E and TPC-H. The access graph of NACUES created in Schism is more complicated: given the same number of vertices, it has far more edges than those of TPC-E and TPC-H, so the graph partitioning algorithm used in Schism cannot find a good way to partition the graph, leading to a limited improvement in system performance.

7.4 Comparison of Execution Time

To compare execution time with Divergent Design, we carried out an experiment on TPC-E: the workload has 100–500 statements, the number of replicas is 3, and the black box used in Divergent Design is the same branch-and-bound algorithm we use in RAPD. The total execution time of searching for partitioning plans in our approach consists of the pre-processing and clustering time of the workload, plus the execution time of branch and bound. Figure 6 shows the result.

Fig. 6. Total execution time of searching partitioning plans

Compared with Divergent Design, the execution time of RAPD is much shorter and is barely affected by the number of statements. This is because RAPD clusters the workload first and then searches for the partitioning plans. In contrast, in each iteration of its workload clustering, Divergent Design needs to call the branch-and-bound algorithm, which is very time-consuming.

8 Future Work

Two of the most crucial issues in our approach are identifying contradictions and clustering the workload. There is only a small body of literature on these two issues [12]. The analysis and assumption we made are only tentative, and our workload clustering algorithm is quite straightforward. In the future we will try other approaches to make them more accurate; formal definitions and proofs are also future work.

Another direction is the algorithm for searching for optimal partitioning plans. Although researchers have already done concrete work in this area, the problem is still far from being solved. If the search algorithm cannot find suitable partitioning plans, different query clusters may converge to similar plans, a phenomenon we encountered in our experiments when choosing algorithms.

9 Conclusion

In this paper, we present a replica-aware partitioning design to resolve the contradictions in a workload. Our approach uses the access graph of statements to measure their distance, clusters the workload, and generates a partitioning plan for each cluster. By introducing the necessary modifications and data structures, replicas with different partitioning plans can work properly and more effectively. The experimental evaluation shows that our approach can improve system performance by up to 4x over the current practice of partitioning design.