Skip to main content
Log in

MyBenchmark: generating databases for query workloads

  • Regular Paper
  • Published:
The VLDB Journal Aims and scope Submit manuscript

Abstract

To evaluate the performance of database applications and database management systems (DBMSs), we usually execute workloads of queries on generated databases of different sizes and then benchmark various measures such as respond time and throughput. This paper introduces MyBenchmark, a parallel data generation tool that takes a set of queries as input and generates database instances. Users of MyBenchmark can control the characteristics of the generated data as well as the characteristics of the resulting workload. Applications of MyBenchmark include DBMS testing, database application testing, and application-driven benchmarking. In this paper, we present the architecture and the implementation algorithms of MyBenchmark. Experimental results show that MyBenchmark is able to generate workload-aware databases for a variety of workloads including query workloads extracted from TPC-C, TPC-E, TPC-H, and TPC-W benchmarks.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Fig. 1
Fig. 2
Fig. 3
Fig. 4
Fig. 5
Fig. 6
Fig. 7
Fig. 8
Fig. 9
Fig. 10
Fig. 11
Fig. 12
Fig. 13
Fig. 14
Fig. 15
Fig. 16

Similar content being viewed by others

Notes

  1. In case MyBenchmark generates more than one test database, we may use a database testing framework (e.g., DbUnit [1], HTpar [17]) to automatically assign the generated databases to the test queries.

  2. In this case, the developers need to specify only the output cardinalities of the final results and may leave the constraints of intermediate operators empty.

  3. In fact, \(Q_2\) must annotate consistent constraints with \(Q_1\) (e.g., two tuples for table \(R\)) or otherwise MyBenchmark will return an error to the user.

  4. Note that \(G_a\) is not necessarily complete.

  5. Although TPC-E has more queries than TPC-C, their maximum \(|\mathcal {R}|\) values are close because TPC-E also has more tables and the parametric predicates often access different attributes of different tables.

  6. TPC-H has a much higher \(|R|\) value than the other workloads because TPC-H queries often have parametric predicates on the same attributes of the fact table (e.g., the l_shipdate attribute in lineitem table).

  7. A bushy tree is a balanced tree.

  8. SQP and data instantiation are not our focus as long as they scale.

References

  1. Dbunit. http://www.dbunit.org

  2. DTM Data Generator. http://www.sqledit.com/dg/

  3. IBM DB2 Test Database Generator. http://www-306.ibm.com/software/data/db2imstools/db2tools/db2tdbg/

  4. Ahuja, R.K., Magnanti, T.L., Orlin, J.B.: Network Flows: Theory, Algorithms, and Applications. Prentice-Hall Inc, Upper Saddle River, NJ (1993)

    MATH  Google Scholar 

  5. Arasu, A., Kaushik, R., Li, J.: Data generation using declarative constraints. In: SIGMOD Conference, pp. 685–696 (2011)

  6. Binnig, C., Kossmann, D., Lo, E.: Reverse query processing. In: ICDE (2007)

  7. Binnig, C., Kossmann, D., Lo, E., Ozsu, M.T.: QAGen: generating query aware test databases. In: SIGMOD (2007)

  8. Bruno, N., Chaudhuri, S.: Flexible database generators. In: VLDB, pp. 1097–1107 (2005)

  9. Bruno, N., Chaudhuri, S., Thomas, D.: Generating queries with cardinality constraints for DBMS testing. In: TKDE (2006)

  10. Chays, D., Deng, Y., Frankl, P.G., Dan, S., Vokolos, F.I., Weyuker, E.J.: An AGENDA for testing relational database applications. Software testing, verification and reliability (2004)

  11. Chen, M.-S., Yu, P.S., Wu, K.-L.: Optimization of parallel execution for multi-join queries. IEEE Trans. Knowl. Data Eng. 8, 416–428 (1996)

    Article  Google Scholar 

  12. Cherkassky, B.V., Goldberg, A.V.: On implementing the push-relabel method for the maximum flow problem. Algorithmica 19(4), 390–410 (1997)

    Article  MATH  MathSciNet  Google Scholar 

  13. Cluet, S., Moerkotte, G.: On the complexity of generating optimal left-deep processing trees with cross products. In: ICDT, pp. 54–67 (1995)

  14. Garey, M.R., Johnson, D.S.: Computers and Intractability; A Guide to the Theory of NP-Completeness. W. H. Freeman & Co. New York, NY, USA (1990)

  15. Ghazal, A., Rabl, T., Hu, M., Raab, F., Poess, M., Crolotte, A., Jacobsen, H.-A.: Bigbench: towards an industry standard benchmark for big data analytics. In: SIGMOD Conference, pp. 1197–1208 (2013)

  16. Gray, J., Sundaresan, P., Englert, S., Baclawski, K., Weinberger, P.J.: Quickly generating billion-record synthetic databases. In: SIGMOD, pp. 243–252 (1994)

  17. Haftmann, F., Kossmann, D., Lo, E.: A framework for efficient regression tests on database applications. VLDB J. 16(1), 145–164 (2007)

    Article  Google Scholar 

  18. He, Z., Kunjir, M., Lim, H., Lo, E., Babu, S., Hsu, M., Castellanos, M.: Benchmarking “no one size fits all” big data analytics. In: The Third Workshop on Big Data Benchmarking (2013)

  19. Hoag, J.E., Thompson, C.W.: A parallel general-purpose synthetic data generator. SIGMOD Rec. 36(1), 19–24 (2007)

    Article  Google Scholar 

  20. Houkjær, K., Torp, K., Wind, R.: Simple and realistic data generation. In: VLDB, pp. 1243–1246 (2006)

  21. Kini, A., Shankar, S., Naughton, J.F., DeWitt D.J.: Database support for matching: limitations and opportunities. In: SIGMOD Conference, pp. 85–96 (2006)

  22. Lo, E., Binnig, C., Kossmann, D., Ozsu, M.T., Hon, W.-K.: A framework for testing DBMS features. VLDB J. 19(2), 203–230 (2010)

    Article  Google Scholar 

  23. Lo, E., Cheng, N., Hon, W.-K.: Generating databases for query workloads. PVLDB 3(1), 848–859 (2010)

    Google Scholar 

  24. Mannila, H., Räihä, K.-J.: Test data for relational queries. In: PODS, pp. 217–223 (1986)

  25. Mishra, C., Koudas, N., Zuzarte, C.: Generating targeted queries for database testing. In: SIGMOD, pp. 499–510 (2008)

  26. Neufeld, A., Moerkotte, G., Lockemann, P.C.: Generating consistent test data for a variable set of general consistency constraints. VLDB J. 2(2), 173–213 (1993)

    Article  Google Scholar 

  27. Olston, C., Chopra, S., Srivastava, U.: Generating example data for dataflow programs. In: SIGMOD Conference, pp. 245–256 (2009)

  28. Rabl, T., Poess, M.: Parallel data generation for performance analysis of large, complex rdbms. In: DBTest 2011, p. 5 (2011)

  29. Sellis, T.K.: Multiple-query optimization. ACM Trans. Database Syst. 13(1), 23–52 (1988)

    Google Scholar 

  30. Stephens, J.M., Poess, M.: Mudd: a multi-dimensional data generator. In: WOSP, pp. 104–109 (2004)

  31. Tay, Y.: Data generation for application-specific benchmarking. In: PVLDB, p. 4 (2011)

Download references

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Eric Lo.

Additional information

Research supported by Grant PolyU 525009E from Hong Kong RGC.

Appendix: Proofs

Appendix: Proofs

1.1 Proof of \(k\)-SAT-MATCH is \(\mathcal {NP}\)-complete

Theorem 1

Problem \(k\)-SAT-MATCH is \(\mathcal {NP}\)-complete.

We begin with proving \(k\)-SAT-MATCH is in \(\mathcal {NP}\) and further show that it is \(\mathcal {NP}\)-hard by a reduction from the \(\mathcal {NP}\)-complete problem known as X3C (Exact Cover by 3-set).

Corollary 1

\(k\)-SAT-MATCH is in \(\mathcal {NP}\).

Proof

Each “yes” instance has a polynomial-size proof, which consists of the set of edges in the matching, and the set of values for each variable. Thus, each “yes” instance can be verified in polynomial time. \(\square \)

Corollary 2

\(k\)-SAT-MATCH is \(\mathcal {NP}\)-hard.

Proof. Obviously, if we solely focus on the constraint satisfaction problem (i.e., the condition on satisfiability required in Definition 4), \(k\)-SAT-MATCH is definitely \(\mathcal {NP}\)-hard. However, as we want to show the difficulty of the matching problem itself (e.g., adding an edge to the matching set will induce some relationships that hinder the matching of the other nodes), we assume here the constraint satisfaction step is at no cost.

We are going to reduce X3C (Exact Cover by 3-Set) to the \(k\)-SAT-MATCH problem. The X3C problem [14] takes as input a set of elements \(\mathcal {S}=\{S_1, S_2, \ldots , S_{3n}\}\) and a collection of 3-element set \(\mathcal {C}=\{C_1, C_2, \ldots , C_m\}\) and asks whether there is a sub-collection of \(\mathcal {C}\), whose size is \(n\), such that it exactly covers all elements of \(\mathcal {S}\). The reduction is to construct a constrained bipartite graph \(G=(U,V,E)\) as follows.

  1. 1.

    For each 3-element set \(C_i =\{S_j, S_k, S_\ell \}\), insert 3 constrained nodes \(u_{i,j}\), \(u_{i,k}\), and \(u_{i,\ell }\) to constrained node set \(U\). The propositional formulas that are associated with \(u_{i,j}\), \(u_{i,k}\), and \(u_{i,\ell }\) would be [$\(a_j\) \(\le \) \(w_i\)], [$\(a_k\) \(\le \) \(w_i\)] and [$\(a_\ell \) \(\le \) \(w_i\)], respectively ($\(a_j\), $\(a_k\), $\(a_\ell \) are symbols and \(w_i\) is any unique value).

  2. 2.

    For each element \(S_j\), insert a constrained node \(v_j\) to constrained node set \(V\). The propositional formula that is associated with \(v_i\) would be [$\(b_j\) \(\ge \) \(w\)] (value \(w\) would be the same for all elements).

  3. 3.

    Connect the nodes in \(U\) and \(V\) if they are created from the same element \(S_j\). For instance, assume a 3-element set \(C_2 = \{S_4, S_5, S_6\}\) has inserted 3 nodes \(u_{2,4}\), \(u_{2,5}\), and \(u_{2,6}\) to \(U\) in Step 1 and element \(S_4\) has inserted a node \(v_4\) into \(V\) in Step 2. Then, nodes \(u_{2,4}\) and \(v_4\) should be connected as both of them are created from element \(S_4\).

  4. 4.

    For each 3-element set \(C_i\), insert a node \(u_{C_i}\) with propositional formula [$\(c_i\) \(>\) \(w_i\)] to \(U\) and insert a node \(v_{C_i}\) with propositional formula [$\(d_i\) \(\le \) \(w\)] to \(V\) and connect the two nodes with an edge.

The rest of the proof will establish:

Proposition 1

There is an exact cover of \(\mathcal {S}\) if and only if the size of maximum satisfiable matching of \(G\) is exactly \(3n+(m-n)\).

Firstly, if the node \(u_{i,j}\) appears in the MSM, it must be matched with the node \(v_j\), so that it will induce the total-order relationship \(w_i \ge w\). On the other hand, if \(u_{C_i}\) appears in the MSM, it must be matched with will \(v_{C_i}\), so that it will induce the total-order relationship \(w > w_i\). Thus, if either \(u_{i,j}\), \(u_{i,k}\), or \(u_{i,\ell }\) appear in the MSM, we cannot have \(u_{C_i}\) in the MSM at the same time.

Suppose we denote \(z\) to be the number of \(i\)’s such that \(u_{i,j}\), \(u_{i,k}\), or \(u_{i,\ell }\) appear in the MSM. Then, the size of MSM is at most \(3z + (m-z)\), which in turn is at most \(3n + (m-n)\) since \(z \le n\).

The “only-if” direction. Next, suppose there is an exact cover of \(\mathcal {S}\). In that case, let \(C_{i_1}, C_{i_2}, \ldots , C_{i_n}\) be the 3-sets such that they exactly cover \(\mathcal {S}\). This implies the elements in these 3-sets must be distinct from each other. Then, consider the following matching in \(G\):

  1. 1.

    For each \(i \in \{i_1, i_2, \ldots , i_n\}\), the corresponding nodes of \(C_{i_t}\), i.e., \(u_{i_t,j}, u_{i_t,k}, u_{i_t,\ell }\), are matched to \(v_j\), \(v_k\), and \(v_\ell \), respectively.

  2. 2.

    For each \(i \notin \{i_1, i_2, \ldots , i_n\}\), \(u_{C_i}\) is matched to \(v_{C_i}\).

The above matching is also satisfiable because the edges induce total-order relationships of the form \(w_i \ge w\) when \(i \in \{i_1,i_2,\ldots ,i_n\}\), and of the form \(w_i < w\) for other choice of \(i\). Thus, all edges can be satisfied simultaneously. Finally, it is easy to check that the above matching has \(3n+(m-n)\) edges, so that it is a maximum satisfiable matching.

The “if” direction. If the size of MSM is exactly \(3n+(m-n)\), we claim that \(z\), which is the number of \(i\)’s such that \(u_{i,j}\), \(u_{i,k}\), or \(u_{i,\ell }\) appear in the MSM, must be exactly \(n\); in addition, for each such \(i\), all \(u_{i,j}, u_{i,k}, u_{i,\ell }\) must appear in the matching. If this claim is true, it will immediately imply the corresponding 3-sets \(C_i\)’s (in total \(n\) of them) will cover exactly \(\mathcal {S}\).

Now, it remains to prove the claim. We first show that \(z = n\). If \(z < n\), then the matching can contain at most \(3z\) edges connecting some \(u_{i,r}\) with \(v_r\), and at most \(m-z\) edges connecting some \(u_{C_s}\) with \(v_{C_s}\), so that the number of edges is at most \(3z + (m-z)\), which is less than \(3n+(m-n)\). On the other hand, if \(z > n\), then the matching can contain at most \(3n\) edges connecting some \(u_{i,r}\) with \(v_r\) (because \(v_r\) is limited), and at most \(m-z\) edges connecting some \(u_{C_s}\) with \(v_{C_s}\), so that the number of edges is at most \(3n + (m-z)\), which again is less than \(3n+(m-n)\). Thus, if the size of MSM is \(3n+(m-n)\), we must have \(z=n\).

Given \(z=n\), there are at most \(m-n\) edges connecting \(u_{C_s}\) with \(v_{C_s}\). Thus, at least \(3n\) edges must be connecting some \(u_{i,r}\) with \(v_r\). However, since there are only \(n\) values of \(i\) with \(u_{i,j}\), \(u_{i,k}\), or \(u_{i,\ell }\) appear in the MSM, the previous statement is possible unless for each such \(i\), all \(u_{i,j}, u_{i,k}, u_{i,\ell }\) appear in the matching. Thus, the proof of the claim completes, and so do the proofs of the Proposition 1 and Corollary 2.

1.2 Proof sketch of the optimal integration plan problem

Proof Sketch. Given an instance of a cross product optimization [13], we create a corresponding symbolic database such that the matching size between two databases is always equal to the size of the cartesian product of the databases plus the size of the two databases. Thus, finding the maximum (satisfiable) matching equals to finding the optimal join ordering, which is \(\mathcal {NP}\)-hard. \(\square \)

Rights and permissions

Reprints and permissions

About this article

Cite this article

Lo, E., Cheng, N., Lin, W.W.K. et al. MyBenchmark: generating databases for query workloads. The VLDB Journal 23, 895–913 (2014). https://doi.org/10.1007/s00778-014-0354-1

Download citation

  • Received:

  • Revised:

  • Accepted:

  • Published:

  • Issue Date:

  • DOI: https://doi.org/10.1007/s00778-014-0354-1

Keywords

Navigation