MyBenchmark: generating databases for query workloads

Lo, Eric; Cheng, Nick; Lin, Wilfred W. K.; Hon, Wing-Kai; Choi, Byron

doi:10.1007/s00778-014-0354-1

MyBenchmark: generating databases for query workloads

Regular Paper
Published: 15 March 2014

Volume 23, pages 895–913, (2014)
Cite this article

The VLDB Journal Aims and scope Submit manuscript

Eric Lo¹,
Nick Cheng¹,
Wilfred W. K. Lin¹,
Wing-Kai Hon² &
…
Byron Choi³

1926 Accesses
16 Citations
Explore all metrics

Abstract

To evaluate the performance of database applications and database management systems (DBMSs), we usually execute workloads of queries on generated databases of different sizes and then benchmark various measures such as respond time and throughput. This paper introduces MyBenchmark, a parallel data generation tool that takes a set of queries as input and generates database instances. Users of MyBenchmark can control the characteristics of the generated data as well as the characteristics of the resulting workload. Applications of MyBenchmark include DBMS testing, database application testing, and application-driven benchmarking. In this paper, we present the architecture and the implementation algorithms of MyBenchmark. Experimental results show that MyBenchmark is able to generate workload-aware databases for a variety of workloads including query workloads extracted from TPC-C, TPC-E, TPC-H, and TPC-W benchmarks.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Log in via an institution

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

CirroData: Yet Another SQL-on-Hadoop Data Analytics Engine with High Performance

Article 17 January 2020

UniBench: A Benchmark for Multi-model Database Management Systems

A Framework for Supporting Repetition and Evaluation in the Process of Cloud-Based DBMS Performance Benchmarking

Notes

In case MyBenchmark generates more than one test database, we may use a database testing framework (e.g., DbUnit [1], HTpar [17]) to automatically assign the generated databases to the test queries.
In this case, the developers need to specify only the output cardinalities of the final results and may leave the constraints of intermediate operators empty.
In fact, $Q_2$ must annotate consistent constraints with $Q_1$ (e.g., two tuples for table $R$) or otherwise MyBenchmark will return an error to the user.
Note that $G_a$ is not necessarily complete.
Although TPC-E has more queries than TPC-C, their maximum $|\mathcal {R}|$ values are close because TPC-E also has more tables and the parametric predicates often access different attributes of different tables.
TPC-H has a much higher $|R|$ value than the other workloads because TPC-H queries often have parametric predicates on the same attributes of the fact table (e.g., the l_shipdate attribute in lineitem table).
A bushy tree is a balanced tree.
SQP and data instantiation are not our focus as long as they scale.

References

Dbunit. http://www.dbunit.org
DTM Data Generator. http://www.sqledit.com/dg/
IBM DB2 Test Database Generator. http://www-306.ibm.com/software/data/db2imstools/db2tools/db2tdbg/
Ahuja, R.K., Magnanti, T.L., Orlin, J.B.: Network Flows: Theory, Algorithms, and Applications. Prentice-Hall Inc, Upper Saddle River, NJ (1993)
MATH Google Scholar
Arasu, A., Kaushik, R., Li, J.: Data generation using declarative constraints. In: SIGMOD Conference, pp. 685–696 (2011)
Binnig, C., Kossmann, D., Lo, E.: Reverse query processing. In: ICDE (2007)
Binnig, C., Kossmann, D., Lo, E., Ozsu, M.T.: QAGen: generating query aware test databases. In: SIGMOD (2007)
Bruno, N., Chaudhuri, S.: Flexible database generators. In: VLDB, pp. 1097–1107 (2005)
Bruno, N., Chaudhuri, S., Thomas, D.: Generating queries with cardinality constraints for DBMS testing. In: TKDE (2006)
Chays, D., Deng, Y., Frankl, P.G., Dan, S., Vokolos, F.I., Weyuker, E.J.: An AGENDA for testing relational database applications. Software testing, verification and reliability (2004)
Chen, M.-S., Yu, P.S., Wu, K.-L.: Optimization of parallel execution for multi-join queries. IEEE Trans. Knowl. Data Eng. 8, 416–428 (1996)
Article Google Scholar
Cherkassky, B.V., Goldberg, A.V.: On implementing the push-relabel method for the maximum flow problem. Algorithmica 19(4), 390–410 (1997)
Article MATH MathSciNet Google Scholar
Cluet, S., Moerkotte, G.: On the complexity of generating optimal left-deep processing trees with cross products. In: ICDT, pp. 54–67 (1995)
Garey, M.R., Johnson, D.S.: Computers and Intractability; A Guide to the Theory of NP-Completeness. W. H. Freeman & Co. New York, NY, USA (1990)
Ghazal, A., Rabl, T., Hu, M., Raab, F., Poess, M., Crolotte, A., Jacobsen, H.-A.: Bigbench: towards an industry standard benchmark for big data analytics. In: SIGMOD Conference, pp. 1197–1208 (2013)
Gray, J., Sundaresan, P., Englert, S., Baclawski, K., Weinberger, P.J.: Quickly generating billion-record synthetic databases. In: SIGMOD, pp. 243–252 (1994)
Haftmann, F., Kossmann, D., Lo, E.: A framework for efficient regression tests on database applications. VLDB J. 16(1), 145–164 (2007)
Article Google Scholar
He, Z., Kunjir, M., Lim, H., Lo, E., Babu, S., Hsu, M., Castellanos, M.: Benchmarking “no one size fits all” big data analytics. In: The Third Workshop on Big Data Benchmarking (2013)
Hoag, J.E., Thompson, C.W.: A parallel general-purpose synthetic data generator. SIGMOD Rec. 36(1), 19–24 (2007)
Article Google Scholar
Houkjær, K., Torp, K., Wind, R.: Simple and realistic data generation. In: VLDB, pp. 1243–1246 (2006)
Kini, A., Shankar, S., Naughton, J.F., DeWitt D.J.: Database support for matching: limitations and opportunities. In: SIGMOD Conference, pp. 85–96 (2006)
Lo, E., Binnig, C., Kossmann, D., Ozsu, M.T., Hon, W.-K.: A framework for testing DBMS features. VLDB J. 19(2), 203–230 (2010)
Article Google Scholar
Lo, E., Cheng, N., Hon, W.-K.: Generating databases for query workloads. PVLDB 3(1), 848–859 (2010)
Google Scholar
Mannila, H., Räihä, K.-J.: Test data for relational queries. In: PODS, pp. 217–223 (1986)
Mishra, C., Koudas, N., Zuzarte, C.: Generating targeted queries for database testing. In: SIGMOD, pp. 499–510 (2008)
Neufeld, A., Moerkotte, G., Lockemann, P.C.: Generating consistent test data for a variable set of general consistency constraints. VLDB J. 2(2), 173–213 (1993)
Article Google Scholar
Olston, C., Chopra, S., Srivastava, U.: Generating example data for dataflow programs. In: SIGMOD Conference, pp. 245–256 (2009)
Rabl, T., Poess, M.: Parallel data generation for performance analysis of large, complex rdbms. In: DBTest 2011, p. 5 (2011)
Sellis, T.K.: Multiple-query optimization. ACM Trans. Database Syst. 13(1), 23–52 (1988)
Google Scholar
Stephens, J.M., Poess, M.: Mudd: a multi-dimensional data generator. In: WOSP, pp. 104–109 (2004)
Tay, Y.: Data generation for application-specific benchmarking. In: PVLDB, p. 4 (2011)

Download references

Author information

Authors and Affiliations

Department of Computing, The Hong Kong Polytechnic University, Hung Hom, Hong Kong
Eric Lo, Nick Cheng & Wilfred W. K. Lin
Department of Computer Science, National TsingHua University, Hsinchu City, Taiwan
Wing-Kai Hon
Department of Computer Science, Hong Kong Baptist University, Kowloon Tong, Hong Kong
Byron Choi

Authors

Eric Lo
View author publications
You can also search for this author in PubMed Google Scholar
Nick Cheng
View author publications
You can also search for this author in PubMed Google Scholar
Wilfred W. K. Lin
View author publications
You can also search for this author in PubMed Google Scholar
Wing-Kai Hon
View author publications
You can also search for this author in PubMed Google Scholar
Byron Choi
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Eric Lo.

Additional information

Research supported by Grant PolyU 525009E from Hong Kong RGC.

Appendix: Proofs

1.1 Proof of $k$-SAT-MATCH is $\mathcal {NP}$-complete

Theorem 1

Problem $k$-SAT-MATCH is $\mathcal {NP}$-complete.

We begin with proving $k$-SAT-MATCH is in $\mathcal {NP}$ and further show that it is $\mathcal {NP}$-hard by a reduction from the $\mathcal {NP}$-complete problem known as X3C (Exact Cover by 3-set).

Corollary 1

$k$-SAT-MATCH is in $\mathcal {NP}$.

Proof

Each “yes” instance has a polynomial-size proof, which consists of the set of edges in the matching, and the set of values for each variable. Thus, each “yes” instance can be verified in polynomial time. $\square $

Corollary 2

$k$-SAT-MATCH is $\mathcal {NP}$-hard.

Proof. Obviously, if we solely focus on the constraint satisfaction problem (i.e., the condition on satisfiability required in Definition 4), $k$-SAT-MATCH is definitely $\mathcal {NP}$-hard. However, as we want to show the difficulty of the matching problem itself (e.g., adding an edge to the matching set will induce some relationships that hinder the matching of the other nodes), we assume here the constraint satisfaction step is at no cost.

We are going to reduce X3C (Exact Cover by 3-Set) to the $k$-SAT-MATCH problem. The X3C problem [14] takes as input a set of elements $\mathcal {S}=\{S_1, S_2, \ldots , S_{3n}\}$ and a collection of 3-element set $\mathcal {C}=\{C_1, C_2, \ldots , C_m\}$ and asks whether there is a sub-collection of $\mathcal {C}$, whose size is $n$, such that it exactly covers all elements of $\mathcal {S}$. The reduction is to construct a constrained bipartite graph $G=(U,V,E)$ as follows.

1.
For each 3-element set $C_i =\{S_j, S_k, S_\ell \}$, insert 3 constrained nodes $u_{i,j}$, $u_{i,k}$, and $u_{i,\ell }$ to constrained node set $U$. The propositional formulas that are associated with $u_{i,j}$, $u_{i,k}$, and $u_{i,\ell }$ would be [$$a_j$ $\le $ $w_i$], [$$a_k$ $\le $ $w_i$] and [$$a_\ell $ $\le $ $w_i$], respectively ($$a_j$, $$a_k$, $$a_\ell $ are symbols and $w_i$ is any unique value).
2.
For each element $S_j$, insert a constrained node $v_j$ to constrained node set $V$. The propositional formula that is associated with $v_i$ would be [$$b_j$ $\ge $ $w$] (value $w$ would be the same for all elements).
3.
Connect the nodes in $U$ and $V$ if they are created from the same element $S_j$. For instance, assume a 3-element set $C_2 = \{S_4, S_5, S_6\}$ has inserted 3 nodes $u_{2,4}$, $u_{2,5}$, and $u_{2,6}$ to $U$ in Step 1 and element $S_4$ has inserted a node $v_4$ into $V$ in Step 2. Then, nodes $u_{2,4}$ and $v_4$ should be connected as both of them are created from element $S_4$.
4.
For each 3-element set $C_i$, insert a node $u_{C_i}$ with propositional formula [$$c_i$ $>$ $w_i$] to $U$ and insert a node $v_{C_i}$ with propositional formula [$$d_i$ $\le $ $w$] to $V$ and connect the two nodes with an edge.

The rest of the proof will establish:

Proposition 1

There is an exact cover of $\mathcal {S}$ if and only if the size of maximum satisfiable matching of $G$ is exactly $3n+(m-n)$.

Firstly, if the node $u_{i,j}$ appears in the MSM, it must be matched with the node $v_j$, so that it will induce the total-order relationship $w_i \ge w$. On the other hand, if $u_{C_i}$ appears in the MSM, it must be matched with will $v_{C_i}$, so that it will induce the total-order relationship $w > w_i$. Thus, if either $u_{i,j}$, $u_{i,k}$, or $u_{i,\ell }$ appear in the MSM, we cannot have $u_{C_i}$ in the MSM at the same time.

Suppose we denote $z$ to be the number of $i$’s such that $u_{i,j}$, $u_{i,k}$, or $u_{i,\ell }$ appear in the MSM. Then, the size of MSM is at most $3z + (m-z)$, which in turn is at most $3n + (m-n)$ since $z \le n$.

The “only-if” direction. Next, suppose there is an exact cover of $\mathcal {S}$. In that case, let $C_{i_1}, C_{i_2}, \ldots , C_{i_n}$ be the 3-sets such that they exactly cover $\mathcal {S}$. This implies the elements in these 3-sets must be distinct from each other. Then, consider the following matching in $G$:

1.
For each $i \in \{i_1, i_2, \ldots , i_n\}$, the corresponding nodes of $C_{i_t}$, i.e., $u_{i_t,j}, u_{i_t,k}, u_{i_t,\ell }$, are matched to $v_j$, $v_k$, and $v_\ell $, respectively.
2.
For each $i \notin \{i_1, i_2, \ldots , i_n\}$, $u_{C_i}$ is matched to $v_{C_i}$.

The above matching is also satisfiable because the edges induce total-order relationships of the form $w_i \ge w$ when $i \in \{i_1,i_2,\ldots ,i_n\}$, and of the form $w_i < w$ for other choice of $i$. Thus, all edges can be satisfied simultaneously. Finally, it is easy to check that the above matching has $3n+(m-n)$ edges, so that it is a maximum satisfiable matching.

The “if” direction. If the size of MSM is exactly $3n+(m-n)$, we claim that $z$, which is the number of $i$’s such that $u_{i,j}$, $u_{i,k}$, or $u_{i,\ell }$ appear in the MSM, must be exactly $n$; in addition, for each such $i$, all $u_{i,j}, u_{i,k}, u_{i,\ell }$ must appear in the matching. If this claim is true, it will immediately imply the corresponding 3-sets $C_i$’s (in total $n$ of them) will cover exactly $\mathcal {S}$.

Now, it remains to prove the claim. We first show that $z = n$. If $z < n$, then the matching can contain at most $3z$ edges connecting some $u_{i,r}$ with $v_r$, and at most $m-z$ edges connecting some $u_{C_s}$ with $v_{C_s}$, so that the number of edges is at most $3z + (m-z)$, which is less than $3n+(m-n)$. On the other hand, if $z > n$, then the matching can contain at most $3n$ edges connecting some $u_{i,r}$ with $v_r$ (because $v_r$ is limited), and at most $m-z$ edges connecting some $u_{C_s}$ with $v_{C_s}$, so that the number of edges is at most $3n + (m-z)$, which again is less than $3n+(m-n)$. Thus, if the size of MSM is $3n+(m-n)$, we must have $z=n$.

Given $z=n$, there are at most $m-n$ edges connecting $u_{C_s}$ with $v_{C_s}$. Thus, at least $3n$ edges must be connecting some $u_{i,r}$ with $v_r$. However, since there are only $n$ values of $i$ with $u_{i,j}$, $u_{i,k}$, or $u_{i,\ell }$ appear in the MSM, the previous statement is possible unless for each such $i$, all $u_{i,j}, u_{i,k}, u_{i,\ell }$ appear in the matching. Thus, the proof of the claim completes, and so do the proofs of the Proposition 1 and Corollary 2.

1.2 Proof sketch of the optimal integration plan problem

Proof Sketch. Given an instance of a cross product optimization [13], we create a corresponding symbolic database such that the matching size between two databases is always equal to the size of the cartesian product of the databases plus the size of the two databases. Thus, finding the maximum (satisfiable) matching equals to finding the optimal join ordering, which is $\mathcal {NP}$-hard. $\square $

Rights and permissions

Reprints and permissions

About this article

Cite this article

Lo, E., Cheng, N., Lin, W.W.K. et al. MyBenchmark: generating databases for query workloads. The VLDB Journal 23, 895–913 (2014). https://doi.org/10.1007/s00778-014-0354-1

Download citation

Received: 02 July 2013
Revised: 10 January 2014
Accepted: 13 February 2014
Published: 15 March 2014
Issue Date: December 2014
DOI: https://doi.org/10.1007/s00778-014-0354-1

Keywords

Access this article

Log in via an institution

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

MyBenchmark: generating databases for query workloads

Abstract

Access this article

Similar content being viewed by others

CirroData: Yet Another SQL-on-Hadoop Data Analytics Engine with High Performance

UniBench: A Benchmark for Multi-model Database Management Systems

A Framework for Supporting Repetition and Evaluation in the Process of Cloud-Based DBMS Performance Benchmarking

Notes

References

Author information

Authors and Affiliations

Corresponding author

Additional information

Appendix: Proofs

1.1 Proof of \(k\)-SAT-MATCH is \(\mathcal {NP}\)-complete

Theorem 1

Corollary 1

Proof

Corollary 2

Proposition 1

1.2 Proof sketch of the optimal integration plan problem

Rights and permissions

About this article

Cite this article

Keywords

Navigation

MyBenchmark: generating databases for query workloads

Abstract

Access this article

Similar content being viewed by others

CirroData: Yet Another SQL-on-Hadoop Data Analytics Engine with High Performance

UniBench: A Benchmark for Multi-model Database Management Systems

A Framework for Supporting Repetition and Evaluation in the Process of Cloud-Based DBMS Performance Benchmarking

Notes

References

Author information

Authors and Affiliations

Corresponding author

Additional information

Appendix: Proofs

Appendix: Proofs

1.1 Proof of \(k\)-SAT-MATCH is \(\mathcal {NP}\)-complete

Theorem 1

Corollary 1

Proof

Corollary 2

Proposition 1

1.2 Proof sketch of the optimal integration plan problem

Rights and permissions

About this article

Cite this article

Share this article

Keywords

Search

Navigation