Abstract
Transactions are the bread-and-butter of database management system (DBMS) industry. When you check your bank balance, pay bill, or move money from saving to chequing account, transactions are involved. That transactions are self-similar—whether you pay a utility company or credit card, it is still a ‘pay bill’ transaction—has been noted before. Somewhat surprisingly, that property remains largely unexploited, barring some notable exceptions. The research reported in this paper begins to build ‘intelligence’ into database systems by offering built-in transaction classification and clustering. The utility of such an approach is demonstrated by showing how it simplifies DBMS monitoring and troubleshooting. The well-known DBSCAN algorithm clusters online transaction processing (OLTP) transactions: this paper’s contribution is in demonstrating a robust server-side feature extraction approach, rather than the previously suggested and error-prone log-mining approach. It is shown how ‘DBSCAN + angular cosine distance function’ finds better clusters than the previously tried combinations, and simplifies DBSCAN parameter tuning—a known nontrivial task. DBMS troubleshooting efficacy is demonstrated by identifying the root causes of several real-life performance problems: problematic transaction rollbacks; performance drifts; system-wide issues; CPU and memory bottlenecks; and so on. It is also shown that the cluster count remains unchanged irrespective of system load—a desirable but often overlooked property. The transaction clustering solution has been implemented inside the popular MySQL DBMS, although most modern relational database systems can benefit from the ideas described herein.
Similar content being viewed by others
Notes
Indeed, all of the workloads in OLTP-Bench have between 1 and 10 clusters [1]. Read-write and read-only flavors of Sysbench [2] contain 10 clusters each. Although Sysbench is a synthetic benchmark, all of the benchmarks in OLTP-Bench are based on real-life applications: online auctions; banking; talent show voting; social networking; key-value store; airline reservation system; and so on.
INSERT, UPDATE, and DELETE statements can also have embedded SELECT queries, and those are handled similarly to the way \(FV_S\) is.
A close cousin of ACD—the dot-product-based cosine distance—is not a metric because it does not satisfy the triangle inequality and indeed performs worse than ACD in clustering SQL transactions as demonstrated in Sect. 5.7.
The name Eps is chosen because it matches a DBSCAN parameter’s name as mentioned in Sect. 2.3.
The tables are implemented using circular in-memory queues, and therefore, do not grow beyond limits, but truncation ensures data fidelity.
CH-benCHmark, JPAB, and ResourceStreser workloads produced run-time errors. LinkBench was excluded because it (and JPAB) have been removed from a rewrite of OLTP-Bench currently under development [25].
As an interesting aside emphasizing the importance of metric distance functions, the author was attempting to generalize an elegant algorithm due to Lingas [27] that computed the relative neighborhood graphs (RNG). Lingas’ algorithm computed the RNG for a set of points using the Euclidean distance; the generalization was to compute the RNG for points with associated weights (disks). Generalizing the Euclidean distance to compute inter-disk distances makes it non-metric. A counterexample was found in [28] that suggested that a straightforward generalization of Lingas’ algorithm would be nearly impossible—the status quo after more than 27 years.
The second subplot contains only five lines because clusters 1 and 3 examine 3 rows each, and the plotting software cannot distinguish two overlapping lines.
References
Difallah DE, Pavlo A, Curino C, Cudré-Mauroux P (2013) OLTP-bench: an extensible testbed for benchmarking relational databases. PVLDB 7(4):277–288
GitHub: sysbench (2020). https://github.com/akopytov/sysbench
Verbitski A, Gupta A, Saha D, Brahmadesam M, Gupta K, Mittal R, Krishnamurthy S, Maurice S, Kharatishvili T, Bao X (2017) Amazon aurora: design considerations for high throughput cloud-native relational databases. In: Proceedings of the 2017 ACM international conference on management of data, SIGMOD conference 2017, Chicago, IL, USA, pp 1041–1052. https://doi.org/10.1145/3035918.3056101. https://doi.org/10.1145/3035918.3056101
Gan J, Tao Y (2015) DBSCAN revisited: mis-claim, un-fixability, and approximation. In: Proceedings of the ACM SIGMOD conference, pp 519–530. https://doi.org/10.1145/2723372.2737792
Schubert E, Sander J, Ester M, Kriegel H, Xu X (2017) DBSCAN revisited, revisited: why and how you should (still) use DBSCAN. ACM Trans Database Syst 42(3):19–11921
MariaDB (2020) MariaDB MaxScale. https://mariadb.com/kb/en/maxscale/
Harizopoulos S, Ailamaki A (2004) STEPS towards cache-resident transaction processing. In: (e)Proceedings of the thirtieth international conference on very large data bases, VLDB 2004, Toronto, Canada, pp 660–671. https://doi.org/10.1016/B978-012088469-8.50059-0. http://www.vldb.org/conf/2004/RS18P1.PDF
Marathe AP (2021) DBMS performance troubleshooting in cloud computing using transaction clustering. In: Proceedings of the EDBT 2021 Conference, pp. 463–468 (2021). https://doi.org/10.5441/002/edbt.2021.52
Li Q, Kraft P, Cafarella M, Demiralp c, Graefe G, Kozyrakis C, Stonebraker M, Suresh L, Zaharia M (2023) Transactions make debugging easy. In: Proceedings of the CIDR 2023 conference, Amsterdam, The Netherlands, Jan. 8–11
Ester M, Kriegel H, Sander J, Xu X (1996) A density-based algorithm for discovering clusters in large spatial databases with noise. In: Proceedings of the second international conference on knowledge discovery and data mining (KDD-96), pp 226–231
Yoon DY, Niu N, Mozafari B (2016) DBSherlock: a performance diagnostic tool for transactional databases. In: Proceedings of the 2016 international conference on management of data, pp 1599–1614 . https://doi.org/10.1145/2882903.2915218
Scikit Learn: DBSCAN (2019). https://scikit-learn.org/stable/modules/generated/sklearn.cluster.DBSCAN.html
Marcus RC, Negi P, Mao H, Zhang C, Alizadeh M, Kraska T, Papaemmanouil O, Tatbul N (2019) Neo: a learned query optimizer. Proc VLDB Endow 12(11):1705–1718. https://doi.org/10.14778/3342263.3342644
Bermbach D, Wittern E, Tai S (2017) Cloud service benchmarking-measuring quality of cloud services from a client perspective. Springer, New York City
Wikipedia: Cosine similarity (2019). https://en.wikipedia.org/wiki/Cosine_similarity
Leskovec J, Rajaraman A, Ullman JD (2014) Mining of massive datasets, 2nd edn. Cambridge University Press, Cambridge
GitHub: DBSeer (2020). https://github.com/barzan/dbseer
Mozafari B, Curino C, Jindal A, Madden S (2013) Performance and resource modeling in highly-concurrent OLTP workloads. In: Proceedings of the ACM SIGMOD international conference on management of data, SIGMOD 2013, New York, NY, USA, pp 301–312. https://doi.org/10.1145/2463676.2467800
Makiyama VH, Raddick J, Santos RDC (2015) Text mining applied to SQL queries: a case study for the SDSS skyserver, vol 1478. CEUR-WS.org, Aachen, Germany, pp 66–72. http://ceur-ws.org/Vol-1478/paper7.pdf
Kul G, Luong DTA, Xie T, Chandola V, Kennedy O, Upadhyaya SJ (2018) Similarity metrics for SQL query clustering. IEEE Trans Knowl Data Eng 30(12):2408–2420. https://doi.org/10.1109/TKDE.2018.2831214
MySQL: Chapter 26 MySQL Performance Schema (2020). https://dev.mysql.com/doc/refman/8.0/en/performance-schema.html
Microsoft: System Dynamic Managemenet Views (2019). https://docs.microsoft.com/en-us/sql/relational-databases/system-dynamic-management-views/system-dynamic-management-views?view=sql-server-ver15
Oracle: About Dynamic Performance Views (2020). https://docs.oracle.com/cd/B19306_01/server.102/b14237/dynviews_1001.htm#i1398692
PostgreSQL: The Statistics Collector (2020). https://www.postgresql.org/docs/9.6/monitoring-stats.html
GitHub: OLTP-Bench II (2020). https://github.com/timveil-cockroach/oltpbench
Sander J, Ester M, Kriegel H, Xu X (1998) Density-based clustering in spatial databases: the algorithm GDBSCAN and its applications. Data Min Knowl Discov 2(2):169–194
Lingas A (1994) A linear-time construction of the relative neighborhood graph from the Delaunay triangulation. Comput Geom 4:199–208. https://doi.org/10.1016/0925-7721(94)90018-3
Marathe AP (1995) The weighted relative neighbourhood graph. Master’s thesis, York University, Toronto, Ontario, Canada
Marathe AP (2020) LRZ convolution: an algorithm for automatic anomaly detection in time-series data. In: Proceedings of the SSDBM 2020 conference, pp 1–1112
TPC-C (1992). http://www.tpc.org/tpcc/
Wikipedia: Kernel-based Virtual Machine (2020). https://en.wikipedia.org/wiki/Kernel-based_Virtual_Machine
Wikipedia: Ruby on Rails (2020). https://en.wikipedia.org/wiki/Ruby_on_Rails
Wikipedia: Django (web framework) (2020). https://en.wikipedia.org/wiki/Django_(web_framework)
Stonebraker M, Madden S, Abadi DJ, Harizopoulos S, Hachem N, Helland P (2007) The end of an architectural era (it’s time for a complete rewrite). In: Proceedings of the 33rd international conference on very large data bases. University of Vienna, Austria, pp 1150–1160. http://www.vldb.org/conf/2007/papers/industrial/p1150-stonebraker.pdf
Xu D, Tian Y (2015) A comprehensive survey of clustering algorithms. Ann Data Sci 2:165–193
Li G, Zhou X, Li S, Gao B (2019) Qtune: a query-aware database tuning system with deep reinforcement learning. Proc VLDB Endow 12(12):2118–2130
Agrawal R, Rantzau R, Terzi E (2006) Context-sensitive ranking. In: Proceedings of the ACM SIGMOD conference, pp. 383–394. https://doi.org/10.1145/1142473.1142517
Zhang T, Ramakrishnan R, Livny M (1997) BIRCH: a new data clustering algorithm and its applications. Data Min Knowl Discov 1(2):141–182
Lloyd SP (1982) Least squares quantization in PCM. IEEE Trans Inf Theory 28(2):129–136
Dopazo J, Carazo JM (1997) Phylogenetic reconstruction using an unsupervised growing neural network that adopts the topology of a phylogenetic tree. J Mol Evol 44:226–233
Acknowledgements
Matthew Van Dijk implemented the server-side feature extraction framework.
Author information
Authors and Affiliations
Corresponding author
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Marathe, A.P. Towards intelligent database systems using clusters of SQL transactions. Knowl Inf Syst 65, 2863–2894 (2023). https://doi.org/10.1007/s10115-023-01850-5
Received:
Revised:
Accepted:
Published:
Issue Date:
DOI: https://doi.org/10.1007/s10115-023-01850-5