Towards intelligent database systems using clusters of SQL transactions

Marathe, Arunprasad P.

doi:10.1007/s10115-023-01850-5

Towards intelligent database systems using clusters of SQL transactions

Regular Paper
Published: 16 March 2023

Volume 65, pages 2863–2894, (2023)
Cite this article

Knowledge and Information Systems Aims and scope Submit manuscript

Arunprasad P. Marathe ORCID: orcid.org/0000-0002-3789-7670¹

341 Accesses
1 Citation
1 Altmetric
Explore all metrics

Abstract

Transactions are the bread-and-butter of database management system (DBMS) industry. When you check your bank balance, pay bill, or move money from saving to chequing account, transactions are involved. That transactions are self-similar—whether you pay a utility company or credit card, it is still a ‘pay bill’ transaction—has been noted before. Somewhat surprisingly, that property remains largely unexploited, barring some notable exceptions. The research reported in this paper begins to build ‘intelligence’ into database systems by offering built-in transaction classification and clustering. The utility of such an approach is demonstrated by showing how it simplifies DBMS monitoring and troubleshooting. The well-known DBSCAN algorithm clusters online transaction processing (OLTP) transactions: this paper’s contribution is in demonstrating a robust server-side feature extraction approach, rather than the previously suggested and error-prone log-mining approach. It is shown how ‘DBSCAN + angular cosine distance function’ finds better clusters than the previously tried combinations, and simplifies DBSCAN parameter tuning—a known nontrivial task. DBMS troubleshooting efficacy is demonstrated by identifying the root causes of several real-life performance problems: problematic transaction rollbacks; performance drifts; system-wide issues; CPU and memory bottlenecks; and so on. It is also shown that the cluster count remains unchanged irrespective of system load—a desirable but often overlooked property. The transaction clustering solution has been implemented inside the popular MySQL DBMS, although most modern relational database systems can benefit from the ideas described herein.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Log in via an institution

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

A comprehensive survey of data mining

Article 06 February 2020

A survey on the evolution of stream processing systems

Article Open access 22 November 2023

Advancing database security: a comprehensive systematic mapping study of potential challenges

Article Open access 17 July 2023

Notes

Indeed, all of the workloads in OLTP-Bench have between 1 and 10 clusters [1]. Read-write and read-only flavors of Sysbench [2] contain 10 clusters each. Although Sysbench is a synthetic benchmark, all of the benchmarks in OLTP-Bench are based on real-life applications: online auctions; banking; talent show voting; social networking; key-value store; airline reservation system; and so on.
INSERT, UPDATE, and DELETE statements can also have embedded SELECT queries, and those are handled similarly to the way \(FV_S\) is.
A close cousin of ACD—the dot-product-based cosine distance—is not a metric because it does not satisfy the triangle inequality and indeed performs worse than ACD in clustering SQL transactions as demonstrated in Sect. 5.7.
The name Eps is chosen because it matches a DBSCAN parameter’s name as mentioned in Sect. 2.3.
The tables are implemented using circular in-memory queues, and therefore, do not grow beyond limits, but truncation ensures data fidelity.
CH-benCHmark, JPAB, and ResourceStreser workloads produced run-time errors. LinkBench was excluded because it (and JPAB) have been removed from a rewrite of OLTP-Bench currently under development [25].
As an interesting aside emphasizing the importance of metric distance functions, the author was attempting to generalize an elegant algorithm due to Lingas [27] that computed the relative neighborhood graphs (RNG). Lingas’ algorithm computed the RNG for a set of points using the Euclidean distance; the generalization was to compute the RNG for points with associated weights (disks). Generalizing the Euclidean distance to compute inter-disk distances makes it non-metric. A counterexample was found in [28] that suggested that a straightforward generalization of Lingas’ algorithm would be nearly impossible—the status quo after more than 27 years.
The second subplot contains only five lines because clusters 1 and 3 examine 3 rows each, and the plotting software cannot distinguish two overlapping lines.

References

Difallah DE, Pavlo A, Curino C, Cudré-Mauroux P (2013) OLTP-bench: an extensible testbed for benchmarking relational databases. PVLDB 7(4):277–288
Google Scholar
GitHub: sysbench (2020). https://github.com/akopytov/sysbench
Verbitski A, Gupta A, Saha D, Brahmadesam M, Gupta K, Mittal R, Krishnamurthy S, Maurice S, Kharatishvili T, Bao X (2017) Amazon aurora: design considerations for high throughput cloud-native relational databases. In: Proceedings of the 2017 ACM international conference on management of data, SIGMOD conference 2017, Chicago, IL, USA, pp 1041–1052. https://doi.org/10.1145/3035918.3056101. https://doi.org/10.1145/3035918.3056101
Gan J, Tao Y (2015) DBSCAN revisited: mis-claim, un-fixability, and approximation. In: Proceedings of the ACM SIGMOD conference, pp 519–530. https://doi.org/10.1145/2723372.2737792
Schubert E, Sander J, Ester M, Kriegel H, Xu X (2017) DBSCAN revisited, revisited: why and how you should (still) use DBSCAN. ACM Trans Database Syst 42(3):19–11921
Article MathSciNet Google Scholar
MariaDB (2020) MariaDB MaxScale. https://mariadb.com/kb/en/maxscale/
Harizopoulos S, Ailamaki A (2004) STEPS towards cache-resident transaction processing. In: (e)Proceedings of the thirtieth international conference on very large data bases, VLDB 2004, Toronto, Canada, pp 660–671. https://doi.org/10.1016/B978-012088469-8.50059-0. http://www.vldb.org/conf/2004/RS18P1.PDF
Marathe AP (2021) DBMS performance troubleshooting in cloud computing using transaction clustering. In: Proceedings of the EDBT 2021 Conference, pp. 463–468 (2021). https://doi.org/10.5441/002/edbt.2021.52
Li Q, Kraft P, Cafarella M, Demiralp c, Graefe G, Kozyrakis C, Stonebraker M, Suresh L, Zaharia M (2023) Transactions make debugging easy. In: Proceedings of the CIDR 2023 conference, Amsterdam, The Netherlands, Jan. 8–11
Ester M, Kriegel H, Sander J, Xu X (1996) A density-based algorithm for discovering clusters in large spatial databases with noise. In: Proceedings of the second international conference on knowledge discovery and data mining (KDD-96), pp 226–231
Yoon DY, Niu N, Mozafari B (2016) DBSherlock: a performance diagnostic tool for transactional databases. In: Proceedings of the 2016 international conference on management of data, pp 1599–1614 . https://doi.org/10.1145/2882903.2915218
Scikit Learn: DBSCAN (2019). https://scikit-learn.org/stable/modules/generated/sklearn.cluster.DBSCAN.html
Marcus RC, Negi P, Mao H, Zhang C, Alizadeh M, Kraska T, Papaemmanouil O, Tatbul N (2019) Neo: a learned query optimizer. Proc VLDB Endow 12(11):1705–1718. https://doi.org/10.14778/3342263.3342644
Article Google Scholar
Bermbach D, Wittern E, Tai S (2017) Cloud service benchmarking-measuring quality of cloud services from a client perspective. Springer, New York City
Google Scholar
Wikipedia: Cosine similarity (2019). https://en.wikipedia.org/wiki/Cosine_similarity
Leskovec J, Rajaraman A, Ullman JD (2014) Mining of massive datasets, 2nd edn. Cambridge University Press, Cambridge
Book Google Scholar
GitHub: DBSeer (2020). https://github.com/barzan/dbseer
Mozafari B, Curino C, Jindal A, Madden S (2013) Performance and resource modeling in highly-concurrent OLTP workloads. In: Proceedings of the ACM SIGMOD international conference on management of data, SIGMOD 2013, New York, NY, USA, pp 301–312. https://doi.org/10.1145/2463676.2467800
Makiyama VH, Raddick J, Santos RDC (2015) Text mining applied to SQL queries: a case study for the SDSS skyserver, vol 1478. CEUR-WS.org, Aachen, Germany, pp 66–72. http://ceur-ws.org/Vol-1478/paper7.pdf
Kul G, Luong DTA, Xie T, Chandola V, Kennedy O, Upadhyaya SJ (2018) Similarity metrics for SQL query clustering. IEEE Trans Knowl Data Eng 30(12):2408–2420. https://doi.org/10.1109/TKDE.2018.2831214
Article Google Scholar
MySQL: Chapter 26 MySQL Performance Schema (2020). https://dev.mysql.com/doc/refman/8.0/en/performance-schema.html
Microsoft: System Dynamic Managemenet Views (2019). https://docs.microsoft.com/en-us/sql/relational-databases/system-dynamic-management-views/system-dynamic-management-views?view=sql-server-ver15
Oracle: About Dynamic Performance Views (2020). https://docs.oracle.com/cd/B19306_01/server.102/b14237/dynviews_1001.htm#i1398692
PostgreSQL: The Statistics Collector (2020). https://www.postgresql.org/docs/9.6/monitoring-stats.html
GitHub: OLTP-Bench II (2020). https://github.com/timveil-cockroach/oltpbench
Sander J, Ester M, Kriegel H, Xu X (1998) Density-based clustering in spatial databases: the algorithm GDBSCAN and its applications. Data Min Knowl Discov 2(2):169–194
Article Google Scholar
Lingas A (1994) A linear-time construction of the relative neighborhood graph from the Delaunay triangulation. Comput Geom 4:199–208. https://doi.org/10.1016/0925-7721(94)90018-3
Article MathSciNet MATH Google Scholar
Marathe AP (1995) The weighted relative neighbourhood graph. Master’s thesis, York University, Toronto, Ontario, Canada
Marathe AP (2020) LRZ convolution: an algorithm for automatic anomaly detection in time-series data. In: Proceedings of the SSDBM 2020 conference, pp 1–1112
TPC-C (1992). http://www.tpc.org/tpcc/
Wikipedia: Kernel-based Virtual Machine (2020). https://en.wikipedia.org/wiki/Kernel-based_Virtual_Machine
Wikipedia: Ruby on Rails (2020). https://en.wikipedia.org/wiki/Ruby_on_Rails
Wikipedia: Django (web framework) (2020). https://en.wikipedia.org/wiki/Django_(web_framework)
Stonebraker M, Madden S, Abadi DJ, Harizopoulos S, Hachem N, Helland P (2007) The end of an architectural era (it’s time for a complete rewrite). In: Proceedings of the 33rd international conference on very large data bases. University of Vienna, Austria, pp 1150–1160. http://www.vldb.org/conf/2007/papers/industrial/p1150-stonebraker.pdf
Xu D, Tian Y (2015) A comprehensive survey of clustering algorithms. Ann Data Sci 2:165–193
Article Google Scholar
Li G, Zhou X, Li S, Gao B (2019) Qtune: a query-aware database tuning system with deep reinforcement learning. Proc VLDB Endow 12(12):2118–2130
Article Google Scholar
Agrawal R, Rantzau R, Terzi E (2006) Context-sensitive ranking. In: Proceedings of the ACM SIGMOD conference, pp. 383–394. https://doi.org/10.1145/1142473.1142517
Zhang T, Ramakrishnan R, Livny M (1997) BIRCH: a new data clustering algorithm and its applications. Data Min Knowl Discov 1(2):141–182
Article Google Scholar
Lloyd SP (1982) Least squares quantization in PCM. IEEE Trans Inf Theory 28(2):129–136
Article MathSciNet MATH Google Scholar
Dopazo J, Carazo JM (1997) Phylogenetic reconstruction using an unsupervised growing neural network that adopts the topology of a phylogenetic tree. J Mol Evol 44:226–233
Article Google Scholar

Download references

Acknowledgements

Matthew Van Dijk implemented the server-side feature extraction framework.

Author information

Authors and Affiliations

Huawei Technologies Canada Co., Ltd., 19 Allstate Pkwy, Markham, ON, L3R 5A4, Canada
Arunprasad P. Marathe

Authors

Arunprasad P. Marathe
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Arunprasad P. Marathe.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

Reprints and permissions

About this article

Cite this article

Marathe, A.P. Towards intelligent database systems using clusters of SQL transactions. Knowl Inf Syst 65, 2863–2894 (2023). https://doi.org/10.1007/s10115-023-01850-5

Download citation

Received: 07 July 2022
Revised: 13 January 2023
Accepted: 13 February 2023
Published: 16 March 2023
Issue Date: July 2023
DOI: https://doi.org/10.1007/s10115-023-01850-5

Keywords

Access this article

Log in via an institution

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

Towards intelligent database systems using clusters of SQL transactions

Abstract

Access this article

Similar content being viewed by others

A comprehensive survey of data mining

A survey on the evolution of stream processing systems

Advancing database security: a comprehensive systematic mapping study of potential challenges

Notes

References

Acknowledgements

Author information

Authors and Affiliations

Corresponding author

Additional information

Publisher's Note

Rights and permissions

About this article

Cite this article

Keywords

Navigation

Towards intelligent database systems using clusters of SQL transactions

Abstract

Access this article

Similar content being viewed by others

A comprehensive survey of data mining

A survey on the evolution of stream processing systems

Advancing database security: a comprehensive systematic mapping study of potential challenges

Notes

References

Acknowledgements

Author information

Authors and Affiliations

Corresponding author

Additional information

Publisher's Note

Rights and permissions

About this article

Cite this article

Share this article

Keywords

Search

Navigation