Skip to main content
Log in

Towards intelligent database systems using clusters of SQL transactions

  • Regular Paper
  • Published:
Knowledge and Information Systems Aims and scope Submit manuscript

Abstract

Transactions are the bread-and-butter of database management system (DBMS) industry. When you check your bank balance, pay bill, or move money from saving to chequing account, transactions are involved. That transactions are self-similar—whether you pay a utility company or credit card, it is still a ‘pay bill’ transaction—has been noted before. Somewhat surprisingly, that property remains largely unexploited, barring some notable exceptions. The research reported in this paper begins to build ‘intelligence’ into database systems by offering built-in transaction classification and clustering. The utility of such an approach is demonstrated by showing how it simplifies DBMS monitoring and troubleshooting. The well-known DBSCAN algorithm clusters online transaction processing (OLTP) transactions: this paper’s contribution is in demonstrating a robust server-side feature extraction approach, rather than the previously suggested and error-prone log-mining approach. It is shown how ‘DBSCAN + angular cosine distance function’ finds better clusters than the previously tried combinations, and simplifies DBSCAN parameter tuning—a known nontrivial task. DBMS troubleshooting efficacy is demonstrated by identifying the root causes of several real-life performance problems: problematic transaction rollbacks; performance drifts; system-wide issues; CPU and memory bottlenecks; and so on. It is also shown that the cluster count remains unchanged irrespective of system load—a desirable but often overlooked property. The transaction clustering solution has been implemented inside the popular MySQL DBMS, although most modern relational database systems can benefit from the ideas described herein.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Fig. 1
Fig. 2
Fig. 3
Fig. 4
Fig. 5
Fig. 6
Fig. 7
Fig. 8
Fig. 9

Similar content being viewed by others

Notes

  1. Indeed, all of the workloads in OLTP-Bench have between 1 and 10 clusters [1]. Read-write and read-only flavors of Sysbench [2] contain 10 clusters each. Although Sysbench is a synthetic benchmark, all of the benchmarks in OLTP-Bench are based on real-life applications: online auctions; banking; talent show voting; social networking; key-value store; airline reservation system; and so on.

  2. INSERT, UPDATE, and DELETE statements can also have embedded SELECT queries, and those are handled similarly to the way \(FV_S\) is.

  3. A close cousin of ACD—the dot-product-based cosine distance—is not a metric because it does not satisfy the triangle inequality and indeed performs worse than ACD in clustering SQL transactions as demonstrated in Sect. 5.7.

  4. The name Eps is chosen because it matches a DBSCAN parameter’s name as mentioned in Sect. 2.3.

  5. The tables are implemented using circular in-memory queues, and therefore, do not grow beyond limits, but truncation ensures data fidelity.

  6. CH-benCHmark, JPAB, and ResourceStreser workloads produced run-time errors. LinkBench was excluded because it (and JPAB) have been removed from a rewrite of OLTP-Bench currently under development [25].

  7. As an interesting aside emphasizing the importance of metric distance functions, the author was attempting to generalize an elegant algorithm due to Lingas [27] that computed the relative neighborhood graphs (RNG). Lingas’ algorithm computed the RNG for a set of points using the Euclidean distance; the generalization was to compute the RNG for points with associated weights (disks). Generalizing the Euclidean distance to compute inter-disk distances makes it non-metric. A counterexample was found in [28] that suggested that a straightforward generalization of Lingas’ algorithm would be nearly impossible—the status quo after more than 27 years.

  8. The second subplot contains only five lines because clusters 1 and 3 examine 3 rows each, and the plotting software cannot distinguish two overlapping lines.

References

  1. Difallah DE, Pavlo A, Curino C, Cudré-Mauroux P (2013) OLTP-bench: an extensible testbed for benchmarking relational databases. PVLDB 7(4):277–288

    Google Scholar 

  2. GitHub: sysbench (2020). https://github.com/akopytov/sysbench

  3. Verbitski A, Gupta A, Saha D, Brahmadesam M, Gupta K, Mittal R, Krishnamurthy S, Maurice S, Kharatishvili T, Bao X (2017) Amazon aurora: design considerations for high throughput cloud-native relational databases. In: Proceedings of the 2017 ACM international conference on management of data, SIGMOD conference 2017, Chicago, IL, USA, pp 1041–1052. https://doi.org/10.1145/3035918.3056101. https://doi.org/10.1145/3035918.3056101

  4. Gan J, Tao Y (2015) DBSCAN revisited: mis-claim, un-fixability, and approximation. In: Proceedings of the ACM SIGMOD conference, pp 519–530. https://doi.org/10.1145/2723372.2737792

  5. Schubert E, Sander J, Ester M, Kriegel H, Xu X (2017) DBSCAN revisited, revisited: why and how you should (still) use DBSCAN. ACM Trans Database Syst 42(3):19–11921

    Article  MathSciNet  Google Scholar 

  6. MariaDB (2020) MariaDB MaxScale. https://mariadb.com/kb/en/maxscale/

  7. Harizopoulos S, Ailamaki A (2004) STEPS towards cache-resident transaction processing. In: (e)Proceedings of the thirtieth international conference on very large data bases, VLDB 2004, Toronto, Canada, pp 660–671. https://doi.org/10.1016/B978-012088469-8.50059-0. http://www.vldb.org/conf/2004/RS18P1.PDF

  8. Marathe AP (2021) DBMS performance troubleshooting in cloud computing using transaction clustering. In: Proceedings of the EDBT 2021 Conference, pp. 463–468 (2021). https://doi.org/10.5441/002/edbt.2021.52

  9. Li Q, Kraft P, Cafarella M, Demiralp c, Graefe G, Kozyrakis C, Stonebraker M, Suresh L, Zaharia M (2023) Transactions make debugging easy. In: Proceedings of the CIDR 2023 conference, Amsterdam, The Netherlands, Jan. 8–11

  10. Ester M, Kriegel H, Sander J, Xu X (1996) A density-based algorithm for discovering clusters in large spatial databases with noise. In: Proceedings of the second international conference on knowledge discovery and data mining (KDD-96), pp 226–231

  11. Yoon DY, Niu N, Mozafari B (2016) DBSherlock: a performance diagnostic tool for transactional databases. In: Proceedings of the 2016 international conference on management of data, pp 1599–1614 . https://doi.org/10.1145/2882903.2915218

  12. Scikit Learn: DBSCAN (2019). https://scikit-learn.org/stable/modules/generated/sklearn.cluster.DBSCAN.html

  13. Marcus RC, Negi P, Mao H, Zhang C, Alizadeh M, Kraska T, Papaemmanouil O, Tatbul N (2019) Neo: a learned query optimizer. Proc VLDB Endow 12(11):1705–1718. https://doi.org/10.14778/3342263.3342644

    Article  Google Scholar 

  14. Bermbach D, Wittern E, Tai S (2017) Cloud service benchmarking-measuring quality of cloud services from a client perspective. Springer, New York City

    Google Scholar 

  15. Wikipedia: Cosine similarity (2019). https://en.wikipedia.org/wiki/Cosine_similarity

  16. Leskovec J, Rajaraman A, Ullman JD (2014) Mining of massive datasets, 2nd edn. Cambridge University Press, Cambridge

    Book  Google Scholar 

  17. GitHub: DBSeer (2020). https://github.com/barzan/dbseer

  18. Mozafari B, Curino C, Jindal A, Madden S (2013) Performance and resource modeling in highly-concurrent OLTP workloads. In: Proceedings of the ACM SIGMOD international conference on management of data, SIGMOD 2013, New York, NY, USA, pp 301–312. https://doi.org/10.1145/2463676.2467800

  19. Makiyama VH, Raddick J, Santos RDC (2015) Text mining applied to SQL queries: a case study for the SDSS skyserver, vol 1478. CEUR-WS.org, Aachen, Germany, pp 66–72. http://ceur-ws.org/Vol-1478/paper7.pdf

  20. Kul G, Luong DTA, Xie T, Chandola V, Kennedy O, Upadhyaya SJ (2018) Similarity metrics for SQL query clustering. IEEE Trans Knowl Data Eng 30(12):2408–2420. https://doi.org/10.1109/TKDE.2018.2831214

    Article  Google Scholar 

  21. MySQL: Chapter 26 MySQL Performance Schema (2020). https://dev.mysql.com/doc/refman/8.0/en/performance-schema.html

  22. Microsoft: System Dynamic Managemenet Views (2019). https://docs.microsoft.com/en-us/sql/relational-databases/system-dynamic-management-views/system-dynamic-management-views?view=sql-server-ver15

  23. Oracle: About Dynamic Performance Views (2020). https://docs.oracle.com/cd/B19306_01/server.102/b14237/dynviews_1001.htm#i1398692

  24. PostgreSQL: The Statistics Collector (2020). https://www.postgresql.org/docs/9.6/monitoring-stats.html

  25. GitHub: OLTP-Bench II (2020). https://github.com/timveil-cockroach/oltpbench

  26. Sander J, Ester M, Kriegel H, Xu X (1998) Density-based clustering in spatial databases: the algorithm GDBSCAN and its applications. Data Min Knowl Discov 2(2):169–194

    Article  Google Scholar 

  27. Lingas A (1994) A linear-time construction of the relative neighborhood graph from the Delaunay triangulation. Comput Geom 4:199–208. https://doi.org/10.1016/0925-7721(94)90018-3

    Article  MathSciNet  MATH  Google Scholar 

  28. Marathe AP (1995) The weighted relative neighbourhood graph. Master’s thesis, York University, Toronto, Ontario, Canada

  29. Marathe AP (2020) LRZ convolution: an algorithm for automatic anomaly detection in time-series data. In: Proceedings of the SSDBM 2020 conference, pp 1–1112

  30. TPC-C (1992). http://www.tpc.org/tpcc/

  31. Wikipedia: Kernel-based Virtual Machine (2020). https://en.wikipedia.org/wiki/Kernel-based_Virtual_Machine

  32. Wikipedia: Ruby on Rails (2020). https://en.wikipedia.org/wiki/Ruby_on_Rails

  33. Wikipedia: Django (web framework) (2020). https://en.wikipedia.org/wiki/Django_(web_framework)

  34. Stonebraker M, Madden S, Abadi DJ, Harizopoulos S, Hachem N, Helland P (2007) The end of an architectural era (it’s time for a complete rewrite). In: Proceedings of the 33rd international conference on very large data bases. University of Vienna, Austria, pp 1150–1160. http://www.vldb.org/conf/2007/papers/industrial/p1150-stonebraker.pdf

  35. Xu D, Tian Y (2015) A comprehensive survey of clustering algorithms. Ann Data Sci 2:165–193

    Article  Google Scholar 

  36. Li G, Zhou X, Li S, Gao B (2019) Qtune: a query-aware database tuning system with deep reinforcement learning. Proc VLDB Endow 12(12):2118–2130

    Article  Google Scholar 

  37. Agrawal R, Rantzau R, Terzi E (2006) Context-sensitive ranking. In: Proceedings of the ACM SIGMOD conference, pp. 383–394. https://doi.org/10.1145/1142473.1142517

  38. Zhang T, Ramakrishnan R, Livny M (1997) BIRCH: a new data clustering algorithm and its applications. Data Min Knowl Discov 1(2):141–182

    Article  Google Scholar 

  39. Lloyd SP (1982) Least squares quantization in PCM. IEEE Trans Inf Theory 28(2):129–136

    Article  MathSciNet  MATH  Google Scholar 

  40. Dopazo J, Carazo JM (1997) Phylogenetic reconstruction using an unsupervised growing neural network that adopts the topology of a phylogenetic tree. J Mol Evol 44:226–233

    Article  Google Scholar 

Download references

Acknowledgements

Matthew Van Dijk implemented the server-side feature extraction framework.

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Arunprasad P. Marathe.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

Reprints and permissions

About this article

Check for updates. Verify currency and authenticity via CrossMark

Cite this article

Marathe, A.P. Towards intelligent database systems using clusters of SQL transactions. Knowl Inf Syst 65, 2863–2894 (2023). https://doi.org/10.1007/s10115-023-01850-5

Download citation

  • Received:

  • Revised:

  • Accepted:

  • Published:

  • Issue Date:

  • DOI: https://doi.org/10.1007/s10115-023-01850-5

Keywords

Navigation