Abstract
Machine Learning has become highly popular due to several success stories in data-driven applications. Prominent examples include object detection in images, speech recognition, and text translation. According to Gartner’s 2016 Hype Cycle for Emerging Technologies, Machine Learning is currently at the peak of inflated expectations, with several other application domains trying to exploit Machine Learning technology. Since data-driven applications are a fundamental cornerstone of the database community as well, it is natural to ask how these fields relate to each other. In this article, we therefore provide a brief introduction to the field of Machine Learning, discuss its interplay with other fields such as Data Mining and Databases, and give an overview of recent data management systems integrating Machine Learning functionality.
Notes
Not every problem at hand needs to be tackled by Machine Learning. For example, detecting people’s resumes on the Web via Machine Learning has not been shown to be advantageous over manually designing an algorithm to discover resumes [15]: “Since everyone who has looked at or written a resume has a pretty good idea of what resumes contain, there was no mystery about what makes a Web page a resume.”
Often these concepts are not well separated. The most prominent example is the k‑means clustering model, where the default algorithm to solve it (Lloyd’s algorithm) is itself often called k‑means.
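The distinction between the model and the algorithm can be made concrete with a minimal sketch in plain Python. The function names `kmeans_objective` and `lloyd`, the random initialization, and the toy data are illustrative assumptions, not taken from the article: the first function only states what a good clustering is (the model), while the second is one possible procedure for finding such a clustering (Lloyd's algorithm).

```python
import random


def kmeans_objective(points, centers, labels):
    """The model: sum of squared distances from each point to its assigned center."""
    return sum(
        sum((p - c) ** 2 for p, c in zip(point, centers[label]))
        for point, label in zip(points, labels)
    )


def lloyd(points, k, iterations=20, seed=0):
    """One algorithm for the model: alternate assignment and mean-update steps."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # assumption: initialize with k random data points
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest current center.
        labels = [
            min(
                range(k),
                key=lambda j: sum((p - c) ** 2 for p, c in zip(point, centers[j])),
            )
            for point in points
        ]
        # Update step: move each center to the mean of its assigned points.
        for j in range(k):
            members = [pt for pt, lab in zip(points, labels) if lab == j]
            if members:  # keep the old center if a cluster ends up empty
                centers[j] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return centers, labels


# Toy data (hypothetical): two well-separated groups of three points each.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
centers, labels = lloyd(points, k=2)
cost = kmeans_objective(points, centers, labels)
```

Any other optimizer that produces `centers` and `labels` could be scored with the same `kmeans_objective`, which is why the two concepts are worth keeping apart even though both are commonly called k‑means.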
Usually the whole process is called Knowledge Discovery, while phase 4 is referred to as Data Mining.
References
Aref M, ten Cate B, Green TJ, Kimelfeld B, Olteanu D, Pasalic E, Veldhuizen TL, Washburn G (2015) Design and implementation of the logicblox system. In: SIGMOD, pp 1371–1382
Bishop CM (2006) Pattern Recognition and Machine Learning. Springer, New York
Böhm M, Burdick DR, Evfimievski AV, Reinwald B, Reiss FR, Sen P, Tatikonda S, Tian Y (2014) Systemml’s optimizer: Plan generation for large-scale machine learning programs. IEEE Data Eng Bull 37(3):52–62
Cai Z, Vagena Z, Perez LL, Arumugam S, Haas PJ, Jermaine CM (2013) Simulation of database-valued markov chains using simsql. In: SIGMOD, pp 637–648
Carbone P, Katsifodimos A, Ewen S, Markl V, Haridi S, Tzoumas K (2015) Apache flink™: Stream and batch processing in a single engine. IEEE Data Eng Bull 38(4):28–38
Chaudhuri S, Narasayya VR (2007) Self-tuning database systems: A decade of progress. In: VLDB, pp 3–14
Das S, Li F, Narasayya VR, König AC (2016) Automated demand-driven resource scaling in relational database-as-a-service. In: SIGMOD, pp 1923–1934
Dean J, Corrado G, Monga R, Chen K, Devin M, Mao M, Senior A, Tucker P, Yang K, Le QV et al (2012) Large scale distributed deep networks. In: NIPS, pp 1223–1231
Elnaffar S, Martin TP, Horman R (2002) Automatically classifying database workloads. In: CIKM, pp 622–624
Ganapathi A, Kuno HA, Dayal U, Wiener JL, Fox A, Jordan MI, Patterson DA (2009) Predicting multiple metrics for queries: Better decisions enabled by machine learning. In: ICDE, pp 592–603
Hellerstein JM, Ré C, Schoppmann F, Wang DZ, Fratkin E, Gorajek A, Ng KS, Welton C, Feng X, Li K, Kumar A (2012) The madlib analytics library or MAD skills, the SQL. PVLDB 5(12):1700–1711
Holze M, Ritter N (2008) Autonomic databases: Detection of workload shifts with n‑gram-models. In: ADBIS, pp 127–142
Kraska T, Talwalkar A, Duchi JC, Griffith R, Franklin MJ, Jordan MI (2013) Mlbase: A distributed machine-learning system. In: CIDR
Kunft A, Alexandrov A, Katsifodimos A, Markl V (2016) Bridging the gap: towards optimization across linear and relational algebra. In: Workshop on Algorithms and Systems for MapReduce and Beyond, BeyondMR@SIGMOD, pp 1–4
Leskovec J, Rajaraman A, Ullman JD (2014) Mining of massive datasets. Cambridge University Press, Cambridge
Li M, Andersen DG, Park JW, Smola AJ, Ahmed A, Josifovski V, Long J, Shekita EJ, Su B-Y (2014) Scaling distributed machine learning with the parameter server. In: OSDI, pp 583–598
Mozafari B, Curino C, Jindal A, Madden S (2013) Performance and resource modeling in highly-concurrent OLTP workloads. In: SIGMOD, pp 301–312
Murphy KP (2012) Machine learning: a probabilistic perspective. MIT Press, Cambridge
Passing L, Then M, Hubig N, Lang H, Schreier M, Günnemann S, Kemper A, Neumann T (2017) SQL- and operator-centric data analytics in relational main-memory databases. In: EDBT
Pavlo A et al (2017) Self-driving database management systems. In: CIDR
Recht B, Ré C, Wright S, Niu F (2011) Hogwild: A lock-free approach to parallelizing stochastic gradient descent. In: NIPS, pp 693–701
Roy N, Dubey A, Gokhale AS (2011) Efficient autoscaling in the cloud using predictive models for workload forecasting. In: CLOUD, pp 500–507
Samuel AL (1959) Some studies in machine learning using the game of checkers. IBM J Res Dev 3(3):210–229
Sapia C (2000) PROMISE: predicting query behavior to enable predictive caching strategies for OLAP systems. In: DaWaK, pp 224–233
Schelter S, Palumbo A, Quinn S, Marthi S, Musselman A (2016) Samsara: Declarative machine learning on distributed dataflow systems. In: Machine Learning Systems workshop at NIPS
Shearer C (2000) The crisp-dm model: the new blueprint for data mining. J Data Warehous 5(4):13–22
Tamayo P et al (2005) Oracle data mining – data mining in the database environment. In: The Data Mining and Knowledge Discovery Handbook, pp 1315–1329
Zaharia M, Chowdhury M, Franklin MJ, Shenker S, Stoica I (2010) Spark: Cluster computing with working sets. In: HotCloud, pp 1–7
Cite this article
Günnemann, S. Machine Learning Meets Databases. Datenbank Spektrum 17, 77–83 (2017). https://doi.org/10.1007/s13222-017-0247-8
DOI: https://doi.org/10.1007/s13222-017-0247-8