ABSTRACT
Data mining remains an important research area in database systems. We present a review of processing alternatives, storage mechanisms, algorithms, data structures and optimizations that enable data mining on large data sets. We focus on the computation of well-known multidimensional statistical and machine learning models. We pay particular attention to SQL and MapReduce as two competing technologies for large-scale processing. We conclude with a summary of major solved problems and open research issues.