Abstract
Relational DBMSs (RDBMSs) remain the most popular tool for processing structured data in data intensive domains. However, most of stand-alone data mining packages process flat files outside a RDBMS. In-database data mining avoids export-import data/results bottleneck as opposed to use stand-alone mining packages and keeps all the benefits provided by a RDBMS. The paper presents an approach to data mining inside a RDBMS based on a parallel implementation of user-defined functions (UDFs). Such an approach is implemented for PostgreSQL and modern Intel MIC (Many Integrated Core) architecture. The UDF performs a single mining task on data from the specified table and produces a resulting table. The UDF is organized as a wrapper of an appropriate mining algorithm, which is implemented in C language and is parallelized by the OpenMP technology and thread-level parallelism. The heavy-weight parts of the algorithm are additionally parallelized by intrinsic functions for MIC platforms to reach the optimal loop vectorization manually. The library of such UDFs supports a cache of precomputed mining structures to reduce costs of further computations. In the experiments, the proposed approach shows good scalability and overtakes R data mining package.
Access this chapter
Tax calculation will be finalised at checkout
Purchases are for personal use only
Similar content being viewed by others
Notes
References
Duran, A., Klemm, M.: The Intel Many Integrated Core architecture. In: Smari, W.W., Zeljkovic, V. (eds.) HPCS, pp. 365–366. IEEE (2012)
Engreitz, J.M., Daigle Jr., B.J., Marshall, J.J., Altman, R.B.: Independent component analysis: mining microarray data for fundamental human gene expression modules. J. Biomed. Inform. 43(6), 932–944 (2010)
Feng, X., Kumar, A., Recht, B., Re, C.: Towards a unified architecture for in-RDBMS analytics. In: Candan, K.S., Chen, Y., Snodgrass, R.T., Gravano, L., Fuxman, A. (eds.) Proceedings of the ACM SIGMOD International Conference on Management of Data, SIGMOD 2012, Scottsdale, AZ, USA, 20–24 May 2012, pp. 325–336. ACM (2012)
Garcia, W., Ordonez, C., Zhao, K., Chen, P.: Efficient algorithms based on relational queries to mine frequent graphs. In: Nica, A., Varde, A.S. (eds.) Proceedings of the Third Ph.D. Workshop on Information and Knowledge Management, PIKM 2010, Toronto, Ontario, Canada, pp. 17–24. ACM, 30 October 2010
Han, J., Fu, Y., Wang, W., Chiang, J., Gong, W., Koperski, K., Li, D., Lu, Y., Rajan, A., Stefanovic, N., Xia, B., Zaiane, O.R.: Dbminer: a system for mining knowledge in large relational databases. In: Simoudis, E., Han, J., Fayyad, U.M. (eds.) Proceedings of the Second International Conference on Knowledge Discovery and Data Mining (KDD-96), Portland, Oregon, USA, pp. 250–255. AAAI Press (1996)
Hellerstein, J.M., Re, C., Schoppmann, F., Wang, D.Z., Fratkin, E., Gorajek, A., Ng, K.S., Welton, C., Feng, X., Li, K., Kumar, A.: The MADlib analytics library or MAD skills, the SQL. PVLDB 5(12), 1700–1711 (2012)
Imielinski, T., Virmani, A.: MSQL: a query language for database mining. Data Min. Knowl. Discov. 3(4), 373–408 (1999)
Jaedicke, M., Mitschang, B.: On parallel processing of aggregate and scalar functions in object-relational DBMS. In: Haas, L.M., Tiwary, A. (eds.) SIGMOD 1998, Proceedings of the ACM SIGMOD International Conference on Management of Data, 2–4 June, 1998, Seattle, Washington, USA, pp. 379–389. ACM Press (1998)
Kaufman, L., Rousseeuw, P.J.: Finding Groups in Data: An Introduction to Cluster Analysis. Wiley, New York (1990)
Kostenetskiy, P., Safonov, A.: SUSU supercomputer resources. In: Sokolinsky, L., Starodubov, I. (eds.) PCT 2016, International Scientific Conference on Parallel Computational Technologies, Arkhangelsk, Russia, 29–31 March 2016, CEUR Workshop Proceedings, vol. 1576, pp. 561–573. CEUR-WS.org (2016)
Lichman, M.: UCI machine learning repository. Irvine, CA: University of California, School of Information and Computer Science (2013). http://archive.ics.uci.edu/ml/datasets/individual+household+electric+power+consumption
Mahajan, D., Kim, J.K., Sacks, J., Ardalan, A., Kumar, A., Esmaeilzadeh, H.: In-RDBMS Hardware Acceleration of Advanced Analytics. CoRR abs/1801.06027 (2018)
Meek, C., Thiesson, B., Heckerman, D.: The learning-curve sampling method applied to model-based clustering. J. Mach. Learn. Res. 2, 397–418 (2002)
Melnykov, V., Chen, W.C., Maitra, R.: MixSim: an R package for simulating data to study performance of clustering algorithms. J. Stat. Softw. Artic. 51(12), 1–25 (2012)
Miniakhmetov, R., Zymbler, M.: Integration of fuzzy c-means clustering algorithm with PostgreSQL database management system. Numer. Methods Programm. 13(2(26)), 46–52 (2012)
O’Neil, E.J., O’Neil, P.E., Weikum, G.: The LRU-K page replacement algorithm for database disk buffering. In: Buneman, P., Jajodia, S. (eds.) Proceedings of the 1993 ACM SIGMOD International Conference on Management of Data, Washington, D.C., 26–28 May 1993, pp. 297–306. ACM Press (1993)
Ordonez, C.: Integrating k-means clustering with a relational DBMS using SQL. IEEE Trans. Knowl. Data Eng. 18(2), 188–201 (2006)
Ordonez, C.: Building statistical models and scoring with UDFs. In: Chan, C.Y., Ooi, B.C., Zhou, A. (eds.) Proceedings of the ACM SIGMOD International Conference on Management of Data, Beijing, China, 12–14 June 2007, pp. 1005–1016. ACM (2007)
Ordonez, C., Garcia-Garcia, J.: Vector and matrix operations programmed with UDFs in a relational DBMS. In: Yu, P.S., Tsotras, V.J., Fox, E.A., Liu, B. (eds.) Proceedings of the 2006 ACM CIKM International Conference on Information and Knowledge Management, Arlington, Virginia, USA, 6–11 November 2006, pp. 503–512. ACM (2006)
Ordonez, C., Pitchaimalai, S.K.: Bayesian classifiers programmed in SQL. IEEE Trans. Knowl. Data Eng. 22(1), 139–144 (2010)
Pan, C.S., Zymbler, M.L.: Very large graph partitioning by means of parallel DBMS. In: Catania, B., Guerrini, G., Pokorný, J. (eds.) ADBIS 2013. LNCS, vol. 8133, pp. 388–399. Springer, Heidelberg (2013). https://doi.org/10.1007/978-3-642-40683-6_29
Peng, Y., Grossman, M., Sarkar, V.: Static cost estimation for data layout selection on GPUs. In: 7th International Workshop on Performance Modeling, Benchmarking and Simulation of High Performance Computer Systems, PMBS@SC 2016, Salt Lake, UT, USA, 14 November 2016, pp. 76–86. IEEE (2016)
Rantzau, R.: Frequent itemset discovery with SQL using universal quantification. In: Meo, R., Lanzi, P.L., Klemettinen, M. (eds.) Database Support for Data Mining Applications. LNCS (LNAI), vol. 2682, pp. 194–213. Springer, Heidelberg (2004). https://doi.org/10.1007/978-3-540-44497-8_10
Rechkalov, T., Zymbler, M.: Accelerating medoids-based clustering with the Intel Many Integrated Core architecture. In: 9th International Conference on Application of Information and Communication Technologies, AICT 2015, 14–16 October 2015, Rostov-on-Don, Russia - Proceedings, pp. 413–417 (2015)
Rechkalov, T., Zymbler, M.: An approach to data mining inside PostgreSQL based on parallel implementation of UDFs. In: Kalinichenko, L.A., Manolopoulos, Y., Kuznetsov, S.O. (eds.) Selected Papers of the XIX International Conference on Data Analytics and Management in Data Intensive Domains (DAMDID/RCDL 2017), Moscow, Russia, 9–13 October 2017, CEUR Workshop Proceedings, vol. 2022, pp. 114–121. CEUR-WS.org (2017)
Sattler, K., Dunemann, O.: SQL database primitives for decision tree classifiers. In: Proceedings of the 2001 ACM CIKM International Conference on Information and Knowledge Management, Atlanta, Georgia, USA, 5–10 November 2001, pp. 379–386. ACM (2001)
Shang, X., Sattler, K.-U., Geist, I.: SQL Based Frequent Pattern Mining with FP-Growth. In: Seipel, D., Hanus, M., Geske, U., Bartenstein, O. (eds.) INAP/WLP -2004. LNCS (LNAI), vol. 3392, pp. 32–46. Springer, Heidelberg (2005). https://doi.org/10.1007/11415763_3
Sokolinsky, L.B.: LFU-K: an effective buffer management replacement algorithm. In: Lee, Y., Li, J., Whang, K.-Y., Lee, D. (eds.) DASFAA 2004. LNCS, vol. 2973, pp. 670–681. Springer, Heidelberg (2004). https://doi.org/10.1007/978-3-540-24571-1_60
Acknowledgments
This work was financially supported by the Russian Foundation for Basic Research (grant No. 17-07-00463), by Act 211 Government of the Russian Federation (contract No. 02.A03.21.0011) and by the Ministry of education and science of Russian Federation (government order 2.7905.2017/8.9). Authors thank RSC Group (Moscow, Russia) for the provided computational resources.
Author information
Authors and Affiliations
Corresponding author
Editor information
Editors and Affiliations
Rights and permissions
Copyright information
© 2018 Springer International Publishing AG, part of Springer Nature
About this paper
Cite this paper
Rechkalov, T., Zymbler, M. (2018). Integrating DBMS and Parallel Data Mining Algorithms for Modern Many-Core Processors. In: Kalinichenko, L., Manolopoulos, Y., Malkov, O., Skvortsov, N., Stupnikov, S., Sukhomlin, V. (eds) Data Analytics and Management in Data Intensive Domains. DAMDID/RCDL 2017. Communications in Computer and Information Science, vol 822. Springer, Cham. https://doi.org/10.1007/978-3-319-96553-6_17
Download citation
DOI: https://doi.org/10.1007/978-3-319-96553-6_17
Published:
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-96552-9
Online ISBN: 978-3-319-96553-6
eBook Packages: Computer ScienceComputer Science (R0)