Scalable machine learning computing a data summarization matrix with a parallel array DBMS

Ordonez, Carlos; Zhang, Yiqun; Johnsson, S. Lennart

doi:10.1007/s10619-018-7229-1

Published: 08 June 2018

Volume 37, pages 329–350, (2019)

Distributed and Parallel Databases

Carlos Ordonez¹,
Yiqun Zhang¹ &
S. Lennart Johnsson¹

Abstract

Big data analytics requires scalable (beyond RAM limits) and highly parallel (exploiting many CPU cores) processing of machine learning models, which in general involve heavy matrix manipulation. Array DBMSs are a promising platform for manipulating large matrices. With that motivation in mind, we present a high-performance system exploiting a parallel array DBMS to evaluate a general, but compact, matrix summarization that benefits many machine learning models. We focus on two representative models: linear regression (supervised) and PCA (unsupervised). Our approach combines data summarization inside the parallel DBMS with further model computation in a mathematical language (e.g. R). We introduce a two-phase algorithm which first computes a general data summary in parallel and then evaluates matrix equations with reduced intermediate matrices in main memory on one node. We present theoretical results characterizing speedup and time/space complexity. From a parallel data system perspective, we consider scale-up and scale-out in a shared-nothing architecture. In contrast to most big data analytic systems, our system is based on array operators programmed in C++ that work directly on the Unix file system, instead of Java or Scala running on HDFS mounted on top of Unix, resulting in much faster processing. Experiments compare our system with Spark (parallel) and R (single machine), showing orders-of-magnitude improvement in running time. We present parallel benchmarks varying the number of threads and processing nodes. Our two-phase approach should motivate analysts to exploit a parallel array DBMS for matrix summarization.
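
The two-phase idea in the abstract can be illustrated with a small sketch. It is not the authors' C++ or SciDB implementation, and the abstract does not spell out the summarization matrix. The sketch assumes the summary is the sum of outer products of augmented points z = (1, x, y), which holds the count, the linear sums and the cross-products. All function names (`summarize`, `merge`, `solve`, `linear_regression`) and the toy data are hypothetical. Phase 1 builds one small summary per partition and adds them together. Phase 2 fits linear regression from the merged summary alone, without rereading the data.

```python
from functools import reduce

def summarize(rows, d):
    """Phase 1, per partition: G = sum of z z^T with z = (1, x_1..x_d, y)."""
    G = [[0.0] * (d + 2) for _ in range(d + 2)]
    for *x, y in rows:
        z = [1.0, *x, y]
        for a, za in enumerate(z):
            for b, zb in enumerate(z):
                G[a][b] += za * zb
    return G

def merge(G1, G2):
    """Partial summaries add element-wise, so partitions are independent."""
    return [[a + b for a, b in zip(r1, r2)] for r1, r2 in zip(G1, G2)]

def solve(A, b):
    """Gaussian elimination with partial pivoting for a small dense system.
    Assumes A is non-singular."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for r in reversed(range(n)):
        s = sum(M[r][k] * x[k] for k in range(r + 1, n))
        x[r] = (M[r][n] - s) / M[r][r]
    return x

def linear_regression(G):
    """Phase 2: solve the normal equations from the (d+2) x (d+2) summary."""
    k = len(G) - 1                    # row/column index of y in G
    A = [row[:k] for row in G[:k]]    # cross-products of (1, x)
    b = [row[k] for row in G[:k]]     # cross-products of (1, x) with y
    return solve(A, b)                # [intercept, beta_1, ..., beta_d]

# Toy data generated from y = 2 + 3*x1 - x2, split into two "partitions".
partitions = [
    [(1.0, 2.0, 3.0), (2.0, 0.0, 8.0)],
    [(0.0, 1.0, 1.0), (3.0, 3.0, 8.0), (1.0, 0.0, 5.0)],
]
G = reduce(merge, (summarize(p, 2) for p in partitions))
beta = linear_regression(G)
print(beta)
```

Because the summary has size (d+2) x (d+2) regardless of the number of points, only these small matrices need to be combined and held in memory in the second phase. In this sketch the per-partition calls to `summarize` run sequentially; in a parallel setting each could run on a separate thread or node.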

Acknowledgements

This work concludes a long-term project, during which the first author visited MIT from 2013 to 2016. The first author thanks Michael Stonebraker for his guidance in moving away from relational DBMSs to compute machine learning models in a scalable manner, and in understanding SciDB storage and processing mechanisms for large matrices.

Author information

Authors and Affiliations

Department of Computer Science, University of Houston, Houston, TX, 77204, USA
Carlos Ordonez, Yiqun Zhang & S. Lennart Johnsson

Corresponding author

Correspondence to Carlos Ordonez.

About this article

Cite this article

Ordonez, C., Zhang, Y. & Johnsson, S.L. Scalable machine learning computing a data summarization matrix with a parallel array DBMS. Distrib Parallel Databases 37, 329–350 (2019). https://doi.org/10.1007/s10619-018-7229-1

Published: 08 June 2018
Issue Date: 15 September 2019
DOI: https://doi.org/10.1007/s10619-018-7229-1
