DOI: 10.1145/3209889.3209894

End-to-End Machine Learning with Apache AsterixDB

Published: 15 June 2018

Abstract

Recent developments in machine learning and data science provide a foundation for extracting underlying information from Big Data. Unfortunately, current platforms and tools often require data scientists to glue together and maintain custom-built platforms consisting of multiple Big Data component technologies. In this paper, we explain how Apache AsterixDB, an open source Big Data Management System, can help to reduce the burden involved in using machine learning algorithms in Big Data analytics. In particular, we describe how AsterixDB's built-in support for user-defined functions (UDFs), the availability of UDFs in data ingestion pipelines and queries, and the provision of machine learning platform and notebook inter-operation capabilities can together enable data analysts to more easily create and manage end-to-end analytical dataflows.



Published In

DEEM'18: Proceedings of the Second Workshop on Data Management for End-To-End Machine Learning
June 2018
63 pages
ISBN: 9781450358286
DOI: 10.1145/3209889
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Conference

SIGMOD/PODS '18

Acceptance Rates

DEEM'18 Paper Acceptance Rate 10 of 16 submissions, 63%;
Overall Acceptance Rate 44 of 67 submissions, 66%

Cited By

  • (2024) Higher-Order SQL Lambda Functions. 2024 IEEE 40th International Conference on Data Engineering (ICDE), pages 5622-5628. DOI: 10.1109/ICDE60146.2024.00450. Online publication date: 13-May-2024
  • (2022) DCA-IoMT: Knowledge-Graph-Embedding-Enhanced Deep Collaborative Alert Recommendation Against COVID-19. IEEE Transactions on Industrial Informatics, 18(12), pages 8924-8935. DOI: 10.1109/TII.2022.3159710. Online publication date: Dec-2022
  • (2021) Managing ML pipelines. Proceedings of the VLDB Endowment, 14(12), pages 3178-3181. DOI: 10.14778/3476311.3476402. Online publication date: 1-Jul-2021
  • (2021) An authorization model for query execution in the cloud. The VLDB Journal, 31(3), pages 555-579. DOI: 10.1007/s00778-021-00709-x. Online publication date: 6-Nov-2021
  • (2021) Distributed Query Evaluation over Encrypted Data. Data and Applications Security and Privacy XXXV, pages 96-114. DOI: 10.1007/978-3-030-81242-3_6. Online publication date: 14-Jul-2021
  • (2020) Bridging BAD Islands: Declarative Data Sharing at Scale. 2020 IEEE International Conference on Big Data (Big Data), pages 2002-2011. DOI: 10.1109/BigData50022.2020.9378342. Online publication date: 10-Dec-2020
  • (2020) Automating the expansion of a knowledge graph. Expert Systems with Applications: An International Journal, 141(C). DOI: 10.1016/j.eswa.2019.112965. Online publication date: 1-Mar-2020
  • (2020) BAD to the bone: Big Active Data at its core. The VLDB Journal. DOI: 10.1007/s00778-020-00616-7. Online publication date: 23-May-2020
