DOI: 10.1145/2882903.2903744
research-article

Building the Enterprise Fabric for Big Data with Vertica and Spark Integration

Published: 14 June 2016

Abstract

Enterprise customers increasingly require greater flexibility in how they access and process their Big Data, while continuing to request advanced analytics and access to diverse data sources. Yet customers still require the robustness of enterprise-class analytics for their mission-critical data. In this paper, we present our initial efforts toward a solution that satisfies these requirements by integrating the HPE Vertica enterprise database with Apache Spark's open source big data computation engine. In particular, the integration enables fast, reliable transfer of data between Vertica and Spark, and deployment of machine learning models created by Spark into Vertica for predictive analytics on Vertica data. This integration provides a fabric on which our customers get the best of both worlds: it extends Vertica's extensive SQL analytics capabilities with Spark's machine learning library (MLlib), giving Vertica users access to a wide range of ML functions; it also enables customers to leverage Spark as an advanced ETL engine for all data that require the guarantees offered by Vertica.

    Published In

    SIGMOD '16: Proceedings of the 2016 International Conference on Management of Data
    June 2016
    2300 pages
    ISBN:9781450335317
    DOI:10.1145/2882903
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. PMML
    2. analytics
    3. big data
    4. connector
    5. database
    6. spark
    7. vertica

    Qualifiers

    • Research-article

    Conference

    SIGMOD/PODS'16: International Conference on Management of Data
    June 26 - July 1, 2016
    San Francisco, California, USA

    Acceptance Rates

    Overall Acceptance Rate 785 of 4,003 submissions, 20%

    Article Metrics

    • Downloads (Last 12 months): 16
    • Downloads (Last 6 weeks): 1
    Reflects downloads up to 13 Feb 2025

    Cited By

    • (2023) Application of Financial Big Data Analysis Method Based on Collaborative Filtering Algorithm in Supply Chain Enterprises. International Journal of Cooperative Information Systems 33:04. DOI: 10.1142/S0218843023500223. Online publication date: 27-Sep-2023
    • (2021) Hyperspace. Proceedings of the VLDB Endowment 14:12 (3043-3055). DOI: 10.14778/3476311.3476382. Online publication date: 28-Oct-2021
    • (2020) Vertica-ML: Distributed Machine Learning in Vertica Database. Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data (755-768). DOI: 10.1145/3318464.3386137. Online publication date: 11-Jun-2020
    • (2020) Real-Time Device Reach Forecasting Using HLL and MinHash Data Sketches. 2020 7th International Conference on Soft Computing & Machine Intelligence (ISCMI) (153-157). DOI: 10.1109/ISCMI51676.2020.9311573. Online publication date: 14-Nov-2020
    • (2017) Analysis of NUMA effects in modern multicore systems for the design of high-performance data transfer applications. Future Generation Computer Systems 74:C (41-50). DOI: 10.1016/j.future.2017.04.001. Online publication date: 1-Sep-2017
