DOI: 10.1145/2882903.2882957
research-article
Open access

SQLShare: Results from a Multi-Year SQL-as-a-Service Experiment

Published: 14 June 2016

Editorial Notes

Computationally Replicable. The experimental results of this paper were replicated by a SIGMOD Review Committee and were found to support the central results reported in the paper.

Abstract

We analyze the workload from a multi-year deployment of a database-as-a-service platform targeting scientists and data scientists with minimal database experience. Our hypothesis was that relatively minor changes to the way databases are delivered can increase their use in ad hoc analysis environments. The web-based SQLShare system emphasizes easy dataset-at-a-time ingest, relaxed schemas and schema inference, easy view creation and sharing, and full SQL support. We find that these features have helped attract workloads typically associated with scripts and files rather than relational databases: complex analytics, routine processing pipelines, data publishing, and collaborative analysis. Quantitatively, these workloads are characterized by shorter dataset "lifetimes", higher query complexity, and higher data complexity. We report on usage scenarios that suggest SQL is being used in place of scripts for one-off data analysis and ad hoc data sharing. The workload suggests that a new class of relational systems emphasizing short-term, ad hoc analytics over engineered schemas may improve uptake of database technology in data science contexts. Our contributions include a system design for delivering databases into these contexts, a description of a public research query workload dataset released to advance research in analytic data systems, and an initial analysis of the workload that provides evidence of new use cases under-supported in existing systems.
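
As a rough illustration of the workflow the abstract describes (upload one dataset with no declared schema, infer column types from the data, then derive a shareable view in plain SQL), the sketch below uses Python's built-in sqlite3 module. It is not SQLShare's implementation: the helper names infer_type and ingest, the sample data, and the view name are all hypothetical, and the type-inference rule is a deliberately simple assumption.

```python
import csv
import io
import sqlite3


def infer_type(values):
    """Pick the narrowest SQLite type that fits every non-empty value (assumed rule)."""
    non_empty = [v for v in values if v != ""]
    if not non_empty:
        return "TEXT"
    for sql_type, cast in (("INTEGER", int), ("REAL", float)):
        try:
            for v in non_empty:
                cast(v)
            return sql_type
        except ValueError:
            continue
    return "TEXT"


def ingest(conn, table, csv_text):
    """Load one CSV dataset into a table whose schema is inferred from the data.

    Assumes a header row and rectangular data (every row has the same width).
    """
    rows = list(csv.reader(io.StringIO(csv_text)))
    header, data = rows[0], rows[1:]
    columns = list(zip(*data)) if data else [()] * len(header)
    types = [infer_type(col) for col in columns]
    col_defs = ", ".join(f'"{name}" {t}' for name, t in zip(header, types))
    conn.execute(f'CREATE TABLE "{table}" ({col_defs})')
    placeholders = ", ".join("?" for _ in header)
    # Store empty fields as NULL so they do not interfere with numeric columns.
    cleaned = [[v if v != "" else None for v in row] for row in data]
    conn.executemany(f'INSERT INTO "{table}" VALUES ({placeholders})', cleaned)
    return dict(zip(header, types))


conn = sqlite3.connect(":memory:")

# Dataset-at-a-time ingest: no schema is written by hand.
schema = ingest(
    conn,
    "samples",
    "station,depth_m,temp_c\nA,5,12.5\nB,10,\nA,20,9.0\n",
)

# A derived result is saved as a view, which others could query or build on.
conn.execute(
    "CREATE VIEW station_summary AS "
    "SELECT station, COUNT(*) AS n, AVG(temp_c) AS mean_temp "
    "FROM samples GROUP BY station"
)
summary = conn.execute(
    "SELECT * FROM station_summary ORDER BY station"
).fetchall()

print(schema)
print(summary)
```

The view stores only the query, not a copy of the data, so a downstream user who selects from station_summary always sees results computed from the current contents of samples.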

Supplementary Material

ReadMe (readme.txt)
Rights information
Query Workload Analysis Master (query-workload-analysis-master.zip)
Graphs, Plots, Results

Information

Published In

SIGMOD '16: Proceedings of the 2016 International Conference on Management of Data
June 2016
2300 pages
ISBN:9781450335317
DOI:10.1145/2882903
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication Notes

Badge change: Article originally badged under Version 1.0 guidelines https://www.acm.org/publications/policies/artifact-review-badging

Author Tags

  1. database management as a cloud service
  2. database management systems
  3. relational databases

Qualifiers

  • Research-article

Conference

SIGMOD/PODS'16
SIGMOD/PODS'16: International Conference on Management of Data
June 26 - July 1, 2016
San Francisco, California, USA

Acceptance Rates

Overall Acceptance Rate 785 of 4,003 submissions, 20%
