SCANRAW: A Database Meta-Operator for Parallel In-Situ Processing and Loading

Authors:
Yu Cheng, University of California, Merced, CA
Florin Rusu, University of California, Merced, CA

ACM Transactions on Database Systems, Volume 40, Issue 3, Article No. 19, pp. 1–45. https://doi.org/10.1145/2818181

Published: 23 October 2015

Abstract

Traditional databases incur a significant data-to-query delay due to the requirement to load data inside the system before querying. Since this is not acceptable in many domains generating massive amounts of raw data (e.g., genomics), databases are entirely discarded. External tables, on the other hand, provide instant SQL querying over raw files. Their performance across a query workload is limited, however, by the cost of repeatedly scanning, tokenizing, and parsing the entire file.

In this article, we propose SCANRAW, a novel database meta-operator for in-situ processing over raw files that integrates data loading and external tables seamlessly, while preserving their advantages: optimal performance across a query workload and zero time-to-query. We decompose loading and external table processing into atomic stages in order to identify common functionality. We analyze alternative implementations and discuss possible optimizations for each stage. Our major contribution is a parallel superscalar pipeline implementation that allows SCANRAW to take advantage of the current many- and multicore processors by overlapping the execution of independent stages. Moreover, SCANRAW overlaps query processing with loading by speculatively using the additional I/O bandwidth arising during the conversion process for storing data into the database, such that subsequent queries execute faster. As a result, SCANRAW makes intelligent use of the available system resources—CPU cycles and I/O bandwidth—by switching dynamically between tasks to ensure that optimal performance is achieved. We implement SCANRAW in a state-of-the-art database system and evaluate its performance across a variety of synthetic and real-world datasets. Our results show that SCANRAW with speculative loading achieves the best-possible performance for a query sequence at any point in the processing. Moreover, SCANRAW maximizes resource utilization for the entire workload execution while speculatively loading data and without interfering with normal query processing.
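The pipelining idea in the abstract can be illustrated with a small sketch. It is not the SCANRAW implementation: the stage names (read, tokenize, parse, write), the chunk size, the comma delimiter, the integer-only schema, and the "load only if the writer is idle" policy are all assumptions made for illustration. Each stage runs in its own thread and hands chunks to the next through a bounded queue, so conversion stages overlap; the query consumes parsed chunks and opportunistically passes them to a writer that stands in for database storage. The sketch uses one thread per stage and does not reproduce the superscalar aspect; in CPython, threads also show the structure rather than real CPU parallelism.

```python
import queue
import threading

CHUNK_LINES = 2          # hypothetical chunk size (lines per chunk)
DONE = object()          # sentinel marking end of stream


def read_stage(lines, out_q):
    """READ: split the raw input into chunks of lines."""
    for i in range(0, len(lines), CHUNK_LINES):
        out_q.put((i // CHUNK_LINES, lines[i:i + CHUNK_LINES]))
    out_q.put(DONE)


def tokenize_stage(in_q, out_q):
    """TOKENIZE: split each line into string fields (comma-delimited, assumed)."""
    while (item := in_q.get()) is not DONE:
        chunk_id, chunk = item
        out_q.put((chunk_id, [line.split(",") for line in chunk]))
    out_q.put(DONE)


def parse_stage(in_q, out_q):
    """PARSE: convert string fields into typed (integer) values."""
    while (item := in_q.get()) is not DONE:
        chunk_id, rows = item
        out_q.put((chunk_id, [[int(f) for f in row] for row in rows]))
    out_q.put(DONE)


def run_query(lines):
    """Sum every value in the file while opportunistically loading chunks."""
    q1, q2, q3 = (queue.Queue(maxsize=4) for _ in range(3))
    write_q = queue.Queue(maxsize=1)   # one in-flight write models limited I/O
    database = {}                      # stand-in for database storage

    def write_stage():
        # WRITE: store converted chunks so later queries could skip conversion.
        while (item := write_q.get()) is not DONE:
            chunk_id, rows = item
            database[chunk_id] = rows

    stages = [
        threading.Thread(target=read_stage, args=(lines, q1)),
        threading.Thread(target=tokenize_stage, args=(q1, q2)),
        threading.Thread(target=parse_stage, args=(q2, q3)),
        threading.Thread(target=write_stage),
    ]
    for t in stages:
        t.start()

    total = 0
    while (item := q3.get()) is not DONE:
        _, rows = item
        total += sum(sum(row) for row in rows)   # the query itself
        try:
            write_q.put_nowait(item)             # load only if the writer is free
        except queue.Full:
            pass                                 # otherwise the chunk stays raw
    write_q.put(DONE)                            # blocks until the writer has room
    for t in stages:
        t.join()
    return total, database
```

Calling `run_query(["1,2,3", "4,5,6", "7,8,9"])` returns a total of 45 together with whichever chunks the writer managed to store. Which chunks are loaded depends on timing, which mirrors the speculative nature of the policy: the query result never depends on whether a given chunk was written.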

Index Terms

1. Information systems
  1. Data management systems
    1. Data structures
      1. Data access methods
    2. Database management system engines
      1. Database query processing
2. Theory of computation
  1. Theory and algorithms for application domains
    1. Database theory
      1. Database query processing and optimization (theory)

Published in
ACM Transactions on Database Systems Volume 40, Issue 3
October 2015
247 pages
ISSN: 0362-5915
EISSN: 1557-4644
DOI: 10.1145/2838914
Editor:
Christian S. Jensen
Aalborg University, Denmark
Copyright © 2015 ACM
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 23 October 2015
- Accepted: 1 April 2015
- Revised: 1 February 2015
- Received: 1 July 2014

Author Tags
Data access operator
access path
data loading
database operator
external table
system
Qualifiers
- research-article
- Research
- Refereed