Abstract
Traditional databases incur a significant data-to-query delay because data must be loaded into the system before it can be queried. Since this delay is not acceptable in many domains that generate massive amounts of raw data (e.g., genomics), databases are discarded entirely. External tables, on the other hand, provide instant SQL querying over raw files. Their performance across a query workload, however, is limited by the speed of repeated full scans, tokenizing, and parsing of the entire file.
In this article, we propose SCANRAW, a novel database meta-operator for in-situ processing over raw files that integrates data loading and external tables seamlessly, while preserving their advantages: optimal performance across a query workload and zero time-to-query. We decompose loading and external table processing into atomic stages in order to identify common functionality. We analyze alternative implementations and discuss possible optimizations for each stage. Our major contribution is a parallel superscalar pipeline implementation that allows SCANRAW to take advantage of current many- and multicore processors by overlapping the execution of independent stages. Moreover, SCANRAW overlaps query processing with loading: it speculatively uses the additional I/O bandwidth that arises during the conversion process to store data in the database, so that subsequent queries execute faster. As a result, SCANRAW makes intelligent use of the available system resources (CPU cycles and I/O bandwidth) by switching dynamically between tasks to ensure that optimal performance is achieved. We implement SCANRAW in a state-of-the-art database system and evaluate its performance across a variety of synthetic and real-world datasets. Our results show that SCANRAW with speculative loading achieves the best possible performance for a query sequence at any point in the processing. Moreover, SCANRAW maximizes resource utilization for the entire workload execution while speculatively loading data, without interfering with normal query processing.
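The idea of overlapping conversion stages while loading converted data on the side can be illustrated with a toy sketch. This is not the SCANRAW implementation: the stage functions, the in-memory `store` dictionary standing in for the database, and the `load_budget` parameter standing in for spare I/O bandwidth are all assumptions made for illustration. Tokenizing and parsing run in separate threads connected by bounded queues, and a limited number of converted chunks are kept so that a later query can skip their conversion.

```python
import queue
import threading

DONE = object()  # end-of-stream marker passed between stages


def run_stage(func, inbox, outbox):
    """Apply func to each item from inbox and forward the result to outbox."""
    while True:
        item = inbox.get()
        if item is DONE:
            outbox.put(DONE)
            return
        outbox.put(func(item))


def tokenize(chunk):
    """Split the raw text of a chunk into string fields."""
    chunk_id, text = chunk
    return chunk_id, [line.split(",") for line in text.splitlines()]


def parse(chunk):
    """Convert string fields into integers (the 'binary' form in this toy)."""
    chunk_id, rows = chunk
    return chunk_id, [[int(field) for field in row] for row in rows]


def sum_column(raw_chunks, store, column, load_budget):
    """Sum one column, converting only the chunks that are missing from store.

    load_budget is a hypothetical stand-in for spare I/O bandwidth: at most
    that many converted chunks are written to store during this query.
    """
    missing = [(i, text) for i, text in enumerate(raw_chunks) if i not in store]
    # Chunks loaded by earlier queries are read directly, with no conversion.
    total = sum(row[column] for rows in store.values() for row in rows)

    to_tokenize = queue.Queue(maxsize=2)
    to_parse = queue.Queue(maxsize=2)
    converted = queue.Queue(maxsize=2)

    def read():
        for chunk in missing:
            to_tokenize.put(chunk)
        to_tokenize.put(DONE)

    threads = [
        threading.Thread(target=read),
        threading.Thread(target=run_stage, args=(tokenize, to_tokenize, to_parse)),
        threading.Thread(target=run_stage, args=(parse, to_parse, converted)),
    ]
    for thread in threads:
        thread.start()

    # The query consumes converted chunks while earlier stages keep running.
    while True:
        item = converted.get()
        if item is DONE:
            break
        chunk_id, rows = item
        total += sum(row[column] for row in rows)
        if load_budget > 0:  # load on the side while budget remains
            store[chunk_id] = rows
            load_budget -= 1

    for thread in threads:
        thread.join()
    return total


raw = ["1,2\n3,4", "5,6", "7,8\n9,10", "11,12"]  # four raw text chunks
store = {}
first = sum_column(raw, store, column=1, load_budget=2)
second = sum_column(raw, store, column=1, load_budget=2)
print(first, second, sorted(store))  # prints: 42 42 [0, 1, 2, 3]
```

In this sketch the first query converts all four chunks and stores two of them; the second query reads those two from `store`, converts only the remaining two, and stores them as well. A fixed budget is used here only to keep the example deterministic, whereas the article describes loading that adapts dynamically to the available resources.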
Index Terms
- SCANRAW: A Database Meta-Operator for Parallel In-Situ Processing and Loading