SCANRAW: A Database Meta-Operator for Parallel In-Situ Processing and Loading

Published: 23 October 2015

Abstract

Traditional databases incur a significant data-to-query delay due to the requirement to load data inside the system before querying. Since this is not acceptable in many domains generating massive amounts of raw data (e.g., genomics), databases are entirely discarded. External tables, on the other hand, provide instant SQL querying over raw files. Their performance across a query workload is limited, however, because every query repeats the full scan, tokenizing, and parsing of the entire file.

In this article, we propose SCANRAW, a novel database meta-operator for in-situ processing over raw files that integrates data loading and external tables seamlessly, while preserving their advantages: optimal performance across a query workload and zero time-to-query. We decompose loading and external table processing into atomic stages in order to identify common functionality. We analyze alternative implementations and discuss possible optimizations for each stage. Our major contribution is a parallel superscalar pipeline implementation that allows SCANRAW to take advantage of the current many- and multicore processors by overlapping the execution of independent stages. Moreover, SCANRAW overlaps query processing with loading by speculatively using the additional I/O bandwidth arising during the conversion process for storing data into the database, such that subsequent queries execute faster. As a result, SCANRAW makes intelligent use of the available system resources—CPU cycles and I/O bandwidth—by switching dynamically between tasks to ensure that optimal performance is achieved. We implement SCANRAW in a state-of-the-art database system and evaluate its performance across a variety of synthetic and real-world datasets. Our results show that SCANRAW with speculative loading achieves the best-possible performance for a query sequence at any point in the processing. Moreover, SCANRAW maximizes resource utilization for the entire workload execution while speculatively loading data and without interfering with normal query processing.
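
The pipeline idea described above can be illustrated with a minimal sketch. It is not the authors' implementation: the names `scan_raw`, `tokenize`, `parse`, and `database` are hypothetical, the "database" is a plain dictionary, and the rule for when to load (write a converted chunk whenever the reader stalls on a full buffer) is an assumption made for illustration. The sketch shows a READ stage feeding parallel conversion workers, with stalled read time used to store already converted chunks so that a later scan can skip tokenizing and parsing for them.

```python
import queue
import threading


def tokenize(chunk, delimiter=","):
    """TOKENIZE stage: split each raw line into string fields."""
    return [line.split(delimiter) for line in chunk]


def parse(token_rows):
    """PARSE stage: convert string fields into integers."""
    return [tuple(int(field) for field in row) for row in token_rows]


def scan_raw(lines, database, chunk_size=2, workers=2, buffer_size=1):
    """Scan raw lines and return all parsed rows in file order.

    `database` is a dict standing in for loaded storage
    (chunk id -> parsed rows); chunks found there skip conversion.
    """
    raw_buffer = queue.Queue(maxsize=buffer_size)  # READ -> conversion workers
    converted = {}                                 # chunk id -> parsed rows
    lock = threading.Lock()

    def convert_worker():
        while True:
            item = raw_buffer.get()
            if item is None:  # sentinel: no more chunks
                return
            chunk_id, chunk = item
            rows = parse(tokenize(chunk))
            with lock:
                converted[chunk_id] = rows

    def speculative_write():
        """Store one converted, not-yet-loaded chunk while READ is stalled."""
        with lock:
            pending = [cid for cid in converted if cid not in database]
            if pending:
                database[pending[0]] = converted[pending[0]]

    threads = [threading.Thread(target=convert_worker) for _ in range(workers)]
    for t in threads:
        t.start()

    # READ stage, run in the calling thread.
    for start in range(0, len(lines), chunk_size):
        chunk_id = start // chunk_size
        if chunk_id in database:  # already loaded: no tokenizing or parsing
            with lock:
                converted[chunk_id] = database[chunk_id]
            continue
        item = (chunk_id, lines[start:start + chunk_size])
        while True:
            try:
                raw_buffer.put(item, timeout=0.001)
                break
            except queue.Full:
                # Assumption: a full buffer means conversion is the
                # bottleneck, so the stall is spent loading a chunk.
                speculative_write()

    for _ in threads:
        raw_buffer.put(None)
    for t in threads:
        t.join()
    return [row for cid in sorted(converted) for row in converted[cid]]
```

Calling `scan_raw(lines, db)` twice with the same dictionary `db` returns the same rows both times. How many chunks end up in `db` after the first call depends on thread timing, which mirrors the speculative nature of the loading: only stalls in reading are used for writes.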


          Published in

          ACM Transactions on Database Systems, Volume 40, Issue 3
          October 2015
          247 pages
          ISSN: 0362-5915
          EISSN: 1557-4644
          DOI: 10.1145/2838914

          Copyright © 2015 ACM

          Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

          Publisher

          Association for Computing Machinery

          New York, NY, United States

          Publication History

          • Published: 23 October 2015
          • Accepted: 1 April 2015
          • Revised: 1 February 2015
          • Received: 1 July 2014

          Qualifiers

          • research-article
          • Research
          • Refereed
