PARADISE: Big data analytics using the DBMS tightly integrated with the distributed file system

Kim, Jun-Sung; Whang, Kyu-Young; Kwon, Hyuk-Yoon; Song, Il-Yeol

doi:10.1007/s11280-014-0312-2

Published: 11 December 2014

World Wide Web, Volume 19, pages 299–322 (2016)

Abstract

There has been a lot of research on MapReduce for big data analytics. This new class of systems sacrifices DBMS functionality such as query languages, schemas, or indexes in order to maximize scalability and parallelism. However, because rich DBMS functionality is also considered important for big data analytics, there have been many efforts to support DBMS functionality in MapReduce. HadoopDB is the only work that directly utilizes the DBMS for big data analytics in the MapReduce framework, taking advantage of both the DBMS and MapReduce. However, HadoopDB does not support sharability for the entire data since it stores the data across multiple nodes in a shared-nothing manner; i.e., it partitions a job into multiple tasks where each task is assigned to a fragment of data. Due to this limitation, HadoopDB cannot effectively process queries that require internode communication. That is, HadoopDB needs to re-load the entire data to process some queries (e.g., 2-way joins) and cannot support some complex queries at all (e.g., 3-way joins). In this paper, we propose a new notion of the DFS-integrated DBMS, where a DBMS is tightly integrated with the distributed file system (DFS). By using the DFS-integrated DBMS, we can obtain sharability of the entire data. That is, a DBMS process in the system can access any data since multiple DBMSs run on an integrated storage system in the DFS. To process big data analytics in parallel, our approach uses the MapReduce framework on top of a DFS-integrated DBMS. We call this framework PARADISE. In PARADISE, we employ a job splitting method that logically splits a job based on the predicate in the integrated storage system. This contrasts with physical splitting in HadoopDB. We also propose the notion of locality mapping for further optimization of logical splitting. We show that PARADISE effectively overcomes the drawbacks of HadoopDB by identifying the following strengths. (1) It has significantly faster (by up to 6.41 times) amortized query processing performance since it obviates the need to re-load data, which is required in HadoopDB. (2) It supports query types more complex than the ones supported by HadoopDB.
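The abstract does not spell out how logical splitting or locality mapping are implemented, so the following Python sketch is only an illustration under stated assumptions. It assumes a job is split into contiguous key ranges, each expressed as a predicate over shared storage, and that a task is mapped to the node holding most of the keys in its range. The names `LogicalSplit`, `logical_splits`, `locality_mapping`, and `block_owner` are hypothetical and do not come from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LogicalSplit:
    """One task, described by a key-range predicate rather than a data fragment."""
    low: int   # inclusive lower bound of the key range
    high: int  # exclusive upper bound of the key range

    def predicate(self, column: str) -> str:
        # The task reads shared storage through this predicate (assumed form).
        return f"{column} >= {self.low} AND {column} < {self.high}"


def logical_splits(key_min: int, key_max: int, num_tasks: int) -> list:
    """Cut [key_min, key_max) into at most num_tasks contiguous key ranges."""
    span = key_max - key_min
    bounds = [key_min + span * i // num_tasks for i in range(num_tasks + 1)]
    return [LogicalSplit(lo, hi) for lo, hi in zip(bounds, bounds[1:]) if lo < hi]


def locality_mapping(splits, block_owner) -> dict:
    """Assign each split to the node that stores most of its keys locally.

    block_owner is a hypothetical function mapping a key to a node name.
    Ties are broken by node name so the result is deterministic.
    """
    assignment = {}
    for split in splits:
        counts = {}
        for key in range(split.low, split.high):
            node = block_owner(key)
            counts[node] = counts.get(node, 0) + 1
        assignment[split] = max(sorted(counts), key=counts.get)
    return assignment


if __name__ == "__main__":
    splits = logical_splits(0, 100, 4)
    mapping = locality_mapping(splits, lambda key: f"node{key // 50}")
    for split in splits:
        print(mapping[split], "<-", split.predicate("id"))
```

Because every task is defined by a predicate over storage that all DBMS processes can reach, any node could in principle run any task; in this sketch the mapping step only chooses the node expected to read the least remote data.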

Author information

Authors and Affiliations

Department of Computer Science, KAIST, Daejeon, Korea
Jun-Sung Kim, Kyu-Young Whang & Hyuk-Yoon Kwon
College of Computing & Informatics, Drexel University, Philadelphia, USA
Il-Yeol Song

Corresponding author

Correspondence to Kyu-Young Whang.

About this article

Cite this article

Kim, JS., Whang, KY., Kwon, HY. et al. PARADISE: Big data analytics using the DBMS tightly integrated with the distributed file system. World Wide Web 19, 299–322 (2016). https://doi.org/10.1007/s11280-014-0312-2

Received: 10 June 2014
Revised: 07 October 2014
Accepted: 18 November 2014
Published: 11 December 2014
Issue Date: May 2016
DOI: https://doi.org/10.1007/s11280-014-0312-2
