PARADISE: Big data analytics using the DBMS tightly integrated with the distributed file system

World Wide Web

Abstract

MapReduce has been studied extensively for big data analytics. This new class of systems sacrifices DBMS functionality such as query languages, schemas, and indexes in order to maximize scalability and parallelism. However, because rich DBMS functionality is also considered important for big data analytics, there have been many efforts to support it in MapReduce. HadoopDB is the only work that directly utilizes the DBMS for big data analytics in the MapReduce framework, taking advantage of both the DBMS and MapReduce. However, HadoopDB does not support sharability of the entire data because it stores the data across multiple nodes in a shared-nothing manner; that is, it partitions a job into multiple tasks, each of which is assigned to a fragment of the data. Due to this limitation, HadoopDB cannot effectively process queries that require internode communication: it needs to re-load the entire data to process some queries (e.g., 2-way joins) and cannot support some complex queries (e.g., 3-way joins). In this paper, we propose a new notion of the DFS-integrated DBMS, in which a DBMS is tightly integrated with the distributed file system (DFS). By using the DFS-integrated DBMS, we obtain sharability of the entire data. That is, any DBMS process in the system can access any data, since multiple DBMSs run on an integrated storage system in the DFS. To process big data analytics in parallel, our approach uses the MapReduce framework on top of a DFS-integrated DBMS. We call this framework PARADISE. In PARADISE, we employ a job splitting method that logically splits a job based on the predicate in the integrated storage system. This contrasts with the physical splitting used in HadoopDB. We also propose the notion of locality mapping for further optimization of logical splitting. We show that PARADISE effectively overcomes the drawbacks of HadoopDB by identifying the following strengths.
(1) Its amortized query processing performance is significantly faster (by up to 6.41 times) since it obviates the need to re-load data, which HadoopDB requires. (2) It supports more complex query types than HadoopDB does.
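
The abstract does not describe how logical splitting is carried out, so the following is only a rough sketch of the general idea, not the authors' implementation. It assumes that a job can be divided by attaching a disjoint key-range predicate to the same query for each task, which is possible only when every task can reach all of the data. The names SubQuery and split_by_predicate, the table lineitem, and the key orderkey are hypothetical and chosen purely for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class SubQuery:
    """One logical task: the same query restricted to a key range."""
    task_id: int
    low: int   # inclusive lower bound of the key range
    high: int  # exclusive upper bound of the key range

    def to_sql(self, base_query: str, key: str) -> str:
        # Attach the range predicate that defines this task's share of the job.
        return f"{base_query} WHERE {key} >= {self.low} AND {key} < {self.high}"


def split_by_predicate(key_min: int, key_max: int, num_tasks: int) -> List[SubQuery]:
    """Split the inclusive key range [key_min, key_max] into contiguous,
    non-overlapping ranges, one per task (hypothetical helper)."""
    if num_tasks <= 0:
        raise ValueError("num_tasks must be positive")
    if key_max < key_min:
        raise ValueError("key_max must be >= key_min")
    total = key_max - key_min + 1
    base, extra = divmod(total, num_tasks)
    tasks = []
    low = key_min
    for task_id in range(num_tasks):
        # The first `extra` tasks take one additional key each.
        size = base + (1 if task_id < extra else 0)
        tasks.append(SubQuery(task_id, low, low + size))
        low += size
    return tasks


if __name__ == "__main__":
    for task in split_by_predicate(1, 10, 3):
        print(task.to_sql("SELECT * FROM lineitem", "orderkey"))
```

Because each task is defined by a predicate rather than by a physical data fragment, changing the number of tasks or the split key in this sketch only changes the generated predicates; no data would have to be re-partitioned or re-loaded.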

Author information

Corresponding author

Correspondence to Kyu-Young Whang.

About this article

Cite this article

Kim, JS., Whang, KY., Kwon, HY. et al. PARADISE: Big data analytics using the DBMS tightly integrated with the distributed file system. World Wide Web 19, 299–322 (2016). https://doi.org/10.1007/s11280-014-0312-2

  • DOI: https://doi.org/10.1007/s11280-014-0312-2
