DOI: 10.1145/2483574.2483578
research-article

DeepSea: self-adaptive data partitioning and replication in scalable distributed data systems

Published: 22 June 2013

ABSTRACT

While originally proposed to provide fault-tolerance and scalability for data analysis queries on unstructured data over massive clusters, MapReduce systems today are being used for analysis of rich combinations of unstructured, semi-structured and structured data. To achieve performance on these new workloads, MapReduce systems (and the distributed file systems on which they are built) can no longer rely on static data placement strategies. In this thesis, we propose new physical data independence and adaptive data tuning solutions that can greatly improve the performance of analysis queries in systems where workloads are not static and where workloads may include complex queries with overlapping or related computations (subqueries). While profiting from the work on physical data independence in relational systems, we propose novel strategies that recognize the central role of data partitioning (and co-partitioning) in shared-nothing distributed file systems.

    • Published in

      SIGMOD'13 PhD Symposium: Proceedings of the 2013 SIGMOD/PODS Ph.D. symposium
      June 2013
      78 pages
      ISBN:9781450321556
      DOI:10.1145/2483574
      • Program Chairs:
      • Lei Chen,
      • Xin Luna Dong

      Copyright © 2013 ACM

      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Acceptance Rates

      SIGMOD'13 PhD Symposium Paper Acceptance Rate: 12 of 26 submissions, 46%
      Overall Acceptance Rate: 40 of 60 submissions, 67%
