ABSTRACT
Although MapReduce systems were originally proposed to provide fault tolerance and scalability for analysis queries on unstructured data over massive clusters, they are now used to analyze rich combinations of unstructured, semi-structured, and structured data. To perform well on these new workloads, MapReduce systems (and the distributed file systems on which they are built) can no longer rely on static data placement strategies. In this thesis, we propose new physical data independence and adaptive data tuning solutions that can greatly improve the performance of analysis queries in systems where workloads are not static and may include complex queries with overlapping or related computations (subqueries). While building on work on physical data independence in relational systems, we propose novel strategies that recognize the central role of data partitioning (and co-partitioning) in shared-nothing distributed file systems.
Index Terms
- DeepSea: self-adaptive data partitioning and replication in scalable distributed data systems