research-article

DeepSea: self-adaptive data partitioning and replication in scalable distributed data systems

Author:
Jiang Du

University of Toronto, Toronto, ON, Canada


SIGMOD'13 PhD Symposium: Proceedings of the 2013 SIGMOD/PODS Ph.D. symposium, June 2013, Pages 7–12
https://doi.org/10.1145/2483574.2483578

Published: 22 June 2013

ABSTRACT

While originally proposed to provide fault-tolerance and scalability for data analysis queries on unstructured data over massive clusters, MapReduce systems today are being used for analysis of rich combinations of unstructured, semi-structured and structured data. To achieve performance on these new workloads, MapReduce systems (and the distributed file systems on which they are built) can no longer rely on static data placement strategies. In this thesis, we propose new physical data independence and adaptive data tuning solutions that can greatly improve the performance of analysis queries in systems where workloads are not static and where workloads may include complex queries with overlapping or related computations (subqueries). While profiting from the work on physical data independence in relational systems, we propose novel strategies that recognize the central role of data partitioning (and co-partitioning) in shared-nothing distributed file systems.

Index Terms

1. Information systems
  1. Data management systems
    1. Database management system engines

Published in
SIGMOD'13 PhD Symposium: Proceedings of the 2013 SIGMOD/PODS Ph.D. symposium
June 2013
78 pages
ISBN: 9781450321556
DOI: 10.1145/2483574
Program Chairs:
Lei Chen, Hong Kong University of Science and Technology, China
Xin Luna Dong, Google Inc., USA
Copyright © 2013 ACM
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
materialization
partitioning
shared-nothing distributed systems
Acceptance Rates
SIGMOD'13 PhD Symposium paper acceptance rate: 12 of 26 submissions, 46%
Overall acceptance rate: 40 of 60 submissions, 67%