ABSTRACT
We study the problem of distributing the tuples of a relation to a number of processors organized in an r-dimensional hypercube, which is an important task for parallel join processing. In contrast to previous work, which proposed randomized algorithms for the task, we ask how to construct efficient deterministic distribution strategies that can optimally load balance the input relation. We first present some general lower bounds on the load for any dimension; these bounds depend not only on the size of the relation, but also on the maximum frequency of each value in the relation. We then construct an algorithm for the case of 1 dimension that is optimal within a constant factor, and an algorithm for the case of 2 dimensions that is optimal within a polylogarithmic factor. Our 2-dimensional algorithm is based on a connection with the vector load balancing problem, a well-studied problem that generalizes classic load balancing.
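To make the 1-dimensional setting concrete, the following is a minimal sketch and not the algorithm from the paper. It assumes that every tuple carrying the same value must be sent to the same processor, so that the load is at least max(ceil(m/p), f_max) for m tuples, p processors, and maximum value frequency f_max. It then applies the classic greedy rule of assigning the most frequent remaining value to the least-loaded processor, which keeps the maximum load below m/p + f_max, i.e. within a factor of 2 of that bound. The function names `greedy_partition` and `load_lower_bound` are hypothetical.

```python
import heapq
from collections import Counter


def load_lower_bound(values, p):
    """Lower bound on the maximum load, ASSUMING all tuples with the same
    value are co-located: max(ceil(m / p), maximum value frequency)."""
    freq = Counter(values)
    m = len(values)
    return max(-(-m // p), max(freq.values(), default=0))


def greedy_partition(values, p):
    """Deterministically assign each distinct value to one of p processors.

    Illustrative greedy heuristic (largest frequency first, least-loaded
    processor first); it is not taken from the paper.
    """
    freq = Counter(values)
    # Min-heap of (current load, processor id); ties go to the lowest id.
    heap = [(0, proc) for proc in range(p)]
    heapq.heapify(heap)
    assignment = {}
    loads = [0] * p
    # Sort by descending frequency; repr() breaks ties deterministically.
    for value, count in sorted(freq.items(), key=lambda kv: (-kv[1], repr(kv[0]))):
        load, proc = heapq.heappop(heap)
        assignment[value] = proc
        loads[proc] = load + count
        heapq.heappush(heap, (loads[proc], proc))
    return assignment, loads


if __name__ == "__main__":
    column = ["a"] * 5 + ["b"] * 3 + ["c"] * 2 + ["d"] * 2
    assignment, loads = greedy_partition(column, p=2)
    print(assignment, loads, load_lower_bound(column, p=2))
```

In this example the skewed value "a" alone accounts for 5 of the 12 tuples, which illustrates why a bound that depends only on the relation size would be too optimistic.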
Index Terms
- Deterministic load balancing for parallel joins