ABSTRACT
Online aggregation is a promising solution to achieving fast early responses for interactive ad-hoc queries that compute aggregates on a large amount of data. Essential to the success of online aggregation is a good non-blocking join algorithm that enables both (i) high early result rates with statistical guarantees and (ii) fast end-to-end query times. We analyze existing non-blocking join algorithms and find that they all provide sub-optimal early result rates, and those with fast end-to-end times achieve them only by further sacrificing their early result rates.
We propose a new non-blocking join algorithm, Partitioned expanding Ripple Join (PR-Join), which achieves considerably higher early result rates than previous non-blocking joins, while also delivering fast end-to-end query times. PR-Join performs separate, ripple-like join operations on individual hash partitions, where the width of a ripple expands multiplicatively over time. This contrasts with the non-partitioned, fixed-width ripples of Block Ripple Join. Assuming, as in previous non-blocking join studies, that the input relations are in random order, PR-Join ensures representative early results that are amenable to statistical guarantees. We show both analytically and with real-machine experiments that PR-Join achieves over an order of magnitude higher early result rates than previous non-blocking joins. We also discuss the benefits of using a flash-based SSD for temporary storage, showing that PR-Join can then achieve close to optimal end-to-end performance. Finally, we consider the joining of finite data streams that arrive over time, and find that PR-Join achieves similar or higher result rates than RPJ, the state-of-the-art algorithm specialized for that domain.
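The idea of ripple-like joins on hash partitions, with a ripple width that grows multiplicatively, can be illustrated with a small in-memory sketch. This is not the paper's implementation. The function names, the growth factor of 2, the starting width of 1, the dictionary row format, and the choice to process partitions one after another are all assumptions made for illustration. The sketch also ignores disk I/O, memory limits, and the statistical estimators.

```python
def partition(rows, key, num_partitions):
    """Split rows into num_partitions lists by hashing the join key."""
    parts = [[] for _ in range(num_partitions)]
    for row in rows:
        parts[hash(row[key]) % num_partitions].append(row)
    return parts


def expanding_ripple_join(r, s, key, start_width=1, growth=2):
    """Yield matching (r_row, s_row) pairs from one partition.

    Each step reads `width` more rows from both inputs, then joins only
    the pairs that involve at least one newly read row, so no pair is
    produced twice. The width is multiplied by `growth` after each step
    (both defaults are assumed values, not taken from the paper).
    """
    r_seen, s_seen = 0, 0
    width = start_width
    while r_seen < len(r) or s_seen < len(s):
        r_new = r[r_seen:r_seen + width]
        s_new = s[s_seen:s_seen + width]
        # New R rows against every S row read so far (old and new).
        s_upto = s[:s_seen + len(s_new)]
        for a in r_new:
            for b in s_upto:
                if a[key] == b[key]:
                    yield a, b
        # Previously read R rows against the new S rows only.
        for a in r[:r_seen]:
            for b in s_new:
                if a[key] == b[key]:
                    yield a, b
        r_seen += len(r_new)
        s_seen += len(s_new)
        width *= growth


def partitioned_ripple_join(r, s, key, num_partitions=4):
    """Run an expanding ripple join on each pair of matching partitions."""
    r_parts = partition(r, key, num_partitions)
    s_parts = partition(s, key, num_partitions)
    for r_part, s_part in zip(r_parts, s_parts):
        yield from expanding_ripple_join(r_part, s_part, key)
```

Because the function is a generator, a caller can consume the first few join results and update a running aggregate before the join finishes, which is the property online aggregation relies on. Rows with equal keys always hash to the same partition, so joining partition pairs independently still produces the complete result.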