Abstract
In this paper, we propose FastThetaJoin, an optimization technique for the \(\theta \)-join operation on multi-way data streams, an essential query in many data analytics tasks. The \(\theta \)-join on multi-way data streams is notoriously difficult because it incurs a tremendous shuffle cost from data movement between multiple operation components, which makes it hard to implement efficiently in a distributed environment. Like previous methods, FastThetaJoin tries to minimize the number of \(\theta \)-joins, but it differs from them in how it makes partitions, deletes unnecessary data items, and performs the Cartesian product. As a result, FastThetaJoin not only minimizes the number of \(\theta \)-joins effectively but also substantially improves the efficiency of its operations in a distributed environment. We implemented FastThetaJoin in the framework of Spark Streaming, characterized by its efficient bucket implementation of parameterized windows. The experimental results show that, compared with existing solutions, our proposed method speeds up \(\theta \)-join processing while reducing its overhead. The specific effect of the optimization is correlated with the nature of the data streams: the greater the data difference is, the more apparent the optimization effect is.
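The abstract does not spell out the algorithm, so the following Python sketch only illustrates the general idea of using range constraints to avoid unnecessary \(\theta \)-join work. It assumes a single inequality condition `a < b` over two small in-memory batches of numbers. The function names (`naive_theta_join`, `range_partition`, `pruned_less_than_join`) are hypothetical, and the code is neither the authors' implementation nor Spark Streaming code.

```python
from itertools import product


def naive_theta_join(r, s, theta):
    """Baseline: evaluate theta on every pair of the Cartesian product."""
    return [(a, b) for a, b in product(r, s) if theta(a, b)]


def range_partition(values, num_parts):
    """Sort the values and cut them into contiguous, non-empty chunks."""
    ordered = sorted(values)
    size = max(1, -(-len(ordered) // num_parts))  # ceiling division
    return [ordered[i:i + size] for i in range(0, len(ordered), size)]


def pruned_less_than_join(r, s, num_parts=4):
    """Join on a < b, using each partition's [min, max] range to skip work.

    Returns the joined pairs and the number of pairwise comparisons made.
    This is an illustrative assumption, not the method from the paper.
    """
    r_parts = range_partition(r, num_parts)
    s_parts = range_partition(s, num_parts)
    result = []
    comparisons = 0
    for pr in r_parts:
        for ps in s_parts:
            if pr[0] >= ps[-1]:
                # min(pr) >= max(ps): no pair can satisfy a < b, skip it.
                continue
            if pr[-1] < ps[0]:
                # max(pr) < min(ps): every pair satisfies a < b, so the
                # plain Cartesian product is the answer for this pair.
                result.extend(product(pr, ps))
                continue
            # Ranges overlap: the condition must be checked pair by pair.
            for a, b in product(pr, ps):
                comparisons += 1
                if a < b:
                    result.append((a, b))
    return result, comparisons
```

In this sketch, the further apart the value ranges of the two batches are, the more partition pairs fall into the first two branches and the fewer pairwise comparisons remain. This is consistent with the abstract's remark that the benefit depends on the nature of the data, although the paper's actual mechanism may differ.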
Acknowledgment
This work is supported in part by the Key-Area Research and Development Program of Guangdong Province (2020B010164002), the Shenzhen Strategic Emerging Industry Development Funds (JCYJ20170818163026031), and the National Natural Science Foundation of China (61672513).
Copyright information
© 2020 Springer Nature Switzerland AG
About this paper
Cite this paper
Hu, Z., Fan, X., Wang, Y., Xu, C. (2020). FastThetaJoin: An Optimization on Multi-way Data Stream \(\theta \)-join with Range Constraints. In: Qiu, M. (eds) Algorithms and Architectures for Parallel Processing. ICA3PP 2020. Lecture Notes in Computer Science, vol 12452. Springer, Cham. https://doi.org/10.1007/978-3-030-60245-1_12
DOI: https://doi.org/10.1007/978-3-030-60245-1_12
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-60244-4
Online ISBN: 978-3-030-60245-1
eBook Packages: Mathematics and Statistics (R0)