FastThetaJoin: An Optimization on Multi-way Data Stream \(\theta \)-join with Range Constraints

  • Conference paper
Algorithms and Architectures for Parallel Processing (ICA3PP 2020)

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 12452)

Abstract

In this paper, we propose FastThetaJoin, an optimization technique for the \(\theta\)-join operation on multi-way data streams, an essential query in many data analytics tasks. The \(\theta\)-join operation on multi-way data streams is notoriously difficult because it incurs a large shuffle cost from data movement between multiple operation components, which makes it hard to implement efficiently in a distributed environment. Like previous methods, FastThetaJoin tries to minimize the number of \(\theta\)-joins, but it differs from them in how it makes partitions, deletes unnecessary data items, and performs the Cartesian product. FastThetaJoin not only effectively minimizes the number of \(\theta\)-joins but also substantially improves the efficiency of its operations in a distributed environment. We implemented FastThetaJoin in the Spark Streaming framework, characterized by its efficient bucket implementation of parameterized windows. The experimental results show that, compared with existing solutions, our proposed method speeds up \(\theta\)-join processing while reducing its overhead. The specific effects of the optimization are correlated with the nature of the data streams: the greater the data difference, the more apparent the optimization effect.
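To make the general idea concrete, the following is a minimal sketch of how range information can reduce the work of a \(\theta\)-join with the condition r < s over two window snapshots. It is not the FastThetaJoin algorithm, whose details are not given in this abstract; it only illustrates the three ingredients the abstract names (partitioning, discarding items that cannot match, and taking a Cartesian product where every pair must match). All function names, the bucket count, and the sample data are assumptions made for illustration.

```python
from itertools import product


def naive_theta_join(r_items, s_items):
    """Baseline: test r < s on every pair of the Cartesian product."""
    return [(r, s) for r, s in product(r_items, s_items) if r < s]


def partition(items, num_buckets):
    """Split the sorted items into contiguous, non-empty range buckets."""
    ordered = sorted(items)
    size = max(1, -(-len(ordered) // num_buckets))  # ceiling division
    return [ordered[i:i + size] for i in range(0, len(ordered), size)]


def range_pruned_theta_join(r_items, s_items, num_buckets=4):
    """Join on r < s, using bucket ranges to skip or shortcut comparisons.

    Returns the matching pairs and the number of element-level
    comparisons that were actually performed.
    """
    r_buckets = partition(r_items, num_buckets)
    s_buckets = partition(s_items, num_buckets)
    results = []
    comparisons = 0
    for r_bucket in r_buckets:
        for s_bucket in s_buckets:
            if r_bucket[0] >= s_bucket[-1]:
                # min(R bucket) >= max(S bucket): no pair can satisfy r < s.
                continue
            if r_bucket[-1] < s_bucket[0]:
                # max(R bucket) < min(S bucket): every pair satisfies r < s,
                # so emit the plain Cartesian product without comparing.
                results.extend(product(r_bucket, s_bucket))
                continue
            # Overlapping ranges: fall back to pairwise comparison.
            for r, s in product(r_bucket, s_bucket):
                comparisons += 1
                if r < s:
                    results.append((r, s))
    return results, comparisons


if __name__ == "__main__":
    # Hypothetical window contents of two streams.
    r_window = [1, 5, 9, 12, 20, 25]
    s_window = [3, 4, 10, 11, 30, 31]
    pairs, compared = range_pruned_theta_join(r_window, s_window, num_buckets=3)
    print(len(pairs), "matches using", compared, "pairwise comparisons")
```

In this sketch the pruned join returns the same pairs as the naive join, but only bucket pairs with overlapping ranges need element-level comparisons. This is consistent with the abstract's remark that the benefit depends on how different the data in the streams are: the less the bucket ranges overlap, the more pairs are resolved by range checks alone.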

Acknowledgment

This work is supported in part by the Key-Area Research and Development Program of Guangdong Province (2020B010164002), the Shenzhen Strategic Emerging Industry Development Funds (JCYJ20170818163026031), and the National Natural Science Foundation of China (61672513).

Author information

Corresponding author

Correspondence to Yang Wang.

Copyright information

© 2020 Springer Nature Switzerland AG

About this paper

Cite this paper

Hu, Z., Fan, X., Wang, Y., Xu, C. (2020). FastThetaJoin: An Optimization on Multi-way Data Stream \(\theta\)-join with Range Constraints. In: Qiu, M. (ed.) Algorithms and Architectures for Parallel Processing. ICA3PP 2020. Lecture Notes in Computer Science, vol 12452. Springer, Cham. https://doi.org/10.1007/978-3-030-60245-1_12
