Skip to main content
Log in

joinTree: A novel join-oriented multivariate operator for spatio-temporal data management in Flink

  • Published:
GeoInformatica Aims and scope Submit manuscript

Abstract

In the era of intelligent Internet, the management and analysis of massive spatio-temporal data is one of the important links to realize intelligent applications and build smart cities, in which the interaction of multi-source data is the basis of realizing spatio-temporal data management and analysis. As an important carrier to achieve the interactive calculation of massive data, Flink provides the advanced Operator Join to facilitate user program development. In a Flink job with multi-source data connection operations, the selection of join sequences and the data communication in the repartition phase are both key factors that affect the efficiency of the job. However, Flink does not provide any optimization mechanism for the two factors, which in turn leads to low job efficiency. If the enumeration method is used to find the optimal join sequence, the result will not be obtained in polynomial time, so the optimization effect cannot be achieved. We investigate the above problems, design and implement a more advanced Operator joinTree that can support multi-source data connection in Flink, and introduce two optimization strategies into the Operator. In summary, the advantages of our work are highlighted as follows: (1) the Operator enables Flink to support multi-source data connection operation, and reduces the amount of calculation and data communication by introducing lightweight optimization strategies to improve job efficiency; (2) with the optimization strategy for join sequence, the total running time can be reduced by 29% and the data communication can be reduced by 34% compared with traditional sequential execution; (3) the optimization strategy for data repartition can further enable the job to bring 35% performance improvement, and in the average case can reduce the data communication by 43%.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Fig. 1
Fig. 2
Fig. 3
Fig. 4
Fig. 5
Fig. 6
Fig. 7
Fig. 8

Similar content being viewed by others

Data availability

All data in the experiment is authoritative and available.

Code availability

All the codes in this research are available.

References

  1. Isaksen ET, Johansen BG (2021) Congestion pricing, air pollution, and individual-level behavioral responses. Memorandum

  2. Ye Y, Wang G, Chen L, Wang H (2015) Graph similarity search on large uncertain graph databases. Vldb Journal 24(2):271–296

    Article  Google Scholar 

  3. Delianidi M, Salampasis M, Diamantaras K, Siomos, T, Karaveli I (2021) A graph-based method for session-based recommendations

  4. Ye Y, Xiang L, Chen L, Sun Y, Wang G (2016) Rsknn: knn search on road networks by incorporating social influence. IEEE Transactions on Knowledge & Data Engineering 28(6):1575–1588

    Article  Google Scholar 

  5. Yuan Y, Lian X, Wang G, Chen L, Ma Y, Wang Y (2019) Weight-constrained route planning over time-dependent graphs. 2019 IEEE 35th international conference on data engineering (ICDE)

  6. Wang Y, Yuan Y, Wang H, Zhou X, Mu C, Wang G (2021) Constrained route planning over large multi-modal time-dependent networks. ICDE, 313–324

  7. Carbone P, Katsifodimos A, Kth, Sweden S, Tzoumas K (2015) Apache flink : Stream and batch processing in a single engine

  8. Failure H, Failure H, Access SD, Access SD, Sets LD, Sets LD, Model SC, Model SC, Computation M, Computation M (2007) The hadoop distributed file system: Architecture and design. Hadoop Project Website 11(11):1–10

    Google Scholar 

  9. Zaharia M, Chowdhury M, Franklin MJ, Shenker S, Stoica I (2010) Spark: Cluster computing with working sets

  10. Scheufele W, Moerkotte G, Seminargebaude A (1997) Constructing optimal bushy processing trees for join queries is np-hard (extended abstract)

  11. Dittrich J, Quiané-Ruiz J, Jindal A, Kargin Y, Setty V, Schad J (2010) Hadoop++: Making a yellow elephant run like a cheetah (without it even noticing). Proc. VLDB Endow 3(1):518–529

    Google Scholar 

  12. Eltabakh MY, Tian Y, Özcan F, Gemulla R, Krettek A, McPherson J (2011) Cohadoop: Flexible data placement and its exploitation in hadoop. Proc. VLDB Endow 4(9):575–585

    Article  Google Scholar 

  13. Kimmett B, Thomo A, Venkatesh S (2014) Three-way joins on mapreduce: An experimental study, 227–232

  14. Afrati FN, Ullman JD (2011) Optimizing multiway joins in a map-reduce environment. IEEE Transactions on Knowledge and Data Engineering 23(9):1282–1298

    Article  Google Scholar 

  15. Leis V, Radke B, Gubichev A, Mirchev A, Boncz PA, Kemper A, Neumann T (2018) Query optimization through the looking glass, and what we found running the join order benchmark. VLDB J 27(5):643–668

    Article  Google Scholar 

  16. Li N, Liu Y, Dong Y, Gu J (2008) Application of ant colony optimization algorithm to multi-join query optimization 5370:189–197

  17. Kadkhodaei H, Mahmoudi F (2011) A combination method for join ordering problem in relational databases using genetic algorithm and ant colony, 312–317

  18. A LD, A GW, A JX, A XW, A SH, B RZ (2012) Commapreduce: An improvement of mapreduce with lightweight communication mechanisms. In: International conference on database systems for advanced applications, pp. 224–247

  19. Michael L, Nejdl W, Papapetrou O, Siberski W (2007) Improving distributed join efficiency with extended bloom filter operations. In: 21st international conference on advanced information networking and applications (AINA 2007)

  20. Selinger PG, Astrahan MM, Chamberlin DD, Lorie, RA, Price TG (1979) Access path selection in a relational database management system, 23–34

  21. Vance B, Maier D (1996) Rapid bushy join-order optimization with cartesian products, 35–46

  22. Ahmed R, Sen R, Poess M, Chakkappen S (2014) Of snowstorms and bushy trees. Proc. VLDB Endow 7(13):1452–1461

    Article  Google Scholar 

  23. Blanas S, Li Y, Patel JM (2011) Design and evaluation of main memory hash join algorithms for multi-core cpus, 37–48

  24. Stutzle T, Hoos H (1999) Improving the ant system: A detailed report on the max-min ant system

  25. Barata M, Bernardino J, Furtado P (2015) An overview of decision support benchmarks: Tpc-ds. TPC-H and SSB 353:619–628

    Google Scholar 

Download references

Funding

This research was supported by the National Key R&D Program of China under Grant No. 2018YFB1004402; and the NSFC under Grant No. 61872072, 62072087, 61772124, 61932004, 61732003, and 61729201; and the Fundamental Research Funds for the Central Universities under Grant No. N2016009.

Author information

Authors and Affiliations

Authors

Contributions

Conceptualization, Hangxu Ji; software, Hangxu Ji; methodology, Hangxu Ji and Yuhai Zhao; supervision, Hangxu Ji and Shiye Wang; validation, Hangxu Ji and Shiye Wang; writing-original draft, Hangxu Ji; writing-review and editing, Gang Wu, George Y. Yuan, and Guoren Wang.

Corresponding author

Correspondence to Gang Wu.

Ethics declarations

Ethics approval

This article does not contain any studies involving human participants and/or animals by any of the authors.

Consent to participate

All authors have agreed to participate in the research described in this manuscript.

Consent for publication

All authors have read and agreed to the published version of the manuscript.

Conflict of interest

The authors declare no conflict of interest.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

Reprints and permissions

About this article

Check for updates. Verify currency and authenticity via CrossMark

Cite this article

Ji, H., Wu, G., Zhao, Y. et al. joinTree: A novel join-oriented multivariate operator for spatio-temporal data management in Flink. Geoinformatica 27, 107–132 (2023). https://doi.org/10.1007/s10707-022-00470-5

Download citation

  • Received:

  • Revised:

  • Accepted:

  • Published:

  • Issue Date:

  • DOI: https://doi.org/10.1007/s10707-022-00470-5

Keywords

Navigation