
Optimizing data query performance of Bi-cluster for large-scale scientific data in supercomputers

The Journal of Supercomputing

Abstract

Scientific exploration and discovery rely heavily on ever-growing datasets and powerful supercomputers. The surge in data volume poses major data management challenges for existing data query frameworks. Although many data management techniques have been developed to locate selected data records quickly, the time and space required to build and store their indexes are often too expensive. To address the problem of locating data in a parallel file system that manages large-scale scientific data, we propose an improved high-performance data query framework called “Bi-cluster+.” For index generation, we design a hierarchical index data structure that balances index granularity against index construction overhead. Based on the characteristics of the index offsets, we design a write load-balancing strategy. The hierarchical index is written independently and in parallel. In situ index generation is optimized through resource constraint analysis. For data retrieval, we propose optimization techniques that improve query performance, such as a strategy for merging and reading logical data blocks. In experiments with multiple scientific datasets on a supercomputer, our optimizations improve data query performance by up to a factor of 1.9 compared with the original Bi-cluster implementation. Bi-cluster+ also maintains good performance when scaled to 17496 cores.
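The abstract does not spell out how logical data blocks are merged before reading, so the following is only a minimal sketch of the general idea, under stated assumptions: each block is described by a hypothetical `(offset, length)` byte range, and ranges that touch or lie within `max_gap` bytes of each other are coalesced into one larger read. The names `Block`, `merge_blocks`, and `max_gap` are illustrative and are not taken from Bi-cluster+.

```python
from typing import List, Tuple

# Assumption: a logical data block is identified by (byte offset, length).
Block = Tuple[int, int]


def merge_blocks(blocks: List[Block], max_gap: int = 0) -> List[Block]:
    """Coalesce block reads whose byte ranges touch, overlap,
    or are separated by at most max_gap bytes."""
    merged: List[Block] = []
    for offset, length in sorted(blocks):
        if merged:
            last_offset, last_length = merged[-1]
            last_end = last_offset + last_length
            if offset <= last_end + max_gap:
                # Extend the previous read so it also covers this block.
                new_end = max(last_end, offset + length)
                merged[-1] = (last_offset, new_end - last_offset)
                continue
        merged.append((offset, length))
    return merged


# Four block requests collapse into two contiguous reads.
requests = [(4096, 1024), (0, 1024), (1024, 1024), (5120, 512)]
print(merge_blocks(requests))  # [(0, 2048), (4096, 1536)]
```

Issuing fewer, larger reads generally suits parallel file systems better than many small ones. A nonzero `max_gap` trades some wasted bytes for fewer requests.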



Acknowledgements

This work was supported by the National Key R&D Program of China (2018YFB0204303), the National Natural Science Foundation of China (No. 61872392 and U1811461), the Guangdong Natural Science Foundation (2018B030312002), the Program for Guangdong Introducing Innovative and Entrepreneurial Teams under Grant (No. 2016ZT06D211), the Major Program of Guangdong Basic and Applied Research (2019B030302002), the NSF of Hunan (No. 2019JJ40339), and the NSF of NUDT (No. ZK18-03-01).

Author information

Corresponding author

Correspondence to Xia Liao.

About this article

Cite this article

Liao, X., Shen, Y., Li, S. et al. Optimizing data query performance of Bi-cluster for large-scale scientific data in supercomputers. J Supercomput 78, 2417–2441 (2022). https://doi.org/10.1007/s11227-021-03965-4

  • DOI: https://doi.org/10.1007/s11227-021-03965-4
