Abstract
Scientific exploration and discovery rely heavily on ever-growing datasets and powerful supercomputers. The surge in data poses major data management challenges for existing data query frameworks. Although many indexing techniques have been developed to locate selected data records quickly, the time and space required to build and store the indexes are often prohibitive. To address the problem of data location in a parallel file system that manages large-scale scientific data, we propose an improved high-performance data query framework called “Bi-cluster+.” For index generation, we design a hierarchical index data structure that balances index granularity against index construction overhead. We design a write load-balancing strategy based on the characteristics of the index offset, and the hierarchical index is written independently and in parallel. In situ index generation is optimized through resource constraint analysis. For data retrieval, we propose optimization techniques that improve query performance, such as a strategy for merging and reading logical data blocks. In experiments with multiple scientific datasets on a supercomputer, our optimizations improve data query performance by up to a factor of 1.9 compared with the original Bi-cluster implementation. Bi-cluster+ also scales well, maintaining good performance in an evaluation on 17,496 cores.
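The abstract does not specify how logical data blocks are merged before reading. As a minimal sketch of the general idea only, and not the Bi-cluster+ implementation, the Python snippet below coalesces the byte ranges of selected blocks into fewer, larger contiguous reads. The `(offset, length)` block representation, the function name `merge_blocks`, and the `max_gap` tolerance are all assumptions introduced for illustration.

```python
from typing import List, Tuple

# Hypothetical representation: a logical block is (offset, length) in bytes.
Block = Tuple[int, int]


def merge_blocks(blocks: List[Block], max_gap: int = 0) -> List[Block]:
    """Coalesce blocks separated by at most max_gap bytes into single reads.

    max_gap is an assumed tuning knob: reading a few unneeded bytes can be
    cheaper than issuing a separate I/O request.
    """
    merged: List[Block] = []
    for offset, length in sorted(blocks):
        if merged:
            prev_offset, prev_length = merged[-1]
            prev_end = prev_offset + prev_length
            if offset - prev_end <= max_gap:
                # Extend the previous read so it also covers this block.
                new_end = max(prev_end, offset + length)
                merged[-1] = (prev_offset, new_end - prev_offset)
                continue
        merged.append((offset, length))
    return merged
```

With `max_gap=0` only adjacent or overlapping blocks are combined; a larger value trades some extra bytes read for fewer I/O requests, which is the usual motivation for this kind of merging on a parallel file system.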
Acknowledgements
This work was supported by the National Key R&D Program of China (2018YFB0204303), the National Natural Science Foundation of China (Nos. 61872392 and U1811461), the Guangdong Natural Science Foundation (2018B030312002), the Program for Guangdong Introducing Innovative and Entrepreneurial Teams (No. 2016ZT06D211), the Major Program of Guangdong Basic and Applied Research (2019B030302002), the NSF of Hunan (No. 2019JJ40339), and the NSF of NUDT (No. ZK18-03-01).
Cite this article
Liao, X., Shen, Y., Li, S. et al. Optimizing data query performance of Bi-cluster for large-scale scientific data in supercomputers. J Supercomput 78, 2417–2441 (2022). https://doi.org/10.1007/s11227-021-03965-4
DOI: https://doi.org/10.1007/s11227-021-03965-4