Optimizing data query performance of Bi-cluster for large-scale scientific data in supercomputers

Liao, Xia; Shen, Yixian; Li, Shengguo; Lu, Yutong; Du, Yufei; Chen, Zhiguang

doi:10.1007/s11227-021-03965-4

Published: 29 June 2021

Volume 78, pages 2417–2441, (2022)

The Journal of Supercomputing

Xia Liao¹ (ORCID: orcid.org/0000-0001-6823-758X), Yixian Shen², Shengguo Li¹, Yutong Lu², Yufei Du² & Zhiguang Chen²

Abstract

Scientific exploration and discovery rely heavily on ever-growing datasets and powerful supercomputers. This surge in data poses major data management challenges for existing data query frameworks. Although many data management techniques have been developed to locate selected data records quickly, the time and space required to build and store their indexes are often too expensive. To address the problem of data location in a parallel file system that manages large-scale scientific data, we propose an improved high-performance data query framework called “Bi-cluster+.” For index generation, we design a hierarchical index data structure that effectively balances index granularity against index construction overhead. A write load-balancing strategy is designed according to the characteristics of the index offset, and the hierarchical index is written independently and in parallel. In situ index generation is optimized through resource constraint analysis. For data retrieval, we propose optimization techniques that improve query performance, such as a strategy for merging and reading logical data blocks. In experiments with multiple scientific datasets on a supercomputer, our optimizations improve data query performance by up to a factor of 1.9 compared with the original Bi-cluster implementation. Bi-cluster+ also maintains good scalability when evaluated on 17,496 cores.
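
The abstract does not give implementation details, so the following is only a minimal illustrative sketch of the general idea behind a block-level index combined with merged reads of logical data blocks. It is not the Bi-cluster+ implementation. The min/max block summary, the names `BlockSummary`, `build_block_index`, `candidate_blocks`, `merge_adjacent`, and `range_query`, and the `max_gap` parameter are all assumptions introduced for illustration, and an in-memory list stands in for a file on a parallel file system.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class BlockSummary:
    """Min/max summary of one logical data block (hypothetical layout)."""
    block_id: int
    lo: float
    hi: float


def build_block_index(values: List[float], block_size: int) -> List[BlockSummary]:
    """Build one min/max entry per fixed-size block of `values`."""
    index = []
    for start in range(0, len(values), block_size):
        chunk = values[start:start + block_size]
        index.append(BlockSummary(start // block_size, min(chunk), max(chunk)))
    return index


def candidate_blocks(index: List[BlockSummary], q_lo: float, q_hi: float) -> List[int]:
    """Return ids of blocks whose [lo, hi] range overlaps the query range."""
    return [b.block_id for b in index if b.hi >= q_lo and b.lo <= q_hi]


def merge_adjacent(block_ids: List[int], max_gap: int = 0) -> List[Tuple[int, int]]:
    """Coalesce block ids into inclusive (first, last) read ranges.

    Blocks separated by at most `max_gap` unneeded blocks are merged into
    one range, trading some extra bytes read for fewer read requests.
    """
    ranges: List[Tuple[int, int]] = []
    for bid in sorted(block_ids):
        if ranges and bid - ranges[-1][1] <= max_gap + 1:
            ranges[-1] = (ranges[-1][0], bid)
        else:
            ranges.append((bid, bid))
    return ranges


def range_query(values: List[float], index: List[BlockSummary],
                block_size: int, q_lo: float, q_hi: float) -> List[int]:
    """Return positions of all values in [q_lo, q_hi], reading merged ranges."""
    hits: List[int] = []
    for first, last in merge_adjacent(candidate_blocks(index, q_lo, q_hi)):
        start = first * block_size
        stop = (last + 1) * block_size
        segment = values[start:stop]  # stands in for one contiguous read
        hits.extend(start + i for i, v in enumerate(segment) if q_lo <= v <= q_hi)
    return hits


data = [1, 2, 3, 4, 10, 11, 12, 13, 5, 6, 7, 8, 20, 21, 22, 23]
idx = build_block_index(data, block_size=4)
print(candidate_blocks(idx, 3, 6))
print(range_query(data, idx, 4, 3, 6))
```

In this sketch, a larger `max_gap` produces fewer but longer reads. Whether that trade-off pays off depends on the storage system, and the abstract does not state how Bi-cluster+ chooses its merge policy.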

Acknowledgements

This work was supported by the National Key R&D Program of China (2018YFB0204303), the National Natural Science Foundation of China (Nos. 61872392 and U1811461), the Guangdong Natural Science Foundation (2018B030312002), the Program for Guangdong Introducing Innovative and Entrepreneurial Teams (Grant No. 2016ZT06D211), the Major Program of Guangdong Basic and Applied Research (2019B030302002), the NSF of Hunan (No. 2019JJ40339), and the NSF of NUDT (No. ZK18-03-01).

Author information

Authors and Affiliations

College of Computer Science, National University of Defense Technology, Changsha, 410073, China
Xia Liao & Shengguo Li
National Supercomputer Center in Guangzhou, and the School of Data and Computer Science, Sun Yat-sen University, Guangzhou, 510006, China
Yixian Shen, Yutong Lu, Yufei Du & Zhiguang Chen

Corresponding author

Correspondence to Xia Liao.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Liao, X., Shen, Y., Li, S. et al. Optimizing data query performance of Bi-cluster for large-scale scientific data in supercomputers. J Supercomput 78, 2417–2441 (2022). https://doi.org/10.1007/s11227-021-03965-4

Accepted: 18 June 2021
Published: 29 June 2021
Issue Date: February 2022
DOI: https://doi.org/10.1007/s11227-021-03965-4
