Abstract
This paper presents an effective method for metadata rebalancing in exascale distributed file systems. Exponential data growth demands adaptive and robust distributed file systems, which are typically built from a large cluster of metadata servers and data servers. Although each metadata server may initially hold an equal share of the entire metadata set, the placement of metadata across servers eventually becomes globally imbalanced, and the imbalance worsens over time. To keep disproportionate metadata placement from degrading the intrinsic performance of the metadata server cluster, its balanced performance must be restored periodically. This is difficult, however, because rebalancing seriously hampers the normal operation of the file system, and the problem is aggravated at exascale by an ever-present heavy workload and frequent failures of server components. A primary cause of the degradation is that file system clients frequently fail to look up metadata from the metadata server cluster while rebalancing is in progress, so metadata operations cannot proceed at their normal speed. We propose a metadata rebalance model that minimizes failures of metadata operations during the rebalance period and validate it through a cost analysis. The results demonstrate that our model makes online metadata rebalance feasible without obstructing normal operation and increases the chances of maintaining balance in a very large cluster of metadata servers.
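The failure mode the abstract describes, namely clients failing lookups while metadata entries are in flight between servers, can be made concrete with a small sketch. The Python fragment below is our own illustration under stated assumptions, not the paper's model: the class name `MetadataCluster`, the hash-based placement, and the migrated-set marker are hypothetical constructs introduced here. It shows how consulting both the pre- and post-rebalance layouts lets a lookup succeed whether or not a given entry has migrated yet.

```python
import hashlib

class MetadataCluster:
    """Hypothetical sketch: metadata placement by hashing paths onto a
    set of metadata servers (MDSs), with a rebalance window in which
    both the old and new layouts are consulted so client lookups do
    not fail while entries are being migrated."""

    def __init__(self, servers):
        self.old_layout = list(servers)   # layout before rebalance
        self.new_layout = list(servers)   # layout after rebalance
        self.migrated = set()             # paths already moved

    def _place(self, path, layout):
        # Deterministic hash placement of a path onto one server.
        h = int(hashlib.md5(path.encode()).hexdigest(), 16)
        return layout[h % len(layout)]

    def start_rebalance(self, new_servers):
        self.new_layout = list(new_servers)
        self.migrated.clear()

    def migrate(self, path):
        # Background task: move one entry, then record it as migrated.
        self.migrated.add(path)

    def lookup(self, path):
        # A naive client consults only one layout during rebalance and
        # misses entries still in flight; checking the migration marker
        # first avoids the failed lookups the paper identifies as the
        # main source of degraded metadata performance.
        if path in self.migrated:
            return self._place(path, self.new_layout)
        return self._place(path, self.old_layout)

    def finish_rebalance(self):
        self.old_layout = self.new_layout
        self.migrated.clear()

# Usage: lookups keep resolving while entries migrate one by one.
cluster = MetadataCluster(["mds0", "mds1"])
cluster.start_rebalance(["mds0", "mds1", "mds2"])
print(cluster.lookup("/home/alice"))   # served from the old layout
cluster.migrate("/home/alice")
print(cluster.lookup("/home/alice"))   # now served from the new layout
```

A real system would additionally have to handle concurrent updates and server failures during the rebalance window; the sketch captures only the lookup-forwarding idea that keeps metadata operations proceeding at normal speed.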
Acknowledgments
This work was supported by an Institute for Information and Communications Technology Promotion (IITP) grant funded by the Korea government (MSIP) (No. R0126-15-1082, Management of Developing ICBMS (IoT, Cloud, Bigdata, Mobile, Security) Core Technologies and Development of Exascale Cloud Storage Technology).
Cite this article
Cha, MH., Kim, DO., Kim, HY. et al. Adaptive metadata rebalance in exascale file system. J Supercomput 73, 1337–1359 (2017). https://doi.org/10.1007/s11227-016-1812-x