Enhancing throughput of the Hadoop Distributed File System for interaction-intensive tasks

https://doi.org/10.1016/j.jpdc.2014.03.010

Highlights

  • Analyzed the performance degradation of HDFS caused by interaction-intensive tasks.

  • Designed a two-layer structure to improve the performance of handling I/O requests.

  • Integrated caches to reduce the overhead of accessing interaction-intensive files.

  • Developed a PSO-based storage allocation algorithm to improve the I/O throughput.

  • Designed a set of experiments to evaluate the performance of the proposed methods.

Abstract

The Hadoop Distributed File System (HDFS) is designed to run on commodity hardware and can be used as a stand-alone general-purpose distributed file system (HDFS user guide, 2008). It provides the ability to access bulk data with high I/O throughput and is therefore well suited to applications with large I/O data sets. However, the performance of HDFS decreases dramatically when handling operations on interaction-intensive files, i.e., files that are relatively small but frequently accessed. This paper analyzes the cause of the throughput degradation observed when accessing interaction-intensive files and presents an enhanced HDFS architecture, along with an associated storage allocation algorithm, that overcomes the degradation. Experiments show that with the proposed architecture and storage allocation algorithm, the HDFS throughput for interaction-intensive files increases by 300% on average, with only a negligible performance decrease for large-data-set tasks.

Introduction

The Hadoop Distributed File System (HDFS) is designed as a massive data storage framework and serves as the storage component of the Apache Hadoop platform. The file system runs on commodity hardware and provides highly reliable storage and global access with high throughput for large data sets [15]. Because of these advantages, the HDFS is also used as a stand-alone general-purpose distributed file system serving non-Hadoop applications [6].

However, the high-throughput advantage the HDFS provides diminishes quickly when it handles interaction-intensive files, i.e., files that are small but accessed frequently. The reason is that before an I/O transmission starts, a few initialization steps must be completed, such as data location retrieval and storage space allocation. When transferring large files, this initialization overhead is relatively small and often negligible compared with the data transmission itself; when transferring small files, however, the overhead becomes significant. In addition to the initialization overhead, files with high I/O access frequencies can quickly overburden the regulating component of the HDFS, i.e., the single namenode that supervises and manages every access to the datanodes [1]. If the number of datanodes is large, the single namenode can quickly become a bottleneck when the frequency of I/O requests is high.
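To see why the initialization cost dominates for small files, consider a simple back-of-the-envelope model (the constants below are illustrative assumptions, not measurements from this paper). With a fixed per-request initialization cost t_init and a link bandwidth B, the effective throughput for a file of size S is

    T_eff = S / (t_init + S / B).

For instance, with B = 100 MB/s and t_init = 10 ms, a 1 GB file is transferred at roughly 99.9 MB/s, while a 100 KB file reaches only about 9.1 MB/s, i.e., less than one tenth of the available bandwidth, before the namenode load is even taken into account.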

In many systems, frequent file access is unavoidable. One example is log file updating, a common procedure in many applications. Since the HDFS applies the rule of "Write-Once-Read-Many", the updating procedure first reads these files, modifies them, and then writes them back; such a procedure generates I/O requests at high frequency. Another example is the synchronization procedure in a distributed system that uses incremental message delivery via the file system. In this case, an incremental message file is generated by a changed component of the distributed system; the file is relatively small and is read by all participating components to synchronize the entire system. As the HDFS allows multiple tasks to read the same file simultaneously, a file may be read frequently within a short period of time.
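As a concrete illustration of this read-modify-write cycle, the sketch below uses the standard Hadoop Java client API (org.apache.hadoop.fs.FileSystem) to update a log file by reading it in full, appending an entry in memory, and rewriting the whole file. The file path is hypothetical and this is not code from the paper; it merely shows why every small update incurs a full round of namenode interactions and block transfers.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IOUtils;

    import java.io.ByteArrayOutputStream;
    import java.nio.charset.StandardCharsets;

    public class LogUpdate {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            try (FileSystem fs = FileSystem.get(conf)) {
                Path log = new Path("/logs/app.log"); // hypothetical file path

                // 1. Read the whole file: one namenode lookup plus data transfer.
                ByteArrayOutputStream buf = new ByteArrayOutputStream();
                try (FSDataInputStream in = fs.open(log)) {
                    IOUtils.copyBytes(in, buf, conf, false);
                }

                // 2. Modify the content in memory.
                String updated = buf.toString(StandardCharsets.UTF_8.name())
                        + "new entry\n";

                // 3. Rewrite the entire file: HDFS files are write-once, so even
                //    a one-line update costs another namenode round trip and a
                //    fresh block allocation.
                try (FSDataOutputStream out = fs.create(log, true /* overwrite */)) {
                    out.write(updated.getBytes(StandardCharsets.UTF_8));
                }
            }
        }
    }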

To overcome these issues for interaction-intensive tasks, efforts typically proceed in three directions: (a) improving the data structure or system architecture to provide faster I/O with less overhead [3], [13]; (b) extending the namenode into a hierarchical structure [12], [2], [5] to avoid overloading the single namenode; and (c) designing better storage allocation algorithms to improve data accessibility [7], [8].

In this paper, we use the HDFS as a stand-alone file system and present an integrated approach to addressing the HDFS performance degradation for interaction-intensive tasks. In particular, we extend the HDFS architecture by adding cache support and transforming the single namenode into an extended hierarchical namenode architecture. Based on the extended architecture, we develop a Particle Swarm Optimization (PSO)-based storage allocation algorithm to improve the HDFS throughput for interaction-intensive tasks.

The rest of the paper is organized as follows. Section 2 discusses related work, focusing on the cause of throughput degradation when handling interaction-intensive tasks and on possible solutions developed for the problem. Section 3 presents an enhanced HDFS namenode structure. Section 4 describes the proposed PSO-based storage allocation algorithms to be deployed on the extended structure. Experimental results are presented and analyzed in Section 5. Finally, in Section 6, we conclude the paper with a summary of our findings and point out future work.

Section snippets

Related work

As the use of the HDFS spreads, its pitfalls are also being discovered and studied. Among them is poor performance when the HDFS handles small, frequently accessed files. As Jiang et al. [9] point out, the HDFS is designed for big data transmission rather than for transferring a large number of small files, and hence it is not suitable for interaction-intensive tasks.

Shvachko et al. [15] note that the HDFS sacrifices the immediate response time of individual requests in favor of high overall throughput for large data sets. …

Extended namenode with cache support

In this section, we specify an extended namenode architecture for the HDFS with cache support. The following two subsections describe the structure and its functionality in detail.
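Although the details appear in the full text, the core idea of a hierarchical namenode with rack-level caches can be sketched as follows. This is a minimal illustration under our own assumptions; all class and method names are hypothetical rather than the paper's actual interfaces. A rack-level namenode answers metadata requests from a local cache when it can and forwards misses to the top-level namenode, so repeated accesses to interaction-intensive files no longer burden the single global namenode.

    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    final class RackNamenode {
        // Rack-local cache of file-to-datanode mappings.
        private final Map<String, String> locationCache = new ConcurrentHashMap<>();
        private final TopNamenode top; // upper layer of the hierarchy

        RackNamenode(TopNamenode top) { this.top = top; }

        /** Resolve a file to a datanode address, consulting the rack cache first. */
        String locate(String file) {
            // Cache hit: answered locally, no load on the single top namenode.
            // Cache miss: forwarded once to the top layer, then cached.
            return locationCache.computeIfAbsent(file, top::locate);
        }
    }

    final class TopNamenode {
        /** Authoritative (and potentially overloaded) metadata lookup. */
        String locate(String file) {
            // ... global namespace lookup, as in the stock HDFS namenode ...
            return "datanode-42:50010"; // placeholder result
        }
    }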

Storage allocation algorithm

As presented in Section 3, changing the single namenode into a hierarchical namenode structure increases the HDFS's capability of handling frequent requests. In addition, the caches introduced in this structure provide faster data access with shorter response times. Under this new structure, the throughput degradation caused by access to interaction-intensive files can be further reduced by applying an optimized storage allocation strategy. The development of this strategy is …
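For readers unfamiliar with PSO, the following generic loop shows the kind of search such an allocator builds on. The quadratic fitness function and all parameters here are placeholder assumptions, not the paper's; in the paper's setting, a particle would encode a candidate storage allocation plan and the fitness would score its expected I/O performance.

    import java.util.Random;

    public class PsoSketch {
        static final int PARTICLES = 30, DIMS = 10, ITERATIONS = 100;
        static final double W = 0.72, C1 = 1.49, C2 = 1.49; // common PSO constants

        // Placeholder fitness (lower is better). The paper's fitness would
        // evaluate an allocation plan instead of this toy quadratic.
        static double fitness(double[] x) {
            double s = 0;
            for (double v : x) s += v * v;
            return s;
        }

        public static void main(String[] args) {
            Random rnd = new Random(1);
            double[][] pos = new double[PARTICLES][DIMS];
            double[][] vel = new double[PARTICLES][DIMS];
            double[][] pBest = new double[PARTICLES][DIMS];
            double[] pBestFit = new double[PARTICLES];
            double[] gBest = new double[DIMS];
            double gBestFit = Double.MAX_VALUE;

            // Initialize particles at random positions with zero velocity.
            for (int i = 0; i < PARTICLES; i++) {
                for (int d = 0; d < DIMS; d++) pos[i][d] = rnd.nextDouble() * 2 - 1;
                pBest[i] = pos[i].clone();
                pBestFit[i] = fitness(pos[i]);
                if (pBestFit[i] < gBestFit) { gBestFit = pBestFit[i]; gBest = pos[i].clone(); }
            }

            // Standard velocity/position update toward personal and global bests.
            for (int it = 0; it < ITERATIONS; it++) {
                for (int i = 0; i < PARTICLES; i++) {
                    for (int d = 0; d < DIMS; d++) {
                        vel[i][d] = W * vel[i][d]
                                + C1 * rnd.nextDouble() * (pBest[i][d] - pos[i][d])
                                + C2 * rnd.nextDouble() * (gBest[d] - pos[i][d]);
                        pos[i][d] += vel[i][d];
                    }
                    double f = fitness(pos[i]);
                    if (f < pBestFit[i]) { pBestFit[i] = f; pBest[i] = pos[i].clone(); }
                    if (f < gBestFit)    { gBestFit = f;    gBest = pos[i].clone(); }
                }
            }
            System.out.printf("best fitness after %d iterations: %.6f%n", ITERATIONS, gBestFit);
        }
    }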

Experiment specifications and result analysis

In this section, we empirically show that the modifications made to the original HDFS (1) delay the point at which the namenode becomes overloaded, and (2) increase the system throughput for interaction-intensive tasks. The test-bed consists of 130 workstations, each with a 2 GHz CPU, 4 GB RAM, a 5400 rpm HDD, and a 1000 Mbps network connection. The routers used are two Quidway S9306 routers with a 1152 Mpps packet forwarding rate and 6 Tbps backplane bandwidth. The test applications of the experiments …

Conclusion

This paper has presented an enhanced HDFS in which the performance of handling interaction-intensive tasks is significantly improved. The modifications to the HDFS are: (1) changing the single namenode structure into an extended namenode structure; (2) deploying caches on each rack to improve the I/O performance of accessing interaction-intensive files; and (3) using PSO-based algorithms to find a near-optimal storage allocation plan for incoming files.

Structurally, only small changes were made …

Acknowledgments

The authors wish to thank Professor Bin Su, the dean of the Software Engineering Department of Southwest Jiaotong University, China, for his help in providing the infrastructure for the experiments. The authors also wish to thank Professor Bin Su and his two graduate students, Zuowe Si and Xiao Wang, for their help in implementing and testing the code, deploying the system, and obtaining the experimental data.

“This research made use of Montage, funded by the National Aeronautics and Space Administration’s Earth …

References (18)

  • D. Borthakur, The Hadoop Distributed File System: architecture and design, Hadoop Project Website,...
  • D. Borthakur, HDFS architecture guide, Hadoop Apache Project, 2008....
  • S. Chandrasekar et al.

    A novel indexing scheme for efficient handling of small files in the Hadoop distributed file system

  • E. Deelman et al.

    The cost of doing science on the cloud: the Montage example

  • D. Fesehaye et al.

    EDFS: a semi-centralized efficient distributed file system

  • HDFS user guide. [Online], 2008. Available:...
  • H. Hsiao, H. Chung, H. Shen, Y. Chao, Load rebalancing for distributed file systems in clouds,...
  • A. Indrayanto et al.

    Application of game theory and fictitious play in data placement

  • L. Jiang et al.

    The optimization of HDFS based on small files

There are more references available in the full text version of this article.

Xiayu Hua is a Ph.D. student in the Computer Science Department at Illinois Institute of Technology. His research interest is in distributed file system, virtualization technology, real-time scheduling and cloud computing. He earned his B.S. degree from the Northwestern Polytechnic University, China, in 2008 and his M.S. degree from the East China Normal University, China, in 2012.

Hao Wu is now a Ph.D. student in Computer Science Department at Illinois Institute of Technology. He received B.E. of Information Security from Sichuan University, Chengdu, China, 2007. He received M.S. of Computer Science from University of Bridgeport, Bridgeport, CT, 2009. His current research interests mainly focus on Resource Management in Cloud Computing.

Zheng Li received the B.S. degree in Computer Science and M.S. degree in Communication and Information System from University of Electronic Science and Technology of China, in 2005 and 2008, respectively. He is currently a Ph.D. candidate in the Department of Computer Science at the Illinois Institute of Technology. His research interests include real-time embedded and distributed systems.

Dr. Shangping Ren is an associate professor in Computer Science Department at the Illinois Institute of Technology. She earned her Ph.D. from UIUC in 1997. Before she joined IIT in 2003, she worked in software and telecommunication companies as software engineer and then lead software engineer. Her current research interests include coordination models for real-time distributed open systems, real-time, fault-tolerant and adaptive systems, Cyber-Physical System, parallel and distributed systems, cloud computing, and application-aware many-core virtualization for embedded and real-time applications.
