OSTI.GOV, U.S. Department of Energy
Office of Scientific and Technical Information

Title: Moving small files in a networked environment

Journal Article · Future Generations Computer Systems

Globally distributed computing infrastructures, such as clouds and supercomputers, are currently used to manage data that is generated at unprecedented speed from a variety of resources. As a result, the volume of data exchanged between distant sites is increasing substantially. To accelerate data transfer, high-speed networks are provided to connect remote sites. Most existing data movement solutions are optimized for moving large files, and transferring a large number of small files across networks remains challenging. This weakness not only lowers data transfer performance but also decreases overall system utilization. Here, we identify that moving small files is constrained mainly by degraded file system throughput, not just by network performance as might be suspected. We have built a data transfer pipeline model to analyze the impact of small network I/O and storage I/O on data movement. Extending GridFTP, a widely used open source data movement solution, we demonstrate several engineering approaches that mitigate the bottleneck and increase data transfer efficiency. We show optimizations that improve data transfer performance by more than a factor of 5. Compared with existing solutions, our approaches can save a significant amount of system resources when moving many small files.

Research Organization:
Argonne National Laboratory (ANL), Argonne, IL (United States)
Sponsoring Organization:
USDOE
Grant/Contract Number:
AC02-06CH11357
OSTI ID:
2320227
Journal Information:
Future Generations Computer Systems, Vol. 139; ISSN 0167-739X
Publisher:
Elsevier
Country of Publication:
United States
Language:
English
