DOI: 10.1145/2792745.2792777

Research article

Storage utilization in the long tail of science

Published: 26 July 2015

Abstract

The expansion of computation into non-traditional domain sciences has driven growing demand for research cyberinfrastructure suited to small- and mid-scale jobs. The computational needs of these emerging communities are coming into focus and are being addressed through several new XSEDE resources that feature easy on-ramps, customizable software environments through virtualization, and interconnects optimized for jobs that use only hundreds or thousands of cores; the data storage requirements of these communities, however, remain much less well characterized.
To this end, we examined the distribution of file sizes on two of the Lustre file systems within the Data Oasis storage system at the San Diego Supercomputer Center (SDSC). We found a very strong preference for small files among SDSC's users: 90% of all files are less than 2 MB in size, and 50% of all file system capacity is consumed by files under 2 GB. These distributions are consistent across both the scratch and projects file systems. Because parallel file systems such as Lustre and GPFS are optimized for parallel I/O to large, wide-striped files, these findings suggest that parallel file systems may not be the most suitable storage solution when designing cyberinfrastructure for the needs of emerging communities.


Cited By

  • (2019)A Quantitative Approach to Architecting All-Flash Lustre File SystemsHigh Performance Computing10.1007/978-3-030-34356-9_16(183-197)Online publication date: 16-Jun-2019
  • (2017)Minimal Coflow Routing and Scheduling in OpenFlow-Based Cloud Storage Area Networks2017 IEEE 10th International Conference on Cloud Computing (CLOUD)10.1109/CLOUD.2017.36(222-229)Online publication date: Jun-2017
  • (2016)BIC-LSUProceedings of the XSEDE16 Conference on Diversity, Big Data, and Science at Scale10.1145/2949550.2949556(1-8)Online publication date: 17-Jul-2016


Published In

cover image ACM Other conferences
XSEDE '15: Proceedings of the 2015 XSEDE Conference: Scientific Advancements Enabled by Enhanced Cyberinfrastructure
July 2015
296 pages
ISBN:9781450337205
DOI:10.1145/2792745
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Sponsors

  • San Diego Supercomputer Center
  • HPCWire
  • Omnibond Systems, LLC
  • SGI
  • Internet2
  • Indiana University
  • CASC: The Coalition for Academic Scientific Computation
  • NICS: National Institute for Computational Sciences
  • Intel
  • DDN: DataDirect Networks, Inc.
  • Dell
  • CORSA Technology
  • Allinea Software
  • Cray
  • RENCI: Renaissance Computing Institute

Publisher

Association for Computing Machinery

New York, NY, United States


Author Tags

  1. lustre
  2. parallel file systems
  3. storage
  4. utilization

Qualifiers

  • Research-article

Funding Sources

  • NSF

Conference

XSEDE '15

Acceptance Rates

XSEDE '15 Paper Acceptance Rate 49 of 70 submissions, 70%;
Overall Acceptance Rate 129 of 190 submissions, 68%
