DOI: 10.1145/2792745.2792776
research-article

Using data science to understand tape-based archive workloads

Published: 26 July 2015

Abstract

Data storage needs continue to grow in most fields, and the cost per byte for tape remains lower than the cost for disk, making tape storage a good candidate for cost-effective long-term storage. However, the workloads suitable for tape archives differ from those for disk file systems, and archives must handle internally generated workloads that can be more demanding than those generated by end users (e.g., migration of data from an old tape technology to a new one). To better understand the variegated workloads, we have followed the first steps in the data science methodology. For anyone considering the use or deployment of a tape-based data archive or for anyone interested in details of data archives in the context of data science, this paper describes key aspects of data archive workloads.

Cited By

  • (2019) Understanding Data Motion in the Modern HPC Data Center. 2019 IEEE/ACM Fourth International Parallel Data Systems Workshop (PDSW), pp. 74-83. DOI: 10.1109/PDSW49588.2019.00012. Online publication date: Nov-2019
  • (2019) ExaPlan Archive: Data Placement and Provisioning for Large Storage Systems with Archival Tiers. 2019 IEEE 27th International Symposium on Modeling, Analysis, and Simulation of Computer and Telecommunication Systems (MASCOTS), pp. 125-137. DOI: 10.1109/MASCOTS.2019.00023. Online publication date: Oct-2019

    Published In

    XSEDE '15: Proceedings of the 2015 XSEDE Conference: Scientific Advancements Enabled by Enhanced Cyberinfrastructure
    July 2015
    296 pages
    ISBN:9781450337205
    DOI:10.1145/2792745

    Sponsors

    • San Diego Super Computing Ctr
    • HPCWire
    • Omnibond: Omnibond Systems, LLC
    • SGI
    • Internet2
    • Indiana University
    • CASC: The Coalition for Academic Scientific Computation
    • NICS: National Institute for Computational Sciences
    • Intel
    • DDN: DataDirect Networks, Inc
    • DELL
    • CORSA: CORSA Technology
    • ALLINEA: Allinea Software
    • Cray
    • RENCI: Renaissance Computing Institute

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. analysis
    2. archive
    3. data science
    4. metrics

    Qualifiers

    • Research-article

    Acceptance Rates

    XSEDE '15 Paper Acceptance Rate 49 of 70 submissions, 70%;
    Overall Acceptance Rate 129 of 190 submissions, 68%
