
Analysis of Workload Behavior in Scientific and Historical Long-Term Data Repositories

Published: 01 May 2012

Abstract

The scope of archival systems is expanding beyond cheap tertiary storage: scientific and medical data is increasingly digital, and the public has a growing desire to digitally record their personal histories. Driven by the increase in cost efficiency of hard drives, and the rise of the Internet, content archives have become a means of providing the public with fast, cheap access to long-term data. Unfortunately, designers of purpose-built archival systems are either forced to rely on workload behavior obtained from a narrow, anachronistic view of archives as simply cheap tertiary storage, or extrapolate from marginally related enterprise workload data and traditional library access patterns.

To close this knowledge gap and provide relevant input for the design of effective long-term data storage systems, we studied the workload behavior of several systems within this expanded archival storage space. Our study examined several scientific and historical archives, covering a mixture of purposes, media types, and access models (that is, public versus private). Our findings show that, for more traditional private scientific archival storage, files have become larger, but update rates have remained largely unchanged. However, in the public content archives we observed, we saw behavior that diverges from the traditional “write-once, read-maybe” behavior of tertiary storage. Our study shows that the majority of such data is modified, sometimes unnecessarily, relatively frequently, and that indexing services such as Google and internal data management processes may routinely access large portions of an archive, accounting for most of the accesses. Based on these observations, we identify areas for improving the efficiency and performance of archival storage systems.


Published in

ACM Transactions on Storage, Volume 8, Issue 2
May 2012
89 pages
ISSN: 1553-3077
EISSN: 1553-3093
DOI: 10.1145/2180905

Copyright © 2012 ACM

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery
New York, NY, United States

Publication History

• Published: 1 May 2012
• Accepted: 1 October 2011
• Revised: 1 August 2011
• Received: 1 May 2011


Qualifiers

• research-article
• Research
• Refereed
