Abstract
The scope of archival systems is expanding beyond cheap tertiary storage: scientific and medical data is increasingly digital, and the public has a growing desire to digitally record their personal histories. Driven by the increasing cost efficiency of hard drives and the rise of the Internet, content archives have become a means of providing the public with fast, cheap access to long-term data. Unfortunately, designers of purpose-built archival systems are forced either to rely on workload behavior obtained from a narrow, anachronistic view of archives as simply cheap tertiary storage, or to extrapolate from marginally related enterprise workload data and traditional library access patterns.
To close this knowledge gap and provide relevant input for the design of effective long-term data storage systems, we studied the workload behavior of several systems within this expanded archival storage space. Our study examined several scientific and historical archives, covering a mixture of purposes, media types, and access models (public versus private). Our findings show that, for more traditional private scientific archival storage, files have become larger, but update rates have remained largely unchanged. In the public content archives we observed, however, behavior diverges from the traditional “write-once, read-maybe” behavior of tertiary storage. Our study shows that the majority of such data is modified relatively frequently, sometimes unnecessarily, and that indexing services such as Google and internal data management processes may routinely access large portions of an archive, accounting for most of the accesses. Based on these observations, we identify areas for improving the efficiency and performance of archival storage systems.