ABSTRACT
Identifying groups of blocks that tend to be read or written together in a given environment is the first step towards powerful techniques for device failure isolation and power management. For example, identified groups can be placed together on a single disk, avoiding excess drive activity across an exascale storage system. Unlike previous grouping work, we focus on identifying groupings in data that can be gathered from real, running systems with minimal impact. Using temporal, spatial, and access ordering information from an enterprise data set, we identify a set of groupings that appear consistently, indicating that they are working sets likely to be accessed together. We present several techniques for obtaining groupings, along with a discussion of which techniques best apply to particular types of real systems. We intend to use these preliminary results to inform our search for new types of workloads. Our goals are to identify the properties of easily separable workloads across different systems and to dynamically move groups in these workloads to reduce disk activity in large storage systems.
Index Terms
- Efficiently identifying working sets in block I/O streams