Abstract
The analysis of data usage in a large set of real traces from a high-energy physics collaboration revealed the existence of an emergent grouping of files that we call “filecules”. This paper presents the benefits of using this file grouping for prestaging data and compares it with previously proposed file grouping techniques along a range of performance metrics. Our experiments with real workloads demonstrate that filecule grouping is a reliable and useful abstraction for data management in science Grids; that preserving time locality for data prestaging is highly recommended; that job reordering with respect to data availability has a significant impact on throughput; and finally, that a relatively short history of traces is a good predictor for filecule grouping. Our experimental results provide lessons for workload modeling and suggest design guidelines for data management in data-intensive resource-sharing environments.
Cite this article
Iamnitchi, A., Doraimani, S. & Garzoglio, G. Workload characterization in a high-energy data grid and impact on resource management. Cluster Comput 12, 153–173 (2009). https://doi.org/10.1007/s10586-009-0081-3
DOI: https://doi.org/10.1007/s10586-009-0081-3