Workload characterization in a high-energy data grid and impact on resource management

Cluster Computing


The analysis of data usage in a large set of real traces from a high-energy physics collaboration revealed the existence of an emergent grouping of files that we coined “filecules”. This paper presents the benefits of using this file grouping for prestaging data and compares it with previously proposed file grouping techniques along a range of performance metrics. Our experiments with real workloads demonstrate that filecule grouping is a reliable and useful abstraction for data management in science Grids; that preserving time locality for data prestaging is highly recommended; that job reordering with respect to data availability has significant impact on throughput; and finally, that a relatively short history of traces is a good predictor for filecule grouping. Our experimental results provide lessons for workload modeling and suggest design guidelines for data management in data-intensive resource-sharing environments.



Author information


Corresponding author

Correspondence to Adriana Iamnitchi.


About this article

Cite this article

Iamnitchi, A., Doraimani, S. & Garzoglio, G. Workload characterization in a high-energy data grid and impact on resource management. Cluster Comput 12, 153–173 (2009).

