File Caching in Data Intensive Scientific Applications on Data-Grids

  • Conference paper
Data Management in Grids (DMG 2005)

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 3836)

Abstract

We present theoretical and experimental results on an important caching problem that arises frequently in data intensive scientific applications run on data-grids. Such applications often need to process several files simultaneously, i.e., the application runs only if all the files it needs are present in some disk cache accessible to its compute resource. The set of files requested by an application, all of which must be in cache for the application to run, is called a file-bundle. This requirement introduces the need for cache replacement algorithms that are based on file-bundles rather than individual files. We show that traditional caching algorithms such as Least Recently Used (LRU) and GreedyDual-Size (GDS) are not optimal in this case, since they are not sensitive to file-bundles and may hold in the cache combinations of files that do not form any complete bundle. We propose and analyze a new cache replacement algorithm specifically adapted to deal with file-bundles. Experimental studies of the new algorithm, using a disk cache simulation model under a wide range of conditions (file request distribution, relative cache size, file size distribution, and incoming job queue size), show significant improvement over traditional caching algorithms such as GDS.
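The file-bundle requirement can be illustrated with a minimal sketch. It is not the algorithm proposed in the paper; it only shows, under simple assumptions, how a per-file LRU cache can end up holding files that do not form a complete bundle. The names `can_run` and `simulate_lru`, the unit file sizes, and the sample job sequence are all hypothetical and chosen for illustration.

```python
from collections import OrderedDict


def can_run(bundle, cache):
    """A job can run only if every file in its bundle is in the cache."""
    return all(f in cache for f in bundle)


def simulate_lru(jobs, sizes, capacity):
    """Serve file-bundle requests with a plain per-file LRU cache.

    Returns the number of jobs whose whole bundle was already cached
    on arrival. Assumes (for simplicity) that every single bundle
    fits in the cache on its own.
    """
    cache = OrderedDict()  # file name -> size, least recently used first
    used = 0
    hits = 0
    for bundle in jobs:
        if can_run(bundle, cache):
            hits += 1
        for f in bundle:
            if f in cache:
                cache.move_to_end(f)  # mark as most recently used
                continue
            # Evict least recently used files that the current job
            # does not need, until the missing file fits.
            while used + sizes[f] > capacity:
                victim = next(k for k in cache if k not in bundle)
                used -= cache.pop(victim)
            cache[f] = sizes[f]
            used += sizes[f]
    return hits


# Hypothetical workload: four unit-size files, room for three of them.
sizes = {"a": 1, "b": 1, "c": 1, "d": 1}
jobs = [("a", "b"), ("c", "d"), ("a", "b")]
print(simulate_lru(jobs, sizes, capacity=3))  # prints 0
```

In this run, after the second job the cache holds `b`, `c`, and `d`. The file `b` is useless on its own because `a` was evicted, so the third job misses even though half of its bundle is cached. A bundle-aware policy would weigh files by the bundles they complete rather than by individual recency.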

Copyright information

© 2006 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Otoo, E., Rotem, D., Romosan, A., Seshadri, S. (2006). File Caching in Data Intensive Scientific Applications on Data-Grids. In: Pierson, J.-M. (ed.) Data Management in Grids. DMG 2005. Lecture Notes in Computer Science, vol 3836. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11611950_8

  • DOI: https://doi.org/10.1007/11611950_8

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-31212-3

  • Online ISBN: 978-3-540-32452-2

  • eBook Packages: Computer Science, Computer Science (R0)
