DOI: 10.1145/1183401.1183435
Article

Coupling prefix caching and collective downloads for remote dataset access

Published: 28 June 2006

Abstract

Scientific datasets are typically archived at mass storage systems or data centers close to supercomputers/instruments. End-users of these datasets, however, usually perform parts of their workflows at their local computers. In such cases, client-side caching can offer significant gains by reducing the cost of wide-area data movement. Scientific data caches, however, traditionally cache entire datasets, which may not be necessary. In this paper, we propose a novel combination of prefix caching and collective download. Prefix caching allows the bootstrapping of dataset downloads by caching only a prefix of the dataset, while collective download facilitates efficient parallel patching of the missing suffix from an external data source. To estimate the optimal prefix size, we further present an analytical model that considers both the initial download overhead and the downloading speed. We implemented our proposed approach in the FreeLoader distributed cache prototype. Experimental results (using multiple scientific data repositories and data transfer tools, as well as a real-world scientific dataset access trace) demonstrate that prefix caching and collective download can be implemented efficiently, that our model can select an appropriate prefix size, and that the cache hit rate can be improved significantly without hurting the local access rate of cached datasets.
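
The abstract does not give the analytical model itself, so the following is only an illustrative sketch of how a prefix size could depend on the initial download overhead and the downloading speed. It assumes a simple scenario that is not taken from the paper: a client reads the dataset sequentially at a constant local cache rate, and the missing suffix starts arriving at a constant remote rate after a fixed startup delay. The function name and all parameter names are hypothetical.

```python
def min_prefix_size(dataset_size, local_rate, remote_rate, startup_overhead):
    """Smallest prefix that lets a sequential reader finish without stalling.

    Assumed model (not the paper's): the reader consumes data at
    `local_rate`; the suffix begins arriving `startup_overhead` seconds
    after the request and then streams in at `remote_rate`.
    Sizes and rates must use consistent units (e.g. MB and MB/s).
    """
    if dataset_size <= 0 or local_rate <= 0 or remote_rate <= 0:
        raise ValueError("sizes and rates must be positive")
    if startup_overhead < 0:
        raise ValueError("startup_overhead must be non-negative")

    if remote_rate >= local_rate:
        # The remote source keeps up once it starts, so the prefix only
        # has to cover the startup delay.
        prefix = local_rate * startup_overhead
    else:
        # The remote source is slower than the reader, so the last byte
        # of the suffix must arrive by the time the reader reaches it:
        #   dataset_size / local_rate
        #     >= startup_overhead + (dataset_size - prefix) / remote_rate
        prefix = (dataset_size * (1 - remote_rate / local_rate)
                  + remote_rate * startup_overhead)

    # A prefix can never exceed the dataset itself.
    return min(float(dataset_size), max(0.0, prefix))


# Example: 1000 MB dataset, 100 MB/s local reads, 50 MB/s remote
# download, 2 s startup overhead.
print(min_prefix_size(1000, 100, 50, 2))  # 600.0
```

In this toy setting, a slower remote source or a longer startup delay both push the required prefix up, which matches the two factors the abstract names.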




Published In

ICS '06: Proceedings of the 20th annual international conference on Supercomputing
June 2006
385 pages
ISBN: 1595932828
DOI: 10.1145/1183401
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]


Publisher

Association for Computing Machinery

New York, NY, United States


Conference

ICS06: International Conference on Supercomputing 2006
June 28 - July 1, 2006
Cairns, Queensland, Australia

Acceptance Rates

ICS '06 Paper Acceptance Rate 37 of 141 submissions, 26%;
Overall Acceptance Rate 629 of 2,180 submissions, 29%


Cited By

  • (2010) On-demand data co-allocation with user-level cache for grids. Concurrency and Computation: Practice and Experience 22:18 (2488-2513). DOI: 10.1002/cpe.1587. Online publication date: 12-Nov-2010
  • (2009) /scratch as a cache. Proceedings of the 23rd international conference on Supercomputing (350-359). DOI: 10.1145/1542275.1542325. Online publication date: 8-Jun-2009
  • (2009) Improving Data Availability for Better Access Performance: A Study on Caching Scientific Data on Distributed Desktop Workstations. Journal of Grid Computing 7:4 (419-438). DOI: 10.1007/s10723-009-9122-7. Online publication date: 16-Jul-2009
  • (2007) Optimizing center performance through coordinated data staging, scheduling and recovery. Proceedings of the 2007 ACM/IEEE conference on Supercomputing (1-11). DOI: 10.1145/1362622.1362696. Online publication date: 16-Nov-2007
  • (2007) Recovering transient data. ACM SIGOPS Operating Systems Review 41:1 (14-18). DOI: 10.1145/1228291.1228297. Online publication date: 1-Jan-2007
  • (2006) Positioning Dynamic Storage Caches for Transient Data. 2006 IEEE International Conference on Cluster Computing (1-9). DOI: 10.1109/CLUSTR.2006.311900. Online publication date: Sep-2006
