Data Reduction Analysis for Climate Data Sets

International Journal of Parallel Programming

Abstract

Global climate modeling not only demands substantial computing capability but also poses tough challenges for data storage systems. The input and output data sets generally require hundreds or even thousands of terabytes of storage. Storage reduction methods, such as content deduplication and data compression, are therefore extremely important for reducing the storage requirements of climate modeling. However, little work has been done to investigate how effective these data reduction methods are for climate data sets. In this paper, we study the potential benefit of data reduction for climate data by investigating a total of 46.5 TB of climate data, comprising 3 observation data sets (14.1 TB) and 3 climate model output data sets (32.4 TB). We apply five data compression algorithms and two types of content deduplication mechanisms to these data sets to assess the achievable data reduction. Furthermore, we examine the compressibility of data from different climate components. Our work demonstrates the potential of applying data reduction methods in climate modeling platforms and provides guidance for selecting suitable methods for different kinds of climate data sets. We find that the compression method LCFP provides the best compression ratio; however, its throughputs, especially its inflate throughput, are much lower than those of all the other methods. To strike a better balance between compression ratio and throughput, we propose a new compression method for the model output data. The new method achieves a comparable compression ratio while attaining an inflate throughput about 20 times higher than that of LCFP.
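The metrics named in the abstract (compression ratio, deflate and inflate throughput, and deduplication effectiveness) can be illustrated with a minimal Python sketch. It is not the authors' benchmark harness: the helper names `compression_stats` and `dedup_ratio`, the synthetic array of doubles, the 4096-byte chunk size, and the choice of zlib and bzip2 as example codecs are all assumptions made for illustration. The compression ratio is taken here as original size divided by compressed size, which may differ from the convention used in the paper, and only fixed-size chunking is shown for deduplication.

```python
import bz2
import hashlib
import struct
import time
import zlib


def compression_stats(data, compress, decompress):
    """Return (ratio, deflate MB/s, inflate MB/s) for one lossless codec.

    ratio = original size / compressed size (an assumed convention).
    """
    t0 = time.perf_counter()
    packed = compress(data)
    t1 = time.perf_counter()
    restored = decompress(packed)
    t2 = time.perf_counter()
    assert restored == data  # the round trip must be lossless
    megabytes = len(data) / 1e6
    ratio = len(data) / len(packed)
    deflate_tp = megabytes / max(t1 - t0, 1e-9)
    inflate_tp = megabytes / max(t2 - t1, 1e-9)
    return ratio, deflate_tp, inflate_tp


def dedup_ratio(data, chunk_size=4096):
    """Fixed-size chunk deduplication: total chunks / unique chunks."""
    chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]
    if not chunks:
        return 1.0
    unique = {hashlib.sha1(chunk).digest() for chunk in chunks}
    return len(chunks) / len(unique)


# Hypothetical stand-in for a climate field: 200,000 little-endian doubles.
values = [280.0 + 0.01 * (i % 500) for i in range(200_000)]
sample = struct.pack(f"<{len(values)}d", *values)

codecs = [
    ("zlib", zlib.compress, zlib.decompress),
    ("bzip2", bz2.compress, bz2.decompress),
]
for name, compress, decompress in codecs:
    ratio, deflate_tp, inflate_tp = compression_stats(sample, compress, decompress)
    print(f"{name}: ratio={ratio:.2f} "
          f"deflate={deflate_tp:.1f} MB/s inflate={inflate_tp:.1f} MB/s")

print(f"fixed-size dedup ratio: {dedup_ratio(sample):.2f}")
```

The numbers printed for this synthetic sample say nothing about real climate data; the sketch only shows how a trade-off between ratio and inflate throughput, such as the one reported for LCFP, could be measured.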


Acknowledgments

We would like to thank Ma Qiang from the China Meteorological Administration, Professor Lanning Wang from Beijing Normal University, and the researchers from the First Institute of Oceanography and the Chinese Academy of Sciences for providing access to their data sets. This research was sponsored by the National High Technology Development Program of China (2010AA012401, 2011AA01A203).

Author information

Corresponding author

Correspondence to Xiaomeng Huang.

About this article

Cite this article

Liu, S., Huang, X., Fu, H. et al. Data Reduction Analysis for Climate Data Sets. Int J Parallel Prog 43, 508–527 (2015). https://doi.org/10.1007/s10766-013-0287-0

  • DOI: https://doi.org/10.1007/s10766-013-0287-0
