Abstract
Global climate modeling not only demands substantial computing capability but also poses serious challenges for data storage systems. The input and output data sets generally require hundreds or even thousands of terabytes of storage. Storage reduction methods, such as content deduplication and various data compression methods, are therefore extremely important for reducing the storage requirements of climate modeling. However, little work has been done to investigate how effective these data reduction methods are for climate data sets. In this paper, the potential benefit of data reduction for climate data is studied by investigating a total of 46.5 TB of climate data, comprising 3 observation data sets (14.1 TB) and 3 climate model output data sets (32.4 TB). Five data compression algorithms and two types of content deduplication mechanisms are applied to these data sets to study the achievable data reduction. Furthermore, the compressibility of data from different climate components is examined. Our work demonstrates the potential of applying data reduction methods in climate modeling platforms and provides guidance for selecting suitable methods for different kinds of climate data sets. We find that the compression method LCFP provides the best compression ratio; however, its throughput, especially its inflate (decompression) throughput, is much lower than that of all the other methods. To strike a better balance between compression ratio and throughput, we propose a new compression method for the model output data. The new method achieves a comparable compression ratio while attaining about 20 times higher inflate throughput than LCFP.
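The trade-off between compression ratio and deflate/inflate throughput described above can be illustrated with a minimal sketch. It is not the evaluation harness used in the paper: the `measure` helper, the synthetic float32 field, and the definition of compression ratio as original size divided by compressed size are all assumptions made for illustration. Only the Python standard-library `zlib` and `bz2` modules are used.

```python
import bz2
import struct
import time
import zlib


def measure(name, deflate, inflate, raw):
    """Hypothetical helper: time one lossless round trip of `raw`.

    Assumption: compression ratio = original size / compressed size,
    and throughput is reported in MB/s of uncompressed data.
    """
    t0 = time.perf_counter()
    packed = deflate(raw)
    t1 = time.perf_counter()
    restored = inflate(packed)
    t2 = time.perf_counter()
    if restored != raw:
        raise ValueError(f"{name}: round trip is not lossless")
    megabytes = len(raw) / 1e6
    return {
        "name": name,
        "ratio": len(raw) / len(packed),
        "deflate_mb_s": megabytes / max(t1 - t0, 1e-9),
        "inflate_mb_s": megabytes / max(t2 - t1, 1e-9),
    }


# Synthetic stand-in for a smooth model output field (not real climate data):
# slowly varying single-precision values packed as little-endian float32.
values = [280.0 + 0.01 * (i % 1000) for i in range(200_000)]
raw = struct.pack(f"<{len(values)}f", *values)

results = [
    measure("zlib", zlib.compress, zlib.decompress, raw),
    measure("bzip2", bz2.compress, bz2.decompress, raw),
]

for r in results:
    print(
        f"{r['name']:6s} ratio={r['ratio']:.2f} "
        f"deflate={r['deflate_mb_s']:.1f} MB/s "
        f"inflate={r['inflate_mb_s']:.1f} MB/s"
    )
```

The absolute numbers depend on the machine and on the synthetic input, so they say nothing about real climate data; the sketch only shows how the two metrics are obtained from one compress/decompress round trip.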







Acknowledgments
We would like to thank Ma Qiang from the China Meteorological Administration, Professor Lanning Wang from Beijing Normal University, and the researchers from the First Institute of Oceanography and the Chinese Academy of Sciences for providing access to their data sets. This research was sponsored by the National High Technology Development Program of China (2010AA012401, 2011AA01A203).
Cite this article
Liu, S., Huang, X., Fu, H. et al. Data Reduction Analysis for Climate Data Sets. Int J Parallel Prog 43, 508–527 (2015). https://doi.org/10.1007/s10766-013-0287-0
DOI: https://doi.org/10.1007/s10766-013-0287-0