Abstract
New methods are needed to handle the increasing size of data sets in atmospheric science, because traditional analysis scripts often read and process the data inefficiently. NetCDF4 is a common file format in the atmospheric and ocean sciences, and Python is widely used for data analysis in these fields. The aim of this work is to provide insight into which read patterns and read sizes are most effective when using the netCDF4-python library. Quantitative information on these would be useful to scientists, library developers, and data managers.
Three read patterns were compared to simulate different types of access: sequential, strided, and random. Each was tested on three file systems: Panasas, Lustre, and GPFS. Read rate and standard deviation were measured using Python and C, reading from plain binary files and NetCDF4 files. Read performance for netCDF4-python was compared with that of native Python, the C NetCDF library, and the C POSIX library.
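To make the three access patterns concrete, the following is a minimal sketch of how sequential, strided, and random reads could be generated and timed against a plain binary file in native Python. It is an illustration only, not the benchmark used in this work: the function names `read_offsets` and `measure_read_rate`, the stride of 4 blocks, the 64 KiB read size, and the 4 MiB test file are all assumptions chosen for the example.

```python
import os
import random
import tempfile
import time


def read_offsets(pattern, file_size, read_size, n_reads, stride=4, seed=0):
    """Return byte offsets for one of three illustrative access patterns."""
    if pattern == "sequential":
        # Back-to-back blocks from the start of the file.
        offsets = [i * read_size for i in range(n_reads)]
    elif pattern == "strided":
        # Read one block, then skip (stride - 1) blocks (stride is an assumed value).
        offsets = [i * stride * read_size for i in range(n_reads)]
    elif pattern == "random":
        # Block-aligned offsets drawn uniformly at random.
        rng = random.Random(seed)
        n_blocks = file_size // read_size
        offsets = [rng.randrange(n_blocks) * read_size for _ in range(n_reads)]
    else:
        raise ValueError(f"unknown pattern: {pattern}")
    if offsets and max(offsets) + read_size > file_size:
        raise ValueError("pattern runs past end of file")
    return offsets


def measure_read_rate(path, offsets, read_size):
    """Read read_size bytes at each offset; return (bytes_read, bytes_per_second)."""
    total = 0
    start = time.perf_counter()
    with open(path, "rb", buffering=0) as f:  # no Python-level buffering
        for off in offsets:
            f.seek(off)
            total += len(f.read(read_size))
    elapsed = time.perf_counter() - start
    return total, total / elapsed if elapsed > 0 else float("inf")


if __name__ == "__main__":
    # Hypothetical sizes, chosen only so the example runs quickly.
    file_size, read_size, n_reads = 4 * 1024 * 1024, 64 * 1024, 16
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(os.urandom(file_size))
    try:
        for pattern in ("sequential", "strided", "random"):
            offsets = read_offsets(pattern, file_size, read_size, n_reads)
            nbytes, rate = measure_read_rate(tmp.name, offsets, read_size)
            print(f"{pattern:>10}: {nbytes} bytes at {rate / 1e6:.1f} MB/s")
    finally:
        os.unlink(tmp.name)
```

On a small, freshly written local file the operating system page cache will likely serve most reads, so the printed rates say little about a parallel file system. A meaningful comparison would need files much larger than memory and repeated runs to estimate the standard deviation.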
As expected, comparison between the different read modes shows that access pattern and read size significantly affect achieved performance. The results also show similar read performance profiles for the C, C NetCDF, and Python tests; netCDF4-python, however, performs less efficiently.
© 2016 Springer International Publishing AG
About this paper
Cite this paper
Jones, M., Blower, J., Lawrence, B., Osprey, A. (2016). Investigating Read Performance of Python and NetCDF When Using HPC Parallel Filesystems. In: Taufer, M., Mohr, B., Kunkel, J. (eds) High Performance Computing. ISC High Performance 2016. Lecture Notes in Computer Science(), vol 9945. Springer, Cham. https://doi.org/10.1007/978-3-319-46079-6_12
DOI: https://doi.org/10.1007/978-3-319-46079-6_12
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-46078-9
Online ISBN: 978-3-319-46079-6
eBook Packages: Computer Science, Computer Science (R0)