Investigating Read Performance of Python and NetCDF When Using HPC Parallel Filesystems

Jones, Matthew; Blower, Jon; Lawrence, Bryan; Osprey, Annette

doi:10.1007/978-3-319-46079-6_12

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 9945)

Included in the following conference series:

International Conference on High Performance Computing

Abstract

New methods are needed to handle the growing size of data sets in atmospheric science, because traditional analysis scripts often read and process data inefficiently. NetCDF4 is a common file format in the atmospheric and ocean sciences, and Python is widely used for data analysis in both fields. This work aims to show which read patterns and read sizes are most effective when using the netCDF4-python library. Quantitative information on these would be useful to scientists, library developers, and data managers.

Three read patterns were compared to simulate different types of reads: sequential, strided, and random. Each was tested on three file systems: Panasas, Lustre, and GPFS. Read rate and standard deviation were measured using Python and C, reading from plain binary files and NetCDF4 files. Read performance for netCDF4-python was compared with that of native Python, the C NetCDF library, and the C POSIX library.
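The three read patterns can be illustrated with a minimal Python sketch on a plain binary file. This is not the authors' benchmark code, which this preview does not show. The function names (`make_offsets`, `time_reads`, `benchmark`), the stride of 4 blocks, the file size, and the read size are all assumptions chosen for illustration. On a small local file the operating system's page cache will dominate the timings, so the printed rates say nothing about Panasas, Lustre, or GPFS.

```python
import os
import random
import statistics
import tempfile
import time


def make_offsets(pattern, file_size, read_size, stride=4, seed=0):
    """Return the byte offsets visited by one pass of the given read pattern."""
    n_blocks = file_size // read_size
    if pattern == "sequential":
        blocks = list(range(n_blocks))
    elif pattern == "strided":
        # Read every `stride`-th block and skip the blocks in between.
        blocks = list(range(0, n_blocks, stride))
    elif pattern == "random":
        # Visit every block exactly once, in a shuffled order.
        blocks = list(range(n_blocks))
        random.Random(seed).shuffle(blocks)
    else:
        raise ValueError(f"unknown pattern: {pattern}")
    return [b * read_size for b in blocks]


def time_reads(path, offsets, read_size):
    """Read `read_size` bytes at each offset; return (bytes_read, seconds)."""
    total = 0
    start = time.perf_counter()
    # buffering=0 disables Python-level buffering; the OS page cache still applies.
    with open(path, "rb", buffering=0) as f:
        for off in offsets:
            f.seek(off)
            total += len(f.read(read_size))
    elapsed = time.perf_counter() - start
    return total, elapsed


def benchmark(path, pattern, read_size, repeats=3):
    """Return (mean, standard deviation) of the read rate in MB/s."""
    file_size = os.path.getsize(path)
    rates = []
    for i in range(repeats):
        offsets = make_offsets(pattern, file_size, read_size, seed=i)
        nbytes, seconds = time_reads(path, offsets, read_size)
        rates.append(nbytes / max(seconds, 1e-9) / 1e6)
    return statistics.mean(rates), statistics.stdev(rates)


if __name__ == "__main__":
    # Small throwaway file so the sketch runs anywhere (assumed size: 4 MiB).
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(os.urandom(4 * 1024 * 1024))
    try:
        for pattern in ("sequential", "strided", "random"):
            mean, std = benchmark(tmp.name, pattern, read_size=64 * 1024)
            print(f"{pattern:10s} {mean:10.1f} MB/s (std {std:.1f})")
    finally:
        os.remove(tmp.name)
```

Generating the offsets separately from the timed loop keeps pattern construction out of the measurement, so only the seek and read calls are timed.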

As expected, comparing the read modes shows that access pattern and read size significantly affect achieved performance. The results also show similar read performance profiles for the C, C NetCDF, and Python tests; netCDF4-python, however, performs less efficiently.

Author information

Authors and Affiliations

Department of Meteorology, University of Reading, Reading, UK
Matthew Jones, Jon Blower, Bryan Lawrence & Annette Osprey
STFC Rutherford Appleton Laboratory, Centre for Environmental Data Analysis, Didcot, UK
Bryan Lawrence
National Centre for Atmospheric Science, Manchester, UK
Bryan Lawrence & Annette Osprey

Corresponding author

Correspondence to Matthew Jones.

Editor information

Editors and Affiliations

University of Delaware, Newark, Delaware, USA
Michela Taufer
Forschungszentrum Jülich, Jülich, Germany
Bernd Mohr
DKRZ, Hamburg, Germany
Julian M. Kunkel

About this paper

Cite this paper

Jones, M., Blower, J., Lawrence, B., Osprey, A. (2016). Investigating Read Performance of Python and NetCDF When Using HPC Parallel Filesystems. In: Taufer, M., Mohr, B., Kunkel, J. (eds) High Performance Computing. ISC High Performance 2016. Lecture Notes in Computer Science, vol 9945. Springer, Cham. https://doi.org/10.1007/978-3-319-46079-6_12

DOI: https://doi.org/10.1007/978-3-319-46079-6_12
Published: 06 October 2016
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-46078-9
Online ISBN: 978-3-319-46079-6
eBook Packages: Computer Science, Computer Science (R0)
