Investigating Read Performance of Python and NetCDF When Using HPC Parallel Filesystems

  • Conference paper
High Performance Computing (ISC High Performance 2016)

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 9945)

Abstract

New methods need to be developed to handle the increasing size of data sets in atmospheric science: traditional analysis scripts often read and process the data inefficiently. NetCDF4 is a common file format in the atmospheric and ocean sciences, and Python is widely used for data analysis in both fields. The aim of this work is to provide insight into which read patterns and read sizes are most effective when using the netCDF4-python library. Quantitative information on these would be useful to scientists, library developers, and data managers.

Three read patterns were compared to simulate different types of reads: sequential, strided, and random. Each was tested on three file systems: Panasas, Lustre, and GPFS. Read rate and standard deviation were measured using Python and C, reading from plain binary files and NetCDF4 files. Read performance for netCDF4-python was compared with the performance of native Python, the C NetCDF library, and the C POSIX library.

As expected, comparing the read modes shows that access pattern and read size significantly affect achieved performance. The results also show similar read performance profiles for the C, C NetCDF, and Python tests; however, netCDF4-python performs less efficiently.
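
To make the three access patterns concrete, the following is a minimal sketch of how sequential, strided, and random reads of a plain binary file might be generated and timed in native Python. It is not the authors' benchmark code: the function names (`make_offsets`, `time_reads`, `benchmark`), the default stride, the use of wall-clock timing, and the MB/s unit are all assumptions made for illustration. The sketch also does nothing to bypass operating system or file system caches, which a real measurement on a parallel file system would need to consider.

```python
import os
import random
import statistics
import time


def make_offsets(pattern, file_size, read_size, n_reads, stride=4, seed=0):
    """Return byte offsets for one access pattern (hypothetical helper)."""
    n_blocks = file_size // read_size
    if n_blocks == 0:
        raise ValueError("read_size is larger than the file")
    if pattern == "sequential":
        # Consecutive blocks, wrapping around at the end of the file.
        blocks = [i % n_blocks for i in range(n_reads)]
    elif pattern == "strided":
        # Every `stride`-th block, wrapping around at the end of the file.
        blocks = [(i * stride) % n_blocks for i in range(n_reads)]
    elif pattern == "random":
        # Uniformly random blocks; seeded so repeated runs use the same offsets.
        rng = random.Random(seed)
        blocks = [rng.randrange(n_blocks) for _ in range(n_reads)]
    else:
        raise ValueError("unknown pattern: %r" % pattern)
    return [b * read_size for b in blocks]


def time_reads(path, offsets, read_size):
    """Read read_size bytes at each offset; return (bytes_read, seconds)."""
    total = 0
    # buffering=0 disables Python's own buffer; OS caching still applies.
    with open(path, "rb", buffering=0) as f:
        start = time.perf_counter()
        for offset in offsets:
            f.seek(offset)
            total += len(f.read(read_size))
        elapsed = time.perf_counter() - start
    return total, elapsed


def benchmark(path, pattern, read_size, n_reads, repeats=3):
    """Return (mean, standard deviation) of the read rate in MB/s."""
    offsets = make_offsets(pattern, os.path.getsize(path), read_size, n_reads)
    rates = []
    for _ in range(repeats):
        total, elapsed = time_reads(path, offsets, read_size)
        rates.append(total / max(elapsed, 1e-9) / 1e6)
    return statistics.mean(rates), statistics.stdev(rates)
```

Calling `benchmark(path, "strided", 65536, 1000)` on an existing file would return a mean rate and its standard deviation over three repeats. Repeating the call for several values of `read_size` and for each pattern gives a simple read performance profile of the kind the abstract describes, although results from a cached local file say little about Panasas, Lustre, or GPFS.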



Author information

Corresponding author

Correspondence to Matthew Jones.

Copyright information

© 2016 Springer International Publishing AG

About this paper

Cite this paper

Jones, M., Blower, J., Lawrence, B., Osprey, A. (2016). Investigating Read Performance of Python and NetCDF When Using HPC Parallel Filesystems. In: Taufer, M., Mohr, B., Kunkel, J. (eds) High Performance Computing. ISC High Performance 2016. Lecture Notes in Computer Science, vol 9945. Springer, Cham. https://doi.org/10.1007/978-3-319-46079-6_12

  • DOI: https://doi.org/10.1007/978-3-319-46079-6_12

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-46078-9

  • Online ISBN: 978-3-319-46079-6

  • eBook Packages: Computer Science, Computer Science (R0)
