
SWIRL++: Evaluating Performance Models to Guide Code Transformation in Convolutional Neural Networks

Conference paper in Languages and Compilers for Parallel Computing (LCPC 2019)

Abstract

Convolutional Neural Networks (CNNs) are ubiquitous in applications ranging from self-driving cars to various branches of health care. CPUs with large core counts and wide SIMD support are used in HPC clusters and supercomputers; therefore, high-performance CPU implementations of CNNs are valuable, in addition to the more prevalent GPU implementations. In this paper, we describe SWIRL++, an optimization approach for CNNs that incorporates an analytical performance model to identify optimization strategies that minimize the data movement overheads of CNN execution. We integrate the model with the SWIRL DSL compiler to automatically generate high-performance implementations of CNNs, optimized for cache hierarchies and for both thread-level and SIMD parallelism.
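
As an illustration of the kind of reasoning such an analytical model performs, the sketch below estimates the cache-to-memory traffic of a tiled direct convolution and ranks candidate tile sizes by that estimate. The function name, tiling parameters, and cost formula are illustrative assumptions for exposition only, not the actual SWIRL++ model.

```python
# Hypothetical sketch: estimate data movement for one tiling choice of a
# direct convolution loop nest. All names and the cost formula are assumed
# for illustration; they are not taken from SWIRL++.

def conv_data_movement_bytes(H, W, C_in, C_out, K, tile_h, tile_w, tile_oc,
                             elem_bytes=4):
    """Rough per-image estimate of bytes moved between cache and memory
    when the output is computed in (tile_h x tile_w x tile_oc) blocks."""
    out_h, out_w = H - K + 1, W - K + 1            # 'valid' convolution output size
    n_tiles = (((out_h + tile_h - 1) // tile_h)
               * ((out_w + tile_w - 1) // tile_w)
               * ((C_out + tile_oc - 1) // tile_oc))
    # Input halo re-read once per tile, weights re-read once per tile,
    # output written once in total.
    input_bytes = n_tiles * (tile_h + K - 1) * (tile_w + K - 1) * C_in * elem_bytes
    weight_bytes = n_tiles * tile_oc * C_in * K * K * elem_bytes
    output_bytes = out_h * out_w * C_out * elem_bytes
    return input_bytes + weight_bytes + output_bytes

# Enumerate a small space of tile sizes and keep the cheapest, mimicking how
# a model-driven compiler can rank transformation choices without running them.
candidates = [(th, tw, toc) for th in (4, 8, 16)
                            for tw in (4, 8, 16)
                            for toc in (16, 32, 64)]
best = min(candidates,
           key=lambda t: conv_data_movement_bytes(56, 56, 64, 64, 3, *t))
print("lowest-traffic tile:", best)
```

Evaluating such closed-form estimates for many candidate transformations is far cheaper than empirically compiling and timing each generated variant, which is the appeal of a model-guided approach.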

We compare the resulting performance of generated code with TensorFlow integrated with Intel’s MKL-DNN library (TF-MKL) and with PyTorch on an Intel Xeon 8280 Cascade Lake platform. Performance exceeds PyTorch on average by \(2\times\), and is comparable on average to both TF-MKL and the SWIRL compiler, showing that an automated code optimization approach achieves performance comparable to hand-tuned libraries and DSL compiler techniques.
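
For context on how such framework comparisons are typically measured, the snippet below is a minimal per-layer CPU timing sketch in PyTorch. It is not the paper's benchmarking harness; the layer shape, thread count, and iteration counts are assumptions chosen for illustration.

```python
# Hypothetical micro-benchmark sketch (not the paper's harness): time one
# convolution layer's forward pass in PyTorch on the CPU.
import time
import torch

torch.set_num_threads(28)                    # assumed core count; adjust per machine
conv = torch.nn.Conv2d(64, 64, kernel_size=3, padding=1)
x = torch.randn(32, 64, 56, 56)              # batch of 32, VGG-like layer shape

with torch.no_grad():
    for _ in range(5):                       # warm-up iterations
        conv(x)
    start = time.perf_counter()
    for _ in range(50):
        conv(x)
    elapsed = (time.perf_counter() - start) / 50

print(f"avg forward time: {elapsed * 1e3:.2f} ms")
```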

R. Barik was affiliated with Intel Labs during the course of this work.



Acknowledgement

This research was supported by the Exascale Computing Project (17-SC-20-SC), a collaborative effort of the U.S. Department of Energy Office of Science and the National Nuclear Security Administration, by the Department of Energy Scientific Discovery through Advanced Computation program, and by the National Science Foundation under CCF-1564074.

Author information

Corresponding author: Tharindu R. Patabandi.


Copyright information

© 2021 Springer Nature Switzerland AG

About this paper


Cite this paper

Patabandi, T.R., Venkat, A., Barik, R., Hall, M. (2021). SWIRL++: Evaluating Performance Models to Guide Code Transformation in Convolutional Neural Networks. In: Pande, S., Sarkar, V. (eds) Languages and Compilers for Parallel Computing. LCPC 2019. Lecture Notes in Computer Science, vol. 11998. Springer, Cham. https://doi.org/10.1007/978-3-030-72789-5_9


  • DOI: https://doi.org/10.1007/978-3-030-72789-5_9

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-72788-8

  • Online ISBN: 978-3-030-72789-5

  • eBook Packages: Computer Science (R0)
