Abstract
This paper comparatively evaluates the microarchitectural performance of two representative Computational Fluid Dynamics (CFD) applications on the Intel Many Integrated Core (MIC) product, the Intel Knights Corner (KNC) coprocessor, and the Intel Sandy Bridge (SNB) processor. A Performance Monitoring Unit (PMU)-based measurement method is used, together with a two-phase measurement approach and several precautions to minimize errors and instabilities. The results show that the CFD applications are sensitive to architectural factors. Their single-thread performance and efficiency on KNC are much lower than on SNB. Branch prediction and memory access are the two primary factors behind the performance difference. The applications’ low computational intensity and inefficient use of vector instructions are two additional factors. To be more efficient for the CFD applications, the MIC architecture needs to improve its branch prediction mechanism and memory hierarchy. Fine-tuning of application codes is also crucial and requires considerable effort.



















Acknowledgments
The authors would like to thank the HPC Application Research Center of the National University of Defense Technology for providing the platforms used in the performance evaluation. The authors would also like to thank Huayong Liu from the State Key Laboratory of Aerodynamics of China for his help. This work was partially supported by the National Natural Science Foundation of China under Grant Nos. 60603055 and 11272352, and by the Open Research Program of the China State Key Laboratory of Aerodynamics under Grant No. SKLA20130105.
Cite this article
Che, Y., Zhang, L., Wang, Y. et al. Microarchitectural performance comparison of Intel Knights Corner and Intel Sandy Bridge with CFD applications. J Supercomput 70, 321–348 (2014). https://doi.org/10.1007/s11227-014-1245-3
DOI: https://doi.org/10.1007/s11227-014-1245-3