Towards enhanced I/O performance of a highly integrated many-core processor by empirical analysis

Lee, Cheongjun; Lee, Jaehwan; Koo, Donghun; Kim, Chungyong; Bang, Jiwoo; Byun, Eun-Kyu; Eom, Hyeonsang

doi:10.1007/s10586-021-03288-2

Published: 01 May 2021

Cluster Computing, Volume 26, pages 2643–2655 (2023)

Cheongjun Lee¹,
Jaehwan Lee ORCID: orcid.org/0000-0001-6248-9567¹,
Donghun Koo²,
Chungyong Kim²,
Jiwoo Bang²,
Eun-Kyu Byun³ &
Hyeonsang Eom²

Abstract

Optimized for parallel operations, Intel’s second-generation Xeon Phi processor, code-named Knights Landing (KNL), is actively used in high-performance computing systems because of its highly integrated cores and its high-bandwidth on-package memory, Multi-Channel DRAM (MCDRAM). The recent emergence of data-intensive applications, together with the adoption of many-core processors, has further raised the I/O performance requirements of high-performance computing systems. It is therefore necessary to understand and analyze the I/O characteristics of many-integrated-core systems. In this paper, we experimentally analyze the I/O characteristics of KNL, focusing on single-threaded, buffered write operations. We find that KNL has a bottleneck in its buffered write operation, which uses the page cache. To locate this bottleneck and identify its cause, we conduct two kinds of experiments. First, we measure the execution time of kernel functions along the kernel I/O path. Second, we count occurrences of system events such as cache misses and branch misses. Based on the results of these experiments, we discuss the characteristics of KNL’s I/O performance, including its performance bottlenecks.
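To make the workload concrete, the following is a minimal sketch of a single-threaded buffered write whose time is measured from user space. It is not the authors' benchmark or measurement method: the function names (`timed_buffered_write`, `run_demo`), the 1 MiB block size, and the separate timing of `fsync` are all assumptions chosen for illustration. The `write` phase mostly measures copying data into the page cache, while the `fsync` phase measures flushing dirty pages to storage.

```python
import os
import tempfile
import time


def timed_buffered_write(path, total_bytes, block_size=1 << 20):
    """Write total_bytes to path from one thread using buffered writes.

    Hypothetical helper for illustration only.
    Returns (bytes_written, seconds_in_write, seconds_in_fsync).
    """
    block = b"\0" * block_size
    written = 0
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    try:
        start = time.perf_counter()
        while written < total_bytes:
            # The last chunk may be shorter than block_size.
            chunk = block[: min(block_size, total_bytes - written)]
            # os.write may write fewer bytes than requested, so count
            # what was actually written.
            written += os.write(fd, chunk)
        after_write = time.perf_counter()
        # Flush dirty pages to storage; timed separately from the
        # page-cache copy above.
        os.fsync(fd)
        after_sync = time.perf_counter()
    finally:
        os.close(fd)
    return written, after_write - start, after_sync - after_write


def run_demo(total_bytes=8 << 20):
    """Run the sketch against a temporary file and report sizes and timings."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "buffered_write.bin")
        written, t_write, t_sync = timed_buffered_write(path, total_bytes)
        size_on_disk = os.path.getsize(path)
    return written, size_on_disk, t_write, t_sync
```

On Linux, one common way to count events such as cache misses and branch misses for a script like this is `perf stat -e cache-misses,branch-misses python3 script.py`, where `script.py` is a placeholder name. Whether the authors used this exact invocation is not stated in the abstract.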

Data availability

Available upon request

Funding

This work was supported by the Korea Institute of Science and Technology Information (K-21-L02-C08-S01), the National Supercomputing Center with supercomputing resources including technical support (KSC-2020-INO-0044), the PF Class Heterogeneous High Performance Computer Development Program (NRF-2016M3C4A7952587), the Next-Generation Information Computing Development Program (NRF-2015M3C4A7065646), the Basic Science Research Program (NRF-2020R1F1A1072696), and BK21 FOUR Intelligence Computing (4199990214639, Dept. of Computer Science and Engineering, Seoul National University) through the National Research Foundation of Korea (NRF) funded by the Ministry of Science and ICT, the Seoul R&D Program (CY20038) “Commercializing of technology for CAT Pro Web service based on AI”, and the Technology Development Program (S2878336) funded by the Ministry of SMEs and Startups (MSS, Korea).

Author information

Authors and Affiliations

Korea Aerospace University, Goyang-si, Republic of Korea
Cheongjun Lee & Jaehwan Lee
Seoul National University, Seoul, Republic of Korea
Donghun Koo, Chungyong Kim, Jiwoo Bang & Hyeonsang Eom
Korea Institute of Science and Technology Information, Daejeon, Republic of Korea
Eun-Kyu Byun

Contributions

CL: Software, Writing—original draft, Review, Validation. JL: Conceptualization, Supervision, Writing—original draft, Funding acquisition, Project administration. DK: Software, Validation, Data curation. CK: Writing—original, Validation, Methodology. JB: Methodology, Data curation. EB: Supervision, Resources, Funding acquisition. HE: Project administration, Funding acquisition, Conceptualization.

Corresponding author

Correspondence to Jaehwan Lee.

Ethics declarations

Conflict of interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Lee, C., Lee, J., Koo, D. et al. Towards enhanced I/O performance of a highly integrated many-core processor by empirical analysis. Cluster Comput 26, 2643–2655 (2023). https://doi.org/10.1007/s10586-021-03288-2

Received: 04 January 2021
Revised: 12 April 2021
Accepted: 19 April 2021
Published: 01 May 2021
Issue Date: October 2023
DOI: https://doi.org/10.1007/s10586-021-03288-2
