Characterizing and optimizing Java-based HPC applications on Intel many-core architecture

Yu, Yang; Lei, Tianyang; Chen, Haibo; Zang, Binyu

doi:10.1007/s11432-015-0989-3

Chinese title (translated): Research and optimization of Java high-performance computing on the Intel many-core architecture

Research Paper
Published: 08 May 2017

Volume 60, article number 122106, (2017)
Science China Information Sciences

Yang Yu^1,2, Tianyang Lei^2, Haibo Chen^2 & Binyu Zang^2

Abstract

The increasing demand for performance has stimulated the wide adoption of many-core accelerators such as the Intel® Xeon Phi™ coprocessor, which is based on Intel's Many Integrated Core (MIC) architecture. While many HPC applications running in native mode have been tuned to run efficiently on Xeon Phi, it is still unclear how a managed runtime such as the JVM performs on this architecture. In this paper, we present the first measurement study of a set of Java HPC applications running under a JVM on Xeon Phi. One key obstacle to the study is that Java currently has little support for Xeon Phi. The results presented here are based on the first port of the OpenJDK platform to Xeon Phi, in which the HotSpot virtual machine acts as the core execution engine. The main difficulty is the incompatibility between the Xeon Phi ISA and the assembly library of the HotSpot VM. By evaluating the multithreaded Java Grande benchmark suite and our ported Java Phoenix benchmarks, we quantitatively study the performance and scalability issues of the JVM on Xeon Phi and draw several conclusions. To fully utilize the vector computing capability of the coprocessor and hide its significant memory access latency, we present a semi-automatic vectorization scheme and a software prefetching model in HotSpot. With 60 physical cores and tuning, our optimized JVM achieves average speedups of 2.7x and 3.5x over a Xeon CPU by using vectorization and prefetching, respectively. Our study also indicates that running applications written for a managed runtime such as the JVM on Xeon Phi is viable and potentially beneficial for performance.

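Abstract (translated from Chinese)

Highlights

The Xeon Phi coprocessor, based on Intel's Many Integrated Core (MIC) architecture, has been a very popular many-core product in recent years. Java, thanks to its excellent platform portability and the steadily improving performance of its virtual machines, is increasingly used in high-performance computing. Unfortunately, Intel does not provide Java environment support for Xeon Phi. This paper implements the first port of OpenJDK to the Intel MIC platform and successfully builds a complete Java runtime environment. Using a set of compute-intensive Java programs, we study in detail the throughput and scalability of Java HPC on MIC, and for the problems identified we propose a semi-automatic vectorization model and a data prefetching solution, respectively. Experiments show that the proposed optimizations bring significant performance improvements on MIC, and demonstrate that Intel's many-core platform has great potential for Java high-performance computing.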

Author information

Authors and Affiliations

School of Computer Science, Fudan University, Shanghai, 200433, China
Yang Yu
Institute of Parallel and Distributed Systems, Shanghai Jiao Tong University, Shanghai, 200240, China
Yang Yu, Tianyang Lei, Haibo Chen & Binyu Zang

Corresponding author

Correspondence to Binyu Zang.

About this article

Cite this article

Yu, Y., Lei, T., Chen, H. et al. Characterizing and optimizing Java-based HPC applications on Intel many-core architecture. Sci. China Inf. Sci. 60, 122106 (2017). https://doi.org/10.1007/s11432-015-0989-3

Received: 03 October 2016
Accepted: 13 December 2016
Published: 08 May 2017
DOI: https://doi.org/10.1007/s11432-015-0989-3
