
Theoretical peak FLOPS per instruction set: a tutorial


A Correction to this article was published on 27 March 2022

Abstract

Traditionally, evaluating the theoretical peak performance of a CPU in FLOPS (floating-point operations per second) was merely a matter of multiplying the frequency by the number of floating-point instructions per cycle. Today, however, CPUs have features such as vectorization, fused multiply-add, hyperthreading, and “turbo” mode. In this tutorial, we look into this theoretical peak for recent fully featured Intel CPUs and other hardware, taking into account not only the simple absolute peak, but also the relevant instruction sets, their encoding, and the frequency scaling behaviour of modern hardware.
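The arithmetic behind such a peak is a simple product of the factors named above. The following C sketch makes it concrete; every parameter value is an illustrative assumption (loosely resembling an AVX2 + FMA Xeon core), not a figure taken from this article.

    /* Minimal sketch: theoretical peak double-precision FLOPS.
       All parameter values are illustrative assumptions. */
    #include <stdio.h>

    int main(void) {
        double freq_ghz   = 2.3; /* assumed sustained clock frequency, GHz */
        int simd_lanes    = 4;   /* doubles per 256-bit AVX2 register */
        int flops_per_fma = 2;   /* one fused multiply-add counts as two operations */
        int fma_units     = 2;   /* assumed FMA-capable execution ports per core */
        int cores         = 14;  /* assumed physical core count */

        double per_core = freq_ghz * simd_lanes * flops_per_fma * fma_units;
        printf("peak: %.1f GFLOPS/core, %.1f GFLOPS/chip\n",
               per_core, per_core * cores);
        return 0;
    }

With these assumed values the product is 36.8 GFLOPS per core (515.2 GFLOPS for the chip); in practice, wide-vector code may run at a reduced clock rate, which is part of the frequency scaling behaviour examined in the tutorial.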



Notes

  1. Most processors even older than Nehalem, provided they support SSE2, fall into the same category. Strictly speaking, SSE only supports single-precision floating-point operations; SSE2 adds double precision. Processors without SSE2 have to rely on x87 for double-precision arithmetic and are not considered. In the remainder of this tutorial, the term SSE will be used to describe the SSE & SSE2 combination, since both are mandatory on all x86-64 processors (a minimal code sketch illustrating this follows these notes).

  2. i.e., -march=native -mtune=native.

  3. Numbers not shown are the same as for the next shown number, e.g. using 18 cores has the same limits as using 20 cores.

  4. Beware that consumer-grade GPUs might have degraded double-precision performance compared to their compute-oriented siblings; this is documented in footnotes of the aforementioned table.

  5. The now-obsolete Tesla micro-architecture (Compute Capability 1.x) also supported an extra multiplication-only single-precision pipeline, but we only consider Fermi and newer (Compute Capability 2.x and higher) GPUs here.
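To make the SSE/SSE2 distinction of note 1 concrete, here is a minimal C sketch using SSE2 intrinsics (a hedged illustration, not code from the article): a single ADDPD instruction operates on two packed doubles, so it counts as two double-precision floating-point operations toward the peak.

    /* Minimal sketch for note 1: double precision requires SSE2, whose
       128-bit registers hold two doubles, so one ADDPD performs 2 FLOPs. */
    #include <emmintrin.h> /* SSE2 intrinsics, mandatory on x86-64 */
    #include <stdio.h>

    int main(void) {
        __m128d a = _mm_set_pd(1.5, 2.5);  /* pack two doubles (high, low) */
        __m128d b = _mm_set_pd(3.0, 4.0);
        __m128d c = _mm_add_pd(a, b);      /* one ADDPD = 2 FP operations */
        double out[2];
        _mm_storeu_pd(out, c);             /* store both lanes, low first */
        printf("%g %g\n", out[0], out[1]); /* prints 6.5 4.5 */
        return 0;
    }

The same computation in x87 would need two scalar FADD instructions, which is why pre-SSE2 processors are excluded from the double-precision discussion.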


Author information


Corresponding author

Correspondence to Romain Dolbeau.

Additional information

The author thanks all the colleagues who’ve helped with this manuscript, and in particular his Atos UK colleagues Crispin Keable, Neeraj Morar and Martyn Foster for their help with the language. The author also thanks the editors and anonymous referees for their helpful suggestions in improving this manuscript. Any errors remaining in this paper are the fault of the author alone.

The original online version of this article was revised: an error in two units at the end of the section “NVidia graphical processing units” was corrected.


About this article


Cite this article

Dolbeau, R. Theoretical peak FLOPS per instruction set: a tutorial. J Supercomput 74, 1341–1377 (2018). https://doi.org/10.1007/s11227-017-2177-5
