Instruction-throughput regulation in computer processors with data-center applications

Chen, Xinwei; Wardi, Yorai; Yalamanchili, Sudhakar

doi:10.1007/s10626-017-0254-9

Published: 22 August 2017

Volume 28, pages 127–158, (2018)

Discrete Event Dynamic Systems

Abstract

This paper tests a recently proposed technique for regulating the output performance of Discrete Event Dynamic Systems and Stochastic Hybrid Systems. The controller is based on an integrator with a variable gain, adjusted so as to guarantee wide stability margins of the closed-loop system. The gain is adjusted by estimating, in real time, the derivative of the plant function via approximations to its IPA (infinitesimal perturbation analysis) derivative. The technique is robust to computational errors in the loop, and hence these approximations are designed for fast computation rather than precision. The development of the regulation technique was motivated by applications in computer processors, and the technique has been extensively tested in the past on a cycle-level, full-system simulator. In this paper we describe implementations of the regulator on an Intel machine based on the Haswell processor, and apply it to control the instruction throughput of various industry benchmark programs as well as data-center applications.

Notes

The plant can be stochastic or deterministic, and correspondingly the plant function is either a realization of a random function or a deterministic function. This point is clarified in the later discussion.
The term “drastic change” is not used here in a stochastic sense, as in singularly perturbed processes (Levy and Vázquez-Abad 2010), where it refers to stochastic processes that change abruptly between stationary regimes. We use the term simply to mean a change in the plant system’s input-output characteristics.
Typically a core is dedicated to the processing of a program or a thread, namely a subprogram, as determined by the programmer or the operating system. In the forthcoming discussion we will use the term program to designate a thread as well.
Modern microprocessors include many hardware counters that record the occurrences of various events during program executions. Examples of such events include i) completion of the execution of an integer instruction, ii) a cache miss, or iii) an instruction that accesses memory. The Performance Application Programming Interface (PAPI) is a publicly available software infrastructure for accessing these performance counters during program execution.

References

Almoosa N, Song W, Wardi Y, Yalamanchili S (2012a) A power capping controller for multicore processors. In: Proceedings of the 2012 American control conference. Montreal
Almoosa N, Song W, Wardi Y, Yalamanchili S (2012b) Throughput regulation in multicore processors via IPA. In: Proceedings of the 51st IEEE Conference on decision and control (CDC). Maui
Bauer M, Pacher M, Brinkschulte U (2010) A chip-size evaluation of a multi-threaded processor enhanced with a PID controller. In: Proceedings of the 8th IFIP workshop on software technologies for future embedded and ubiquitous systems (SEUS 2010). Waidhofen
Brinkschulte U, Pacher M (2009) A theoretical examination of a self-adaptation approach to improve the real-time capabilities in multi-threaded microprocessors. In: Proceedings of the 2009 Third IEEE international conference on self-adaptive and self-organizing systems. San Francisco
Browne S, Dongarra J, Garner N, Ho G, Mucci P (2000) A portable programming interface for performance evaluation on modern processors. Int J High Perform Comput Appl 14(3):189–204
Cassandras CG (2006) Stochastic flow systems: modeling and sensitivity analysis. In: Cassandras CG, Lygeros J (eds) Stochastic hybrid systems: recent developments and research trends. CRC Press, New York, pp 137–165
Cassandras CG, Lafortune S (2008) Introduction to discrete event systems, 2nd edn. Springer
Cassandras CG, Wardi Y, Melamed B, Sun G, Panayiotou CG (2002) Perturbation analysis for on-line control and optimization of stochastic fluid models. IEEE Trans Autom Control 47(8):1234–1248
Cassandras CG, Wardi Y, Panayiotou CG, Yao C (2010) Perturbation analysis and optimization of stochastic hybrid systems. Eur J Control 16:642–664
Chen X, Xiao H, Wardi Y, Yalamanchili S (2015) Throughput regulation in shared memory multicore processors. In: Proceedings of the 22nd IEEE Intl. conference on high performance computing (HiPC). Bengaluru
Chen X, Wardi Y, Yalamanchili S (2016) IPA in the loop: control design for throughput regulation in computer processors. In: Proceedings of the 13th international workshop on discrete event systems (WODES 2016). Xi’an
Franklin GF, Powell JD, Emami-Naeini A (2015) Feedback control of dynamic systems. Prentice Hall
Hammarlund P, Martinez AJ, Bajwa AA, Hill DL, Hallnor E, Jiang H, Dixon M, Derr M, Hunsaker M, Kumar R, Osborne RB, Rajwar R, Singhal R, D’Sa R, Chappell R, Kaushik S, Chennupaty S, Jourdan S, Gunther S, Piazza T, Butron T (2014) Haswell: the fourth-generation Intel Core processor. IEEE Micro 34(2):6–20
Hennessy JL, Patterson DA (2012) Computer architecture: a quantitative approach. Morgan Kaufmann
Ho YC, Cao XR (1991) Perturbation analysis of discrete event dynamic systems. Kluwer Academic Publishers, Boston
Lancaster P (1966) Error analysis for the Newton-Raphson method. Numer Math 9:55–68
Levy K, Vázquez-Abad FJ (2010) Change-point monitoring for online stochastic approximations. Automatica 46:1657–1674
Lohn D, Pacher M, Brinkschulte U (2011) A generalized model to control the throughput in a processor for real-time applications. In: 2011 14th IEEE International symposium on object/component/service-oriented real-time distributed computing. Newport Beach
Nai L, Xia Y, Tanase IG, Kim H, Lin CY (2015) GraphBIG: understanding graph computing in the context of industrial solutions. In: SC15 proceedings of the international conference for high performance computing, networking, storage and analysis. Austin
Tanase IG, Xia Y, Nai L, Liu Y, Tan W, Crawford J, Lin C-Y (2014) A highly efficient runtime and graph library for large scale graph analytics. In: GRADES14 proceedings of workshop on graph data management experiences and systems. Utah
Wang J, Beu J, Behda R, Conte T, Dong Z, Kersey C, Rasquinha M, Riley G, Song W, Xiao H, Xu P, Yalamanchili S (2014) Manifold: a parallel simulation framework for multicore systems. In: Proceedings of the IEEE International symposium on performance evaluation of systems and software (ISPASS)
Wardi Y, Seatzu C (2017) Performance regulation in discrete event and hybrid dynamical systems using IPA. Eur J Control 36:51–61. doi:10.1016/j.ejcon.2017.02.004
Wardi Y, Seatzu C, Chen X, Yalamanchili S (2016) Performance regulation of event-driven dynamical systems using infinitesimal perturbation analysis. Nonlinear Anal Hybrid Syst 22:116–136. Also in arXiv:1601.03799v1 [math.OC]
Woo SC, Ohara M, Torrie E, Singh JP, Gupta A (1995) The SPLASH-2 programs: characterization and methodological considerations. In: Proceedings of the 22nd annual international symposium on computer architecture (ISCA’95). Santa Margherita Ligure

Acknowledgments

Research supported in part by the NSF under Grant Number CNS-1239225.

Author information

Authors and Affiliations

School of Electrical and Computer Engineering, Georgia Institute of Technology, Atlanta, GA, 30332-0250, USA
Xinwei Chen, Yorai Wardi & Sudhakar Yalamanchili

Corresponding author

Correspondence to Yorai Wardi.

Additional information

This article belongs to the Topical Collection: Special Issue on Performance Analysis and Optimization of Discrete Event Systems

Guest Editors: Christos G. Cassandras and Alessandro Giua

Appendix

This section provides a quantitative description of the instruction flow in the OOO-cache high-level model described at the beginning of Section 3.

Denote by $I_i$, $i = 1, 2, \ldots$, the instructions arriving at the instruction queue, in increasing order. Let $u$ denote the clock rate, or frequency, and let $\tau := u^{-1}$ be the clock cycle. Denote by $a_i(\tau)$ the arrival time of $I_i$ relative to the arrival time of $I_1$, so that $a_1(\tau) := 0$, and let $\xi_i$ be the clock counter at which $I_i$ arrives. Then $a_i(\tau) = \xi_i \tau$. Denote by $\alpha_i(\tau)$ the time at which execution of $I_i$ starts, and by $\beta_i(\tau)$ the time at which execution of $I_i$ ends.

We next describe a way to compute $\alpha_i(\tau)$. Consider first the case where $I_i$ is a computational instruction. If all of its required variables are available at its arrival time, then $\alpha_i(\tau) = a_i(\tau) + \tau$. On the other hand, if $I_i$ has to wait for such variables, let $k(i)$ denote the index (counter) of the last instruction to provide such a variable; then $\alpha_i(\tau) = \beta_{k(i)}(\tau) + \tau$. Next, if $I_i$ is a memory instruction, then $\alpha_i(\tau)$ is the time at which it starts a cache access. If the memory queue is not full at time $a_i(\tau)$, then $\alpha_i(\tau) = a_i(\tau) + \tau$. On the other hand, if the memory queue is full at time $a_i(\tau)$, let $\ell(i)$ denote the index of the instruction at the head of the queue; then $\alpha_i(\tau) = \beta_{\ell(i)}(\tau) + \tau$.

To compute $\beta_i(\tau)$, consider first the case where $I_i$ is a computational instruction. Let $\mu_i$ denote the number of clock cycles it takes to execute $I_i$. Then $\beta_i(\tau) = \alpha_i(\tau) + \mu_i \tau$. On the other hand, if $I_i$ is a memory instruction, let $\nu_i$ denote the number of clock cycles it takes to perform a cache attempt. If the cache attempt is successful and the variable is found in cache, then $\beta_i(\tau) = \alpha_i(\tau) + \nu_i \tau$. If the variable is not in cache, the instruction is directed to the memory queue. Its transfer there involves a small number of clock cycles, $m_i$, hence it arrives at the queue at time $\alpha_i(\tau) + \nu_i \tau + m_i \tau$. The memory queue is a FIFO queue whose service time represents an external-memory access, which is independent of the core’s clock. Denote by $S_i$ the sojourn time of $I_i$ at the memory queue. Then $\beta_i(\tau) = \alpha_i(\tau) + \nu_i \tau + m_i \tau + S_i + \tau$.

Finally, the departure time of $I_i$ from the instruction queue, denoted by $d_i(\tau)$, is $d_i(\tau) = \max\left\{\beta_i(\tau), d_{i-1}(\tau)\right\} + \tau$. Given a control cycle consisting of $N$ instructions, the throughput is defined as $N/d_N(\tau)$. Since $u = \tau^{-1}$, we can view the throughput as a function of $u$ and denote it by $y(u)$. A more detailed discussion of the model can be found in Wardi et al. (2016).
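The recursions for the start, end, and departure times can be traced in a few lines of code. The following is a minimal sketch under stated assumptions, not the simulator used in the paper: the `Instr` fields and function names are hypothetical, the sojourn time $S_i$ is supplied as a fixed number rather than produced by a queue simulation, a single index `wait_on` stands for either $k(i)$ or $\ell(i)$, and $d_0 := 0$ is assumed for the first instruction.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Instr:
    """One instruction of the high-level model (all field names are illustrative)."""
    xi: int                        # clock counter at arrival, so a_i = xi * tau
    is_memory: bool = False        # memory instruction vs. computational instruction
    wait_on: Optional[int] = None  # 0-based index k(i) or l(i) to wait for, if any
    mu: int = 1                    # execution cycles (computational instruction)
    nu: int = 1                    # cycles of a cache attempt (memory instruction)
    hit: bool = True               # whether the cache attempt succeeds
    m: int = 0                     # transfer cycles to the memory queue on a miss
    S: float = 0.0                 # sojourn time at the memory queue (assumed given)


def departure_times(instrs: List[Instr], tau: float) -> List[float]:
    """Return d_1, ..., d_N for clock cycle tau, following the appendix recursions."""
    beta: List[float] = []
    d: List[float] = []
    for ins in instrs:
        # Start time alpha_i: one cycle after arrival, or after the awaited instruction ends.
        if ins.wait_on is None:
            alpha = ins.xi * tau + tau
        else:
            alpha = beta[ins.wait_on] + tau
        # End time beta_i for the three cases: computation, cache hit, cache miss.
        if not ins.is_memory:
            b = alpha + ins.mu * tau
        elif ins.hit:
            b = alpha + ins.nu * tau
        else:
            b = alpha + ins.nu * tau + ins.m * tau + ins.S + tau
        beta.append(b)
        # Departure: d_i = max(beta_i, d_{i-1}) + tau, with d_0 = 0 assumed.
        prev = d[-1] if d else 0.0
        d.append(max(b, prev) + tau)
    return d


def throughput(instrs: List[Instr], u: float) -> float:
    """Throughput y(u) = N / d_N(tau) with tau = 1/u."""
    return len(instrs) / departure_times(instrs, 1.0 / u)[-1]


example = [
    Instr(xi=0, mu=2),
    Instr(xi=1, wait_on=0, mu=1),
    Instr(xi=2, is_memory=True, hit=False, nu=2, m=1, S=4.0),
    Instr(xi=3, mu=1),
]
print(departure_times(example, 0.5))  # [2.0, 3.0, 8.0, 8.5]
print(throughput(example, 2.0))
```

In this toy trace the cache miss of the third instruction dominates: the fourth instruction finishes early but departs only after its predecessor, which is the in-order retirement captured by the max term.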

Concerning the IPA derivative $\frac{\partial y}{\partial u}$, Wardi et al. (2016) describe a recursive algorithm for its computation in real time, and that algorithm was used in the simulations described in Section 3. It is based on the facts that $y = N/d_N(\tau)$ and $u = 1/\tau$, which imply (after some algebra) that

$$ \frac{\partial y}{\partial u}=\frac{1}{N}\left( \frac{y}{u}\right)^{2} \frac{\partial d_{N}}{\partial\tau}. \tag{16} $$
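The algebra behind Eq. 16 can be spelled out with the chain rule, using only the relations $y = N/d_N(\tau)$ and $\tau = 1/u$ stated in the text:

$$ \frac{\partial y}{\partial u} = \frac{\partial}{\partial\tau}\left(\frac{N}{d_{N}(\tau)}\right)\frac{d\tau}{du} = \left(-\frac{N}{d_{N}^{2}}\,\frac{\partial d_{N}}{\partial\tau}\right)\left(-\frac{1}{u^{2}}\right) = \frac{N}{d_{N}^{2}u^{2}}\,\frac{\partial d_{N}}{\partial\tau}. $$

Substituting $d_N = N/y$ gives $N/(d_N^2 u^2) = y^2/(N u^2)$, which is the coefficient $\frac{1}{N}\left(\frac{y}{u}\right)^{2}$ appearing in Eq. 16.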

Assuming that y is measured in real time, it remains to compute the term $\frac {\partial d_{N}}{\partial \tau }$ in Eq. 16. This can be done by a recursive procedure for real-time computations of the terms $\frac {\partial d_{i}}{\partial \tau }$, i = 1, 2, … , N, as described in detail in Wardi et al. (2016), pp. 24–25.
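One way to see how such a recursion can work is to differentiate the appendix equations for $\alpha_i$, $\beta_i$ and $d_i$ term by term with respect to $\tau$ and carry the derivative along with each time. The sketch below does this; it is derived here from the equations of this appendix and is not the algorithm of Wardi et al. (2016). It assumes that $S_i$ does not depend on $\tau$, that $d_0 := 0$, and that ties in the max are resolved in favor of $\beta_i$; the `Instr` fields and function names are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Instr:
    """One instruction of the high-level model (all field names are illustrative)."""
    xi: int                        # clock counter at arrival, so a_i = xi * tau
    is_memory: bool = False
    wait_on: Optional[int] = None  # 0-based index k(i) or l(i) to wait for, if any
    mu: int = 1                    # execution cycles (computational instruction)
    nu: int = 1                    # cycles of a cache attempt
    hit: bool = True
    m: int = 0                     # transfer cycles to the memory queue on a miss
    S: float = 0.0                 # sojourn time, assumed independent of tau


def departure_and_slope(instrs: List[Instr], tau: float) -> Tuple[float, float]:
    """Return (d_N, derivative of d_N with respect to tau) along one sample path."""
    beta: List[float] = []
    dbeta: List[float] = []
    d, dd = 0.0, 0.0  # d_0 = 0 and its derivative (assumption)
    for ins in instrs:
        if ins.wait_on is None:
            alpha, dalpha = ins.xi * tau + tau, ins.xi + 1
        else:
            alpha, dalpha = beta[ins.wait_on] + tau, dbeta[ins.wait_on] + 1
        # Number of tau-proportional cycles and the tau-independent part of beta_i.
        if not ins.is_memory:
            cycles, extra = ins.mu, 0.0
        elif ins.hit:
            cycles, extra = ins.nu, 0.0
        else:
            cycles, extra = ins.nu + ins.m + 1, ins.S
        b, db = alpha + cycles * tau + extra, dalpha + cycles
        beta.append(b)
        dbeta.append(db)
        # d_i = max(beta_i, d_{i-1}) + tau: the derivative follows the active branch.
        if b >= d:
            d, dd = b + tau, db + 1
        else:
            d, dd = d + tau, dd + 1
    return d, dd


def ipa_dy_du(instrs: List[Instr], u: float) -> float:
    """Evaluate Eq. 16: dy/du = (1/N) * (y/u)^2 * (derivative of d_N w.r.t. tau)."""
    n = len(instrs)
    d_n, slope = departure_and_slope(instrs, 1.0 / u)
    y = n / d_n
    return (1.0 / n) * (y / u) ** 2 * slope


example = [
    Instr(xi=0, mu=2),
    Instr(xi=1, wait_on=0, mu=1),
    Instr(xi=2, is_memory=True, hit=False, nu=2, m=1, S=4.0),
    Instr(xi=3, mu=1),
]
print(departure_and_slope(example, 0.5))  # (8.5, 9)
print(ipa_dy_du(example, 2.0))
```

For this toy sequence $d_N(\tau) = 9\tau + 4$ near $\tau = 0.5$, so $y(u) = 4u/(9 + 4u)$ and the value returned by `ipa_dy_du` at $u = 2$ can be checked against the closed form $36/(9 + 4u)^2$.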

About this article

Cite this article

Chen, X., Wardi, Y. & Yalamanchili, S. Instruction-throughput regulation in computer processors with data-center applications. Discrete Event Dyn Syst 28, 127–158 (2018). https://doi.org/10.1007/s10626-017-0254-9

Received: 26 September 2016
Accepted: 11 July 2017
Published: 22 August 2017
Issue Date: March 2018
DOI: https://doi.org/10.1007/s10626-017-0254-9
