
Instruction-throughput regulation in computer processors with data-center applications

Discrete Event Dynamic Systems

Abstract

This paper tests a recently proposed technique for regulating the output performance of Discrete Event Dynamic Systems and Stochastic Hybrid Systems. The controller is an integrator with a variable gain, adjusted so as to guarantee wide stability margins of the closed-loop system. The gain is adjusted by estimating, in real time, the derivative of the plant function via approximations to its IPA (infinitesimal perturbation analysis) derivative. The technique is robust to computational errors in the loop, and hence these approximations are designed for fast computation rather than precision. The development of the regulation technique was motivated by applications in computer processors, and the technique has been extensively tested in the past on a cycle-level, full-system simulator. In this paper we describe implementations of the regulator on an Intel machine based on the Haswell processor, and apply it to control the instruction throughput of various industry benchmark programs as well as data-center applications.


Notes

  1. The plant can be stochastic or deterministic, and correspondingly the plant function is either a realization of a random function or a deterministic function. This point is clarified in the later discussion.

  2. The term “drastic change” is not used here in a stochastic sense, as in singularly perturbed processes (Levy and Vázquez-Abad 2010), where it refers to stochastic processes that change abruptly between stationary regimes. We use the term simply to mean a change in the plant-system’s input-output characteristics.

  3. Typically a core is dedicated to the processing of a program or a thread, namely a subprogram, as determined by the programmer or the operating system. In the forthcoming discussion we will use the term program to designate a thread as well.

  4. Modern microprocessors include many hardware counters that record the occurrences of various events during program executions. Examples of such events include i) completion of the execution of an integer instruction, ii) a cache miss, or iii) an instruction that accesses memory. The Performance Application Programming Interface (PAPI) is a publicly available software infrastructure for accessing these performance counters during program execution.

References

  • Almoosa N, Song W, Wardi Y, Yalamanchili S (2012a) A power capping controller for multicore processors. In: Proceedings of the 2012 American control conference. Montreal

  • Almoosa N, Song W, Wardi Y, Yalamanchili S (2012b) Throughput regulation in multicore processors via IPA. In: Proceedings of the 51st IEEE Conference on decision and control (CDC). Maui

  • Bauer M, Pacher M, Brinkschulte U (2010) A chip-size evaluation of a multi-threaded processor enhanced with a PID controller. In: Proceedings of the 8th IFIP workshop on software technologies for future embedded and ubiquitous systems (SEUS 2010). Waidhofen

  • Brinkschulte U, Pacher M (2009) A theoretical examination of a self-adaptation approach to improve the real-time capabilities in multi-threaded microprocessors. In: Proceedings of the 2009 Third IEEE international conference on self-adaptive and self-organizing systems. San Francisco

  • Browne S, Dongarra J, Garner N, Ho G, Mucci P (2000) A portable programming interface for performance evaluation on modern processors. Int J High Perform Comput Appl 14(3):189–204

  • Cassandras CG (2006) Stochastic flow systems: modeling and sensitivity analysis. In: Cassandras CG, Lygeros J (eds) Stochastic hybrid systems: recent developments and research trends. CRC Press, New York, pp 137–165

  • Cassandras CG, Lafortune S (2008) Introduction to discrete event systems, 2nd edn. Springer

  • Cassandras CG, Wardi Y, Melamed B, Sun G, Panayiotou CG (2002) Perturbation analysis for on-line control and optimization of stochastic fluid models. IEEE Trans Autom Control 47(8):1234–1248

  • Cassandras CG, Wardi Y, Panayiotou CG, Yao C (2010) Perturbation analysis and optimization of stochastic hybrid systems. Eur J Control 16:642–664

  • Chen X, Xiao H, Wardi Y, Yalamanchili S (2015) Throughput regulation in shared memory multicore processors. In: Proceedings of the 22nd IEEE Intl. conference on high performance computing (HiPC). Bengaluru

  • Chen X, Wardi Y, Yalamanchili S (2016) IPA in the loop: control design for throughput regulation in computer processors. In: Proceedings of the 13th international workshop on discrete event systems (WODES 2016). Xi’an

  • Franklin GF, Powell JD, Emami-Naeini A (2015) Feedback control of dynamic systems. Prentice Hall

  • Hammarlund P, Martinez AJ, Bajwa AA, Hill DL, Hallnor E, Jiang H, Dixon M, Derr M, Hunsaker M, Kumar R, Osborne RB, Rajwar R, Singhal R, D’Sa R, Chappell R, Kaushik S, Chennupaty S, Jourdan S, Gunther S, Piazza T, Butron T (2014) Haswell: the fourth-generation Intel Core processor. IEEE Micro 34(2):6–20

  • Hennessy JL, Patterson DA (2012) Computer architecture: a quantitative approach. Morgan Kaufmann

  • Ho YC, Cao XR (1991) Perturbation analysis of discrete event dynamic systems. Kluwer Academic Publishers, Boston

  • Lancaster P (1966) Error analysis for the Newton-Raphson method. Numer Math 9:55–68

  • Levy K, Vázquez-Abad FJ (2010) Change-point monitoring for online stochastic approximations. Automatica 46:1657–1674

  • Lohn D, Pacher M, Brinkschulte U (2011) A generalized model to control the throughput in a processor for real-time applications. In: 2011 14th IEEE International symposium on object/component/service-oriented real-time distributed computing. Newport Beach

  • Nai L, Xia Y, Tanase IG, Kim H, Lin CY (2015) GraphBIG: understanding graph computing in the context of industrial solutions. In: SC15 proceedings of the international conference for high performance computing, networking, storage and analysis. Austin

  • Tanase IG, Xia Y, Nai L, Liu Y, Tan W, Crawford J, Lin C-Y (2014) A highly efficient runtime and graph library for large scale graph analytics. In: GRADES14 proceedings of workshop on graph data management experiences and systems. Utah

  • Wang J, Beu J, Bheda R, Conte T, Dong Z, Kersey C, Rasquinha M, Riley G, Song W, Xiao H, Xu P, Yalamanchili S (2014) Manifold: a parallel simulation framework for multicore systems. In: Proceedings of the IEEE International symposium on performance evaluation of systems and software (ISPASS)

  • Wardi Y, Seatzu C (2017) Performance regulation in discrete event and hybrid dynamical systems using IPA. Eur J Control 36:51–61. doi:10.1016/j.ejcon.2017.02.004

  • Wardi Y, Seatzu C, Chen X, Yalamanchili S (2016) Performance regulation of event-driven dynamical systems using infinitesimal perturbation analysis. Nonlinear Anal Hybrid Syst 22:116–136. Also in arXiv:1601.03799v1 [math.OC]

  • Woo SC, Ohara M, Torrie E, Singh JP, Gupta A (1995) The SPLASH-2 programs: characterization and methodological considerations. In: Proceedings of the ISCA 22nd annual international symposium on computer architectures (ISCA’95). Santa Margherita Ligure

Acknowledgments

Research supported in part by the NSF under Grant Number CNS-1239225.

Author information

Corresponding author

Correspondence to Yorai Wardi.

Additional information

This article belongs to the Topical Collection: Special Issue on Performance Analysis and Optimization of Discrete Event Systems

Guest Editors: Christos G. Cassandras and Alessandro Giua

Appendix

This section provides a quantitative description of the instruction flow in the OOO-cache high-level model described at the beginning of Section 3.

Denote by \(I_{i}\), \(i=1,2,\ldots\), the instructions arriving at the instruction queue, indexed in increasing order of arrival. Let \(u\) denote the clock rate, or frequency, and let \(\tau := u^{-1}\) be the clock cycle. Denote by \(a_{i}(\tau)\) the arrival time of \(I_{i}\) relative to the arrival time of \(I_{1}\), namely \(a_{1}(\tau) := 0\), and let \(\xi_{i}\) be the clock counter at which \(I_{i}\) arrives. Then \(a_{i}(\tau) = \xi_{i}\tau\). Denote by \(\alpha_{i}(\tau)\) the time at which execution of \(I_{i}\) starts, and let \(\beta_{i}(\tau)\) denote the time at which execution of \(I_{i}\) ends.

We next describe a way to compute \(\alpha_{i}(\tau)\). Consider first the case where \(I_{i}\) is a computational instruction. If all of its required variables are available at its arrival time, then \(\alpha_{i}(\tau) = a_{i}(\tau) + \tau\). On the other hand, if \(I_{i}\) has to wait for such variables, let \(k(i)\) denote the index (counter) of the instruction that is last to provide such a variable; then \(\alpha_{i}(\tau) = \beta_{k(i)}(\tau) + \tau\). Next, if \(I_{i}\) is a memory instruction, then \(\alpha_{i}(\tau)\) is the time at which it starts a cache access. If the memory queue is not full at time \(a_{i}(\tau)\), then \(\alpha_{i}(\tau) = a_{i}(\tau) + \tau\). On the other hand, if the memory queue is full at time \(a_{i}(\tau)\), let \(\ell(i)\) denote the index of the instruction at the head of the queue; then \(\alpha_{i}(\tau) = \beta_{\ell(i)}(\tau) + \tau\).

To compute \(\beta_{i}(\tau)\), consider first the case where \(I_{i}\) is a computational instruction. Let \(\mu_{i}\) denote the number of clock cycles it takes to execute \(I_{i}\). Then \(\beta_{i}(\tau) = \alpha_{i}(\tau) + \mu_{i}\tau\). On the other hand, if \(I_{i}\) is a memory instruction, let \(\nu_{i}\) denote the number of clock cycles it takes to perform a cache attempt. If the cache attempt is successful and the variable is found in cache, then \(\beta_{i}(\tau) = \alpha_{i}(\tau) + \nu_{i}\tau\). If the variable is not in cache, the instruction is directed to the memory queue. Its transfer there takes a small number of clock cycles, \(m_{i}\), hence it arrives at the queue at time \(\alpha_{i}(\tau) + \nu_{i}\tau + m_{i}\tau\). The memory queue is a FIFO queue whose service time represents an external-memory access, which is independent of the core’s clock. Denote by \(S_{i}\) the sojourn time of \(I_{i}\) at the memory queue. Then \(\beta_{i}(\tau) = \alpha_{i}(\tau) + \nu_{i}\tau + m_{i}\tau + S_{i} + \tau\).

Finally, the departure time of \(I_{i}\) from the instruction queue, denoted by \(d_{i}(\tau)\), is \(d_{i}(\tau)=\max\left\{\beta_{i}(\tau),d_{i-1}(\tau)\right\}+\tau\). Given a control cycle consisting of \(N\) instructions, the throughput is defined as \(N/d_{N}(\tau)\). Since \(u = \tau^{-1}\), we can view the throughput as a function of \(u\) and denote it by \(y(u)\). A more detailed discussion of the model can be found in Wardi et al. (2016).
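The recursions for the start, end, and departure times can be traced in a few lines of code. The following Python sketch is only an illustration under simplifying assumptions that are not part of the model description: the memory queue is assumed never to be full, the sojourn time of each cache miss is supplied as an input rather than simulated, and the departure time preceding the first instruction is taken to be zero. The names `Instr`, `departure_times`, and `throughput` are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence


@dataclass
class Instr:
    """One instruction I_i; all field names are illustrative assumptions."""
    xi: int                     # clock counter at arrival, so a_i = xi * tau
    is_memory: bool = False     # memory instruction vs. computational one
    dep: Optional[int] = None   # k(i): index of the last producing instruction
    mu: int = 1                 # execution cycles (computational instruction)
    nu: int = 2                 # cycles per cache attempt (memory instruction)
    hit: bool = True            # whether the cache attempt succeeds
    m: int = 1                  # cycles to move a miss to the memory queue
    sojourn: float = 0.0        # S_i, in seconds, independent of the clock


def departure_times(instrs: Sequence[Instr], tau: float) -> List[float]:
    """Return d_i(tau) for every instruction, following the recursions."""
    beta: List[float] = []
    d: List[float] = []
    for ins in instrs:
        a = ins.xi * tau
        if not ins.is_memory and ins.dep is not None:
            # max() covers both cases: operands ready at arrival (a + tau),
            # or the instruction waits for its producer (beta_k(i) + tau).
            alpha = max(a, beta[ins.dep]) + tau
        else:
            # Assumption: the memory queue is never full.
            alpha = a + tau
        if not ins.is_memory:
            b = alpha + ins.mu * tau
        elif ins.hit:
            b = alpha + ins.nu * tau
        else:
            b = alpha + ins.nu * tau + ins.m * tau + ins.sojourn + tau
        prev_d = d[-1] if d else 0.0  # assumption: d_0 := 0
        beta.append(b)
        d.append(max(b, prev_d) + tau)
    return d


def throughput(instrs: Sequence[Instr], u: float) -> float:
    """y(u) = N / d_N(tau) with tau = 1/u."""
    return len(instrs) / departure_times(instrs, 1.0 / u)[-1]
```

For instance, with `tau = 1.0`, a program consisting of one independent computational instruction, one dependent computational instruction, and one cache miss with a sojourn time of 10 yields departure times 3, 5, and 18, so a single miss dominates the control cycle.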

Concerning the IPA derivative \(\frac {\partial y}{\partial u}\), Wardi et al. (2016) describe a recursive algorithm for its computation in real time, and that algorithm was used in the simulations described in Section 3. It is based on the facts that \(y = N/d_{N}(\tau)\) and \(u = 1/\tau\), which imply (after some algebra) that

$$ \frac{\partial y}{\partial u}=\frac{1}{N}\left( \frac{y}{u}\right)^{2} \frac{\partial d_{N}}{\partial\tau}. $$
(16)
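The algebra behind Eq. 16 is an application of the chain rule. As a sketch, using only the definitions \(y = N/d_{N}(\tau)\) and \(\tau = 1/u\), so that \(\frac{d\tau}{du} = -\frac{1}{u^{2}}\):

$$ \frac{\partial y}{\partial u} = \frac{\partial}{\partial\tau}\left(\frac{N}{d_{N}(\tau)}\right)\frac{d\tau}{du} = \left(-\frac{N}{d_{N}^{2}}\frac{\partial d_{N}}{\partial\tau}\right)\left(-\frac{1}{u^{2}}\right) = \frac{1}{N}\left(\frac{N}{d_{N}}\right)^{2}\frac{1}{u^{2}}\frac{\partial d_{N}}{\partial\tau} = \frac{1}{N}\left(\frac{y}{u}\right)^{2}\frac{\partial d_{N}}{\partial\tau}. $$

The last equality substitutes \(y = N/d_{N}\).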

Assuming that \(y\) is measured in real time, it remains to compute the term \(\frac {\partial d_{N}}{\partial \tau }\) in Eq. 16. This can be done by a recursive procedure for real-time computations of the terms \(\frac {\partial d_{i}}{\partial \tau }\), \(i = 1, 2, \ldots, N\), as described in detail in Wardi et al. (2016), pp. 24–25.
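Eq. 16 can be checked numerically on a toy example. The Python sketch below does not reproduce the recursive procedure of Wardi et al. (2016); it assumes a hypothetical departure-time model \(d_{N}(\tau) = c\,\tau + S\), with \(c\) clock cycles of core-bound work and a clock-independent memory time \(S\), for which \(\frac{\partial d_{N}}{\partial\tau} = c\) is known in closed form. The result of Eq. 16 is then compared with a central finite difference of \(y(u)\). All function names and numerical values are illustrative assumptions.

```python
def throughput_derivative(y: float, u: float, n: int, dd_dtau: float) -> float:
    """Eq. 16: dy/du = (1/N) * (y/u)^2 * (d d_N / d tau)."""
    return (y / u) ** 2 * dd_dtau / n


# Hypothetical toy plant: d_N(tau) = cycles * tau + mem_time.
CYCLES = 1500.0      # clock cycles of core-bound work in the control cycle
MEM_TIME = 2e-7      # seconds of clock-independent memory time
N_INSTR = 1000       # instructions per control cycle


def toy_departure_time(tau: float) -> float:
    return CYCLES * tau + MEM_TIME


def toy_throughput(u: float) -> float:
    return N_INSTR / toy_departure_time(1.0 / u)


u = 2e9                                   # clock rate in Hz
y = toy_throughput(u)                     # "measured" throughput
ipa = throughput_derivative(y, u, N_INSTR, dd_dtau=CYCLES)

h = 1e-6 * u                              # finite-difference step
fd = (toy_throughput(u + h) - toy_throughput(u - h)) / (2.0 * h)

print(f"Eq. 16: {ipa:.6f}   finite difference: {fd:.6f}")
```

In this toy model the two values agree closely, which is consistent with Eq. 16 requiring only the measured throughput, the current clock rate, and the sensitivity of the last departure time to the clock cycle.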

Cite this article

Chen, X., Wardi, Y. & Yalamanchili, S. Instruction-throughput regulation in computer processors with data-center applications. Discrete Event Dyn Syst 28, 127–158 (2018). https://doi.org/10.1007/s10626-017-0254-9
