
Scheduling for Better Energy Efficiency on Many-Core Chips

  • Conference paper
  • In: Job Scheduling Strategies for Parallel Processing (JSSPP 2015, JSSPP 2016)

Abstract

Many-core chips are especially attractive for data center operators providing cloud computing service models. With the advance of many-core chips in such environments, energy-conscious scheduling of independent processes or operating systems (OSes) is gaining importance. An important research question is how the scheduler of such a system should assign cores to OSes in order to achieve better energy utilization. In this paper, we demonstrate that many-core chips offer new opportunities for extremely lightweight migration of independent processes (or OSes) running bare-metal on the many-core chip. We then show how this intra-chip migration can be utilized to achieve a better performance-per-watt ratio by implementing a hierarchical power-management scheme on top of dynamic voltage and frequency scaling (DVFS). We have implemented and tested the proposed techniques on the Intel Single-chip Cloud Computer (SCC). Combining migration with DVFS, we achieve, on average, 25–35% better performance per watt than a DVFS-only solution.
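To put the reported gains in context: to first order, dynamic CMOS power follows P_dyn ≈ C·V²·f, so if delivered performance tracks frequency, performance per watt scales as 1/(C·V²) and improves whenever the voltage can be lowered together with the frequency. The sketch below illustrates this relation with made-up operating points; it is a toy model, not the paper's measured SCC data, and it ignores static power.

```python
# Toy first-order CMOS model (P_dyn = C * V^2 * f). Operating points are
# hypothetical; this is not the paper's measured SCC data.

def dynamic_power(c: float, v: float, f_ghz: float) -> float:
    """Dynamic power in arbitrary units: P = C * V^2 * f."""
    return c * v ** 2 * f_ghz

def perf_per_watt(f_ghz: float, v: float, c: float = 1.0) -> float:
    """Performance is taken to be proportional to clock frequency."""
    return f_ghz / dynamic_power(c, v, f_ghz)

hi = perf_per_watt(f_ghz=1.6, v=1.1)   # hypothetical high voltage/frequency point
lo = perf_per_watt(f_ghz=0.8, v=0.8)   # hypothetical scaled-down point
print(f"{lo / hi:.2f}x better performance per watt at the lower point")
```

Static power and the cost of moving work between voltage domains, both ignored in this toy model, are what make the scheduling problem non-trivial in practice.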



Acknowledgments

This work was supported, in part, by BK21 Plus for Pioneers in Innovative Computing (Dept. of Computer Science and Engineering, SNU) funded by the National Research Foundation (NRF) of Korea (Grant 21A20151113068), the Basic Science Research Program through NRF funded by the Ministry of Science, ICT & Future Planning (Grants NRF-2015K1A3A1A14021288 and NRF-2008-0062609), and by the Promising-Pioneering Researcher Program through Seoul National University in 2015. ICT at Seoul National University provided research facilities for this study.

Author information

Correspondence to Bernhard Egger.

Appendix

A Profiled Workload Benchmark Scenarios

This appendix describes the details of the benchmarks evaluated in this work. Each benchmark scenario consists of two parts:

  • Two or more workload patterns that describe how the load changes over time.

  • An initial assignment of the workloads to the 48 cores of the Intel SCC used in the experiments.

Each workload pattern (WL), denoted S{1–7} in the tables below, lists the CPU load for every epoch (10 or 15 s, depending on the benchmark) over the duration of one period (300 s). A workload never stops; it repeats its pattern period after period. Note that all workloads are pure CPU-based workloads; memory-based workloads are left for future work.
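Read concretely, a workload pattern is a cyclic list of per-epoch load levels. The following sketch is hypothetical bookkeeping (not the tooling used on the SCC) showing how a pattern such as S1 from Appendix A.1 below is replayed:

```python
from dataclasses import dataclass
from typing import List

@dataclass
class WorkloadPattern:
    """One pattern (e.g. S1): CPU load in percent for each epoch, cycled forever."""
    name: str
    epoch_s: int        # epoch length in seconds (10 or 15 in these benchmarks)
    loads: List[int]    # one load level per listed epoch

    def load_at(self, t_s: float) -> int:
        """Load demanded at time t_s; the listed epochs repeat indefinitely."""
        epoch = int(t_s // self.epoch_s) % len(self.loads)
        return self.loads[epoch]

# The synthetic S1 pattern from Appendix A.1 (21 listed epochs of 15 s).
s1 = WorkloadPattern("S1", 15, [95, 95, 10, 10] * 5 + [10])

assert s1.load_at(0) == 95     # epoch 0
assert s1.load_at(50) == 10    # 50 s falls into epoch 3
assert s1.load_at(320) == 95   # past the last listed epoch, wraps back to epoch 0
```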

The core assignment tables below show which workload pattern is assigned to which core when the experiment starts. In our setup, voltage domain 3 (vdom3) runs various logging and monitoring services and is thus not available for user benchmarks. The power measurements nevertheless include the power consumed by vdom3, because power is reported only for the entire chip, not for individual voltage domains.

A benchmark ends after a predefined number of seconds (300 s in our experiments). The total progress of each workload is measured externally and thus includes all overheads caused by migration, voltage changes, or slowdowns due to frequency settings that are too low.
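Because power is reported only for the whole chip, the performance-per-watt figure of a run boils down to total externally measured progress divided by average chip power. A minimal post-processing sketch with placeholder numbers (not measurements from the paper):

```python
from typing import List

def performance_per_watt(progress: float, power_samples_w: List[float]) -> float:
    """progress: externally measured work units, overheads already included.
    power_samples_w: whole-chip power samples (vdom3 included)."""
    avg_power_w = sum(power_samples_w) / len(power_samples_w)
    return progress / avg_power_w

# Placeholder values for one hypothetical 300 s run.
ppw = performance_per_watt(progress=1.2e6, power_samples_w=[62.0, 58.5, 60.2])
print(f"performance per watt: {ppw:.0f} work units/W")
```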

A.1 Synthetic Benchmark Scenario Based on Periodic Workloads

The synthetic benchmark consists of two identical workload patterns shifted in time. Each voltage domain contains workloads of both patterns. The purpose of this benchmark is to demonstrate the potential of combining DVFS with OS migration. The results of this benchmark are shown in Fig. 5.
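A quick calculation shows where that potential comes from: a voltage domain must be clocked for its busiest resident workload, so mixing the time-shifted S1 and S2 in every domain keeps each domain near the top frequency step most of the time, whereas migrating OSes so that each domain hosts phase-aligned workloads substantially reduces the busy time. This is an illustration under that assumption, not the scheduler's actual policy:

```python
# S1 and S2 as listed in the table below: identical square waves shifted by one
# 15 s epoch.
S1 = [95, 95, 10, 10] * 5 + [10]
S2 = [10, 95, 95, 10] * 5 + [10]

# A domain's required frequency step is driven by its busiest resident workload.
mixed = [max(a, b) for a, b in zip(S1, S2)]   # S1 and S2 sharing every domain

print(sum(l == 95 for l in mixed), "of", len(mixed), "epochs at top frequency (mixed)")
print(sum(l == 95 for l in S1), "of", len(S1), "epochs at top frequency (phase-aligned)")
# -> 15 of 21 vs. 10 of 21: consolidating phase-aligned workloads lets whole
#    domains be clocked (and volted) down far more often.
```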

Workload patterns:

Epoch (1 epoch = 15 s):

| WL | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 20 |
|----|----|----|----|----|----|----|----|----|----|----|----|----|----|----|----|----|----|----|----|----|----|
| S1 | 95 | 95 | 10 | 10 | 95 | 95 | 10 | 10 | 95 | 95 | 10 | 10 | 95 | 95 | 10 | 10 | 95 | 95 | 10 | 10 | 10 |
| S2 | 10 | 95 | 95 | 10 | 10 | 95 | 95 | 10 | 10 | 95 | 95 | 10 | 10 | 95 | 95 | 10 | 10 | 95 | 95 | 10 | 10 |

Core assignment:

| vdom0 | vdom1 | vdom3 | vdom4 | vdom5 | vdom7 |
|-------|-------|-------|-------|-------|-------|
| - | - | - | - | n/a | n/a |
| - | - | - | - | - | - |
| S2 | - | S2 | - | n/a | n/a |
| S2 | S2 | S2 | - | S2 | - |
| - | - | - | - | n/a | n/a |
| - | - | - | - | - |  |
| S1 | S2 | S1 | S1 | n/a | n/a |
| S1 | S1 | S1 | S2 | S1 | S1 |

A.2 Benchmark Scenarios Based on Profiled Workloads

The following four benchmarks are based on usage patterns profiled on Linux and Windows desktop computers. Initially, each voltage domain is loaded with different workload patterns. These benchmarks demonstrate the effect of the proposed technique in a multi-user setup (e.g., virtual desktops of employees hosted on a server machine).

The detailed results of the first benchmark are shown in Fig. 6; Table 1 lists the combined results for all four benchmark scenarios shown here.

Benchmark 1 (BM1)

Workload patterns:

Epoch (1 epoch = 10 s):

| WL | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 20 | 21 | 22 | 23 | 24 | 25 | 26 | 27 | 28 | 29 | 30 |
|----|----|----|----|----|----|----|----|----|----|----|----|----|----|----|----|----|----|----|----|----|----|----|----|----|----|----|----|----|----|----|----|
| S1 | 27 | 49 | 31 | 32 | 62 | 77 | 80 | 44 | 0 | 6 | 1 | 1 | 8 | 73 | 87 | 81 | 80 | 91 | 100 | 99 | 89 | 67 | 13 | 52 | 0 | 0 | 10 | 46 | 27 | 86 | 63 |
| S2 | 69 | 57 | 68 | 60 | 55 | 66 | 61 | 63 | 69 | 58 | 56 | 57 | 63 | 59 | 62 | 58 | 57 | 67 | 68 | 64 | 61 | 71 | 78 | 63 | 71 | 82 | 69 | 14 | 0 | 2 | 4 |
| S3 | 28 | 84 | 41 | 12 | 83 | 48 | 55 | 0 | 35 | 69 | 42 | 59 | 17 | 46 | 59 | 49 | 51 | 2 | 46 | 47 | 80 | 40 | 4 | 73 | 41 | 53 | 47 | 18 | 100 | 42 | 45 |
| S4 | 27 | 49 | 31 | 32 | 62 | 77 | 80 | 44 | 0 | 6 | 1 | 1 | 8 | 73 | 87 | 81 | 80 | 91 | 100 | 99 | 89 | 67 | 13 | 52 | 0 | 0 | 10 | 80 | 66 | 56 | 32 |
| S5 | 71 | 53 | 26 | 9 | 34 | 25 | 23 | 38 | 37 | 26 | 96 | 92 | 34 | 41 | 89 | 100 | 100 | 12 | 17 | 30 | 27 | 21 | 31 | 35 | 41 | 84 | 89 | 63 | 100 | 96 | 84 |
| S6 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 27 | 96 | 63 | 100 | 27 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| S7 | 5 | 4 | 5 | 7 | 2 | 4 | 5 | 6 | 6 | 4 | 100 | 6 | 2 | 4 | 1 | 1 | 0 | 1 | 2 | 2 | 4 | 2 | 2 | 4 | 6 | 6 | 6 | 5 | 2 | 10 | 5 |

Core assignment:

| vdom0 | vdom1 | vdom3 | vdom4 | vdom5 | vdom7 |
|-------|-------|-------|-------|-------|-------|
| S4 | S6 | S4 | S4 | n/a | n/a |
| S5 | S5 | S5 | S6 | S5 | S5 |
| S3 | S3 | S3 | S7 | n/a | n/a |
| S3 | S3 | S4 | S4 | S2 | S2 |
| S2 | S5 | S2 | S2 | n/a | n/a |
| S2 | S6 | S2 | S7 | S3 | S4 |
| S1 | S1 | S1 | S5 | n/a | n/a |
| S1 | S4 | S1 | S3 | S1 | S1 |

Benchmark 2 (BM2)

Workload patterns:

Epoch (1 epoch = 10 s):

| WL | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 20 | 21 | 22 | 23 | 24 | 25 | 26 | 27 | 28 | 29 | 30 |
|----|----|----|----|----|----|----|----|----|----|----|----|----|----|----|----|----|----|----|----|----|----|----|----|----|----|----|----|----|----|----|----|
| S1 | 27 | 49 | 31 | 32 | 62 | 77 | 80 | 44 | 0 | 6 | 1 | 1 | 8 | 73 | 87 | 81 | 80 | 91 | 100 | 99 | 89 | 67 | 13 | 52 | 0 | 0 | 10 | 46 | 27 | 86 | 63 |
| S2 | 82 | 39 | 55 | 42 | 96 | 42 | 100 | 33 | 53 | 20 | 20 | 10 | 11 | 14 | 13 | 11 | 13 | 13 | 1 | 5 | 1 | 0 | 23 | 45 | 61 | 42 | 83 | 83 | 20 | 15 | 3 |
| S3 | 28 | 84 | 41 | 12 | 83 | 48 | 55 | 0 | 35 | 69 | 42 | 59 | 17 | 46 | 59 | 49 | 51 | 2 | 46 | 47 | 80 | 40 | 4 | 73 | 41 | 53 | 47 | 18 | 100 | 42 | 45 |
| S4 | 27 | 49 | 31 | 32 | 62 | 77 | 80 | 44 | 0 | 6 | 1 | 1 | 8 | 73 | 87 | 81 | 80 | 91 | 100 | 99 | 89 | 67 | 13 | 52 | 0 | 0 | 10 | 10 | 15 | 30 | 27 |
| S5 | 71 | 53 | 26 | 9 | 34 | 25 | 23 | 38 | 37 | 26 | 96 | 92 | 34 | 41 | 89 | 100 | 100 | 12 | 17 | 30 | 27 | 21 | 31 | 35 | 41 | 84 | 89 | 63 | 100 | 96 | 84 |
| S6 | 53 | 21 | 52 | 48 | 33 | 92 | 89 | 100 | 39 | 38 | 29 | 41 | 48 | 4 | 64 | 45 | 36 | 31 | 42 | 41 | 42 | 35 | 15 | 80 | 93 | 62 | 10 | 23 | 48 | 32 | 0 |

Core assignment:

| vdom0 | vdom1 | vdom3 | vdom4 | vdom5 | vdom7 |
|-------|-------|-------|-------|-------|-------|
| - | - | - | - | n/a | n/a |
| - | - | - | - | - | - |
| S5 | S6 | S5 | S6 | n/a | n/a |
| S3 | S6 | S4 | S5 | S4 | S5 |
| - | - | - | - | n/a | n/a |
| - | - | - | - | - | - |
| S1 | S4 | S1 | S2 | n/a | n/a |
| S1 | S2 | S2 | S3 | S1 | S3 |

Benchmark 3 (BM3)

Workload patterns:

Epoch (1 epoch = 10 s):

| WL | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 20 | 21 | 22 | 23 | 24 | 25 | 26 | 27 | 28 | 29 | 30 |
|----|----|----|----|----|----|----|----|----|----|----|----|----|----|----|----|----|----|----|----|----|----|----|----|----|----|----|----|----|----|----|----|
| S1 | 42 | 77 | 25 | 11 | 34 | 36 | 30 | 14 | 33 | 26 | 22 | 58 | 100 | 52 | 30 | 13 | 15 | 0 | 21 | 39 | 48 | 43 | 40 | 41 | 40 | 42 | 41 | 40 | 39 | 36 | 35 |
| S2 | 45 | 15 | 6 | 27 | 25 | 9 | 64 | 55 | 27 | 28 | 18 | 51 | 46 | 100 | 56 | 20 | 25 | 25 | 12 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| S3 | 71 | 53 | 26 | 9 | 34 | 25 | 23 | 38 | 37 | 26 | 30 | 23 | 34 | 41 | 39 | 29 | 29 | 12 | 17 | 30 | 27 | 21 | 31 | 35 | 41 | 84 | 89 | 63 | 100 | 96 | 2 |
| S4 | 11 | 22 | 20 | 10 | 27 | 12 | 45 | 100 | 22 | 9 | 4 | 14 | 9 | 43 | 19 | 6 | 17 | 18 | 14 | 21 | 5 | 5 | 5 | 6 | 25 | 16 | 7 | 0 | 0 | 0 | 0 |
| S5 | 42 | 66 | 40 | 67 | 57 | 67 | 66 | 71 | 75 | 72 | 31 | 38 | 59 | 54 | 86 | 80 | 68 | 55 | 95 | 100 | 89 | 85 | 86 | 77 | 64 | 0 | 0 | 0 | 0 | 0 | 0 |

Core assignment:

| vdom0 | vdom1 | vdom2 | vdom4 | vdom5 | vdom7 |
|-------|-------|-------|-------|-------|-------|
| S5 | - | - | - | n/a | n/a |
| S5 | - | S5 | - | S5 | - |
| - | - | S5 | - | n/a | n/a |
| S4 | - | S4 | - | S4 | - |
| S2 | S4 | S2 | S4 | n/a | n/a |
| - | S3 | S2 | S3 | S2 | - |
| S1 | S3 | S1 | S3 | n/a | n/a |
| S1 | S2 | S1 | - | S1 | S3 |

Benchmark 4 (BM4)

Workload patterns:

Epoch (1 epoch = 10 s):

| WL | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 20 | 21 | 22 | 23 | 24 | 25 | 26 | 27 | 28 | 29 | 30 |
|----|----|----|----|----|----|----|----|----|----|----|----|----|----|----|----|----|----|----|----|----|----|----|----|----|----|----|----|----|----|----|----|
| S1 | 27 | 49 | 31 | 32 | 62 | 77 | 80 | 44 | 0 | 6 | 1 | 1 | 8 | 73 | 87 | 81 | 80 | 91 | 100 | 99 | 89 | 67 | 13 | 52 | 0 | 0 | 10 | 46 | 27 | 86 | 63 |
| S2 | 82 | 39 | 55 | 42 | 96 | 42 | 100 | 33 | 53 | 20 | 20 | 10 | 11 | 14 | 13 | 11 | 13 | 13 | 1 | 5 | 1 | 0 | 23 | 45 | 61 | 42 | 83 | 83 | 20 | 15 | 3 |
| S3 | 8 | 20 | 21 | 30 | 80 | 100 | 24 | 50 | 36 | 54 | 83 | 92 | 91 | 73 | 27 | 1 | 0 | 1 | 1 | 1 | 1 | 0 | 1 | 1 | 10 | 1 | 21 | 17 | 33 | 5 | 7 |
| S4 | 27 | 49 | 31 | 32 | 62 | 77 | 80 | 44 | 0 | 6 | 1 | 1 | 8 | 73 | 87 | 81 | 80 | 91 | 100 | 99 | 89 | 67 | 13 | 52 | 0 | 0 | 10 | 10 | 15 | 30 | 27 |
| S5 | 53 | 21 | 52 | 48 | 33 | 92 | 89 | 100 | 39 | 38 | 29 | 41 | 48 | 4 | 64 | 45 | 36 | 31 | 42 | 41 | 42 | 35 | 15 | 80 | 93 | 62 | 10 | 23 | 48 | 32 | 0 |

Core assignment:

| vdom0 | vdom1 | vdom2 | vdom4 | vdom5 | vdom7 |
|-------|-------|-------|-------|-------|-------|
| - | - | - | - | n/a | n/a |
| - | - | - | - | - | - |
| S3 | S4 | S3 | S4 | n/a | n/a |
| S3 | S4 | S3 | S4 | S3 | S4 |
| - | S5 | - | S5 | n/a | n/a |
| - | S5 | - | S5 | - | S5 |
| S1 | S2 | S1 | S2 | n/a | n/a |
| S1 | S2 | S1 | S2 | S1 | S2 |


Copyright information

© 2017 Springer International Publishing AG

About this paper

Cite this paper

Kang, C., Lee, S., Lee, YJ., Lee, J., Egger, B. (2017). Scheduling for Better Energy Efficiency on Many-Core Chips. In: Desai, N., Cirne, W. (eds) Job Scheduling Strategies for Parallel Processing. JSSPP 2015, JSSPP 2016. Lecture Notes in Computer Science, vol 10353. Springer, Cham. https://doi.org/10.1007/978-3-319-61756-5_3

  • DOI: https://doi.org/10.1007/978-3-319-61756-5_3

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-61755-8

  • Online ISBN: 978-3-319-61756-5
