Perphon: a ML-based Agent for Workload Co-location via Performance Prediction and Resource Inference

Authors:
Jianyong Zhu, Beihang University
Renyu Yang, Edgetic Ltd. UK and University of Leeds
Chunming Hu, Beihang University and China National Key R&D Program (2016YFB1000503)
Tianyu Wo, Beihang University
Shiqing Xue, Beihang University
Jin Ouyang, Alibaba Group
Jie Xu, Edgetic Ltd. UK and University of Leeds

SoCC '19: Proceedings of the ACM Symposium on Cloud Computing, November 2019, Pages 478
https://doi.org/10.1145/3357223.3365440

Published: 20 November 2019

ABSTRACT

Cluster administrators face great pressure to improve cluster utilization through workload co-location. Guaranteeing the performance of long-running applications (LRAs), however, remains an open problem, because unpredictable interference across applications can be catastrophic to QoS [2]. Current solutions such as [1] usually profile different workload combinations offline in a sandbox and use the resulting profiles to predict incoming interference. However, the time cost of such profiling restricts its applicability to complex co-locations. The problem therefore calls for a new framework that harnesses runtime performance and mitigates the time cost with machine intelligence: i) It is desirable to explore a quantitative relationship between allocated resources and the consequent workload performance, rather than relying on analysis of the interference derived from different workload combinations. Most existing works, however, depend on offline profiling and training, which may lead to a model aging problem. Moreover, resource dimensions (e.g., LLC contention) that existing works do not fully cover but that affect performance interference need to be considered [3]. ii) Workload co-location also necessitates fine-grained isolation and access control mechanisms. Once performance degradation is detected, dynamic resource adjustment is enforced and the application is assigned access to specific slices of each resource. Inferring a "just enough" amount of resource adjustment secures application performance whilst improving cluster utilization.

We present Perphon, a per-node runtime agent that decouples ML-based performance prediction and resource inference from the centralized scheduler. Figure 1 outlines the proposed architecture. We first exploit the sensitivity of applications to multiple resources to establish performance prediction. To achieve this, the Metric Monitor aggregates application fingerprints and system-level performance metrics, including CPU, memory, Last Level Cache (LLC), memory bandwidth (MBW) and the number of running threads. These metrics are enabled by Intel-RDT and obtained precisely from the resource group manager. Perphon employs an Online Gradient Boost Regression Tree (OGBRT) approach to resolve the model aging problem. The Res-Perf Model warms up via offline learning that relies on only a small volume of profiling in the early stage, and then evolves as workloads arrive. Consequently, its parameters are automatically updated and synchronized among agents.
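
The warm-up-then-evolve behaviour of a resource-to-performance model can be illustrated with a small sketch. It is not the authors' OGBRT implementation: it assumes a toy boosted ensemble of one-split regression trees that is first fitted on a small offline profile and later extended with extra boosting rounds on newly observed samples. The feature layout (cpu_cores, llc_ways), the sample values, and the names Stump, fit_stump, OnlineBoostedModel and mse are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Sequence

Sample = Sequence[float]  # hypothetical features, e.g. (cpu_cores, llc_ways)


@dataclass
class Stump:
    """One-split regression tree: the weak learner of the ensemble."""
    feature: int
    threshold: float
    left: float   # value returned when x[feature] <= threshold
    right: float  # value returned otherwise

    def predict(self, x: Sample) -> float:
        return self.left if x[self.feature] <= self.threshold else self.right


def fit_stump(X: List[Sample], residuals: List[float]) -> Stump:
    """Choose the split that minimizes squared error on the residuals."""
    best_sse, best = float("inf"), None
    for f in range(len(X[0])):
        for t in sorted({x[f] for x in X}):
            left = [r for x, r in zip(X, residuals) if x[f] <= t]
            right = [r for x, r in zip(X, residuals) if x[f] > t]
            if not left or not right:
                continue
            lm, rm = sum(left) / len(left), sum(right) / len(right)
            sse = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
            if sse < best_sse:
                best_sse, best = sse, Stump(f, t, lm, rm)
    if best is None:  # no valid split: fall back to a constant prediction
        mean = sum(residuals) / len(residuals)
        best = Stump(0, float("inf"), mean, mean)
    return best


@dataclass
class OnlineBoostedModel:
    """Boosted stumps that can keep growing as new samples arrive."""
    learning_rate: float = 0.5
    stumps: List[Stump] = field(default_factory=list)

    def predict(self, x: Sample) -> float:
        return sum(self.learning_rate * s.predict(x) for s in self.stumps)

    def partial_fit(self, X: List[Sample], y: List[float], rounds: int = 20) -> None:
        """Add `rounds` stumps, each fitted to the current residuals on (X, y)."""
        for _ in range(rounds):
            residuals = [yi - self.predict(xi) for xi, yi in zip(X, y)]
            self.stumps.append(fit_stump(X, residuals))


def mse(model: OnlineBoostedModel, X: List[Sample], y: List[float]) -> float:
    return sum((yi - model.predict(xi)) ** 2 for xi, yi in zip(X, y)) / len(y)


# Offline warm-up on a small (made-up) profile, then an online update.
model = OnlineBoostedModel()
model.partial_fit([(1.0, 2.0), (2.0, 2.0), (4.0, 4.0), (8.0, 8.0)],
                  [20.0, 30.0, 60.0, 120.0])
model.partial_fit([(2.0, 2.0), (4.0, 4.0)], [45.0, 80.0], rounds=5)
```

Appending weak learners on fresh samples is only one way to keep such a model from aging; the abstract does not say how OGBRT updates its trees or how parameters are synchronized among agents, so neither is modelled here.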

The Anomaly Detector pinpoints performance degradation in a timely manner via LSTM time-series analysis and determines when, and for which application, resources need to be re-allocated. Once an abnormal performance counter or load is detected, the Resource Inferer conducts a gradient-ascent-based inference to work out a proper slice of resources for dynamically recovering the targeted performance. Upon receiving an updated re-allocation, the Access Controller re-assigns a specific portion of the node resources to the affected application. Finally, the Isolation Executor enforces the resource manipulation and ensures performance isolation across applications. Specifically, we use the cgroup cpuset and memory subsystems to control the usage of CPU and memory, and leverage Intel-RDT technology to manipulate LLC and MBW. For fine-grained management, we create different groups for LRAs and batch jobs when the agent starts. Our prototype integration with the Node Manager of Apache YARN shows that the throughput of a Kafka data-streaming application under Perphon is 2.0x and 1.82x that of the isolation execution schemes in native YARN and the pure cgroup cpu subsystem, respectively.
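
The idea of inferring a "just enough" allocation by gradient ascent can be sketched as follows. This is an illustration under stated assumptions, not the Resource Inferer itself: the performance model toy_predict is a made-up smooth function with diminishing returns, gradients are estimated by finite differences, and the resource names, step size and capacities are hypothetical. A tree-based predictor is piecewise constant, so a real system would need some other way to obtain a usable gradient; the abstract does not describe it.

```python
import math
from typing import Callable, Dict

Alloc = Dict[str, float]  # hypothetical resource vector, e.g. cores and LLC ways


def toy_predict(alloc: Alloc) -> float:
    """Made-up stand-in for a learned model: throughput with diminishing returns."""
    return 20.0 * math.sqrt(alloc["cpu_cores"]) + 10.0 * math.sqrt(alloc["llc_ways"])


def numeric_gradient(predict: Callable[[Alloc], float], alloc: Alloc,
                     eps: float = 1e-3) -> Alloc:
    """Central finite-difference estimate of d(performance)/d(resource)."""
    grad = {}
    for key in alloc:
        up, down = dict(alloc), dict(alloc)
        up[key] += eps
        down[key] -= eps
        grad[key] = (predict(up) - predict(down)) / (2 * eps)
    return grad


def infer_allocation(predict: Callable[[Alloc], float], current: Alloc,
                     capacity: Alloc, target: float,
                     step: float = 0.5, max_iters: int = 200) -> Alloc:
    """Climb the predicted-performance surface and stop once the target is met."""
    alloc = dict(current)
    for _ in range(max_iters):
        if predict(alloc) >= target:
            break  # "just enough": do not grow the slice past the target
        grad = numeric_gradient(predict, alloc)
        for key, g in grad.items():
            # Move along the gradient, clamped to what the node can offer.
            alloc[key] = min(capacity[key], max(0.0, alloc[key] + step * g))
    return alloc


current = {"cpu_cores": 2.0, "llc_ways": 2.0}
capacity = {"cpu_cores": 16.0, "llc_ways": 11.0}
new_alloc = infer_allocation(toy_predict, current, capacity, target=70.0)
```

Stopping as soon as the predicted performance reaches the target keeps the extra resources small, which leaves the remainder of the node available to co-located batch jobs. Turning the resulting vector into cgroup and Intel-RDT settings would need rounding to whole cores and cache ways, which the sketch omits.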

References

[1] Christina Delimitrou and Christos Kozyrakis. 2013. Paragon: QoS-aware scheduling for heterogeneous datacenters. In ACM ASPLOS 2013.
[2] Jacob Leverich and Christos Kozyrakis. 2014. Reconciling high server utilization and sub-millisecond quality-of-service. In ACM EuroSys 2014.
[3] Rathijit Sen and Karthik Ramachandra. 2018. Characterizing resource sensitivity of database workloads. In IEEE HPCA 2018.

Index Terms

1. Computer systems organization
  1. Architectures
    1. Distributed architectures
      1. Cloud computing
2. Software and its engineering
  1. Software organization and properties
    1. Extra-functional properties
      1. Software performance

Published in

SoCC '19: Proceedings of the ACM Symposium on Cloud Computing
November 2019
503 pages
ISBN: 9781450369732
DOI: 10.1145/3357223

Copyright © 2019 Owner/Author
Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the Owner/Author.
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
Performance Modelling
Resource Inference
Qualifiers
- abstract
- Research
- Refereed limited

Acceptance Rates
SoCC '19 Paper Acceptance Rate: 39 of 157 submissions, 25%
Overall Acceptance Rate: 169 of 722 submissions, 23%