Abstract:
Cloud services have shifted from monolithic designs to microservices running on cloud-native infrastructure, with monitoring systems to ensure service level agreements (SLAs). However, traditional monitoring systems no longer meet the demands of cloud-native monitoring. During Alibaba’s “double eleven” shopping festival, it was observed that the monitor occupies resources of the monitored infrastructure and even disrupts services. In this paper, we propose a novel monitoring system named Zero+ for cloud-native monitoring. Zero+ achieves zero overhead in collecting raw metrics by using one-sided remote direct memory access (RDMA) and remedies network congestion by adopting a receiver-driven flow control scheme. Zero+ also features a priority queue mechanism to meet different quality of service requirements and an efficient batch processing design to reduce CPU occupation. Zero+ has been deployed and evaluated in four different clusters with heterogeneous RDMA NIC devices and architectures in Alibaba Cloud. Results show that Zero+ incurs no CPU occupation at the monitored host and supports 1~10k hosts with a 0.1~1 s sampling interval using a single thread for network I/O. Zero+ significantly relieves the incast issue and maintains 80~95% bandwidth utilization in several clusters when monitoring 1k hosts. Zero+ also ensures that high-priority services finish collecting metrics at least 400 μs earlier than low-priority ones when monitoring 1k hosts.
Published in: IEEE/ACM Transactions on Networking (Volume: 32, Issue: 4, August 2024)