1 Motivation

Driven by the explosive growth of services in cloud computing environments, traditional, predesigned, one-size-fits-all monitoring tools are no longer efficient enough for cloud monitoring, for three reasons. (1) The number of monitoring targets has become huge, so a traditional centralized management structure cannot efficiently coordinate the dispersed collection agents. An efficient, distributed organization of the monitoring agents is required to ensure the performance of monitoring systems [2]. (2) The volume of data disseminated across data centers is large, which puts considerable extra pressure on the network [4]. To relieve this pressure, carefully designed strategies for handling collected data are necessary, such as dynamic control of monitoring targets, monitoring intervals, and data polling [3]. (3) In cloud environments, services are located on infrastructure spread across different regions, which makes the underlying network structure for data dissemination complex. To ensure fast data access and high availability, new protocols that provide better solutions for storing and retrieving monitoring data are imperative [5].

For these reasons, it is vital to design monitoring mechanisms according to the characteristics and monitoring requirements of a specific data center in a cloud computing environment. In this work, we propose SimMon, a toolkit to simulate monitoring mechanisms and evaluate their effectiveness. SimMon is used in two main scenarios: (1) to test whether a monitoring strategy would work well in a certain data center before it is deployed and run in a real production environment; (2) to compare the results of different strategies and decide which strategy is the most appropriate for a specific data center.

2 Architecture of SimMon

Figure 1 depicts the architecture of SimMon. It is composed of four main components: network, data storage, data dissemination, and strategy control panel.

Fig. 1. Architecture of SimMon

Modeling Network. The network layer is an important consideration, since the bandwidth and time consumed by data transmission depend heavily on the underlying network structure and on the logical topology of the monitoring system. To model the network structure, we simulate the behavior of root switches, aggregation switches, and access switches separately and combine them into a layered structure. Apart from the underlying network structure, the logical monitoring topology concerns the organization of monitoring agents. A monitoring system commonly consists of three kinds of agents: collection agents (also called sensors), federation agents, and root agents. Collection agents are hosted on the same virtual machine as the target service and collect monitoring measurements locally. Federation agents are in charge of organizing and processing data for a subset of collection agents. Root agents act as the central nervous system, controlling the global scheduling strategies of the monitoring system. We define these three kinds of agents to support the design of centralized, tree-based, P2P-based, and hybrid topologies. We adopt an event-based mechanism to control packet transmission and to handle packet loss. Latency between nodes is calculated from the underlying network structure and a BRITE-style file that contains delay metrics between each pair of virtual machines.
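As a rough illustration of how latency can follow from a layered switch structure, the sketch below computes the one-way delay between two virtual machines by climbing to the lowest switch they share and descending again. It is not taken from the SimMon source: the class name TreeLatencyModel, its methods, and the fixed per-link delays are all assumptions made for this example, and a real run would presumably take delays from the BRITE-style file instead of constants.

```java
import java.util.HashMap;
import java.util.Map;

/** Illustrative sketch (not SimMon code): latency between VMs in a three-tier switch tree. */
final class TreeLatencyModel {
    // Assumed one-way per-link delays in microseconds; real delays would be loaded from a file.
    static final long HOST_TO_ACCESS_US = 100;
    static final long ACCESS_TO_AGG_US = 200;
    static final long AGG_TO_ROOT_US = 500;

    /** Where a VM sits: its host and the access and aggregation switches above that host. */
    private static final class Location {
        final int host;
        final int accessSwitch;
        final int aggSwitch;

        Location(int host, int accessSwitch, int aggSwitch) {
            this.host = host;
            this.accessSwitch = accessSwitch;
            this.aggSwitch = aggSwitch;
        }
    }

    private final Map<Integer, Location> locations = new HashMap<>();

    /** Registers a VM; host and switch ids are assumed to be globally unique. */
    void place(int vmId, int host, int accessSwitch, int aggSwitch) {
        locations.put(vmId, new Location(host, accessSwitch, aggSwitch));
    }

    /** One-way latency: climb to the lowest switch shared by both VMs, then descend. */
    long latencyMicros(int vmA, int vmB) {
        Location a = locations.get(vmA);
        Location b = locations.get(vmB);
        if (a == null || b == null) {
            throw new IllegalArgumentException("unknown VM id");
        }
        if (a.host == b.host) {
            return 0; // same host: no switch is traversed
        }
        long climb = HOST_TO_ACCESS_US;
        if (a.accessSwitch != b.accessSwitch) {
            climb += ACCESS_TO_AGG_US;
            if (a.aggSwitch != b.aggSwitch) {
                climb += AGG_TO_ROOT_US;
            }
        }
        return 2 * climb; // the descent mirrors the climb
    }
}
```

With these assumed delays, two VMs under the same access switch are 200 microseconds apart, two VMs under the same aggregation switch are 600 microseconds apart, and a path through the root switch costs 1600 microseconds.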

Modeling Data Storage. In monitoring systems, collected data are usually transferred from collection agents to federation agents and stored in a data repository for later query and analysis. To model the data storage process, we provide an interface for simulating different data repositories, such as MySQL and HBase. Besides database simulation, the organization of the storage nodes in a distributed database is also important for reducing the total network bandwidth cost and the chance of resource conflicts. A good algorithm for organizing storage nodes should consider the volume of data to be transferred and the resource usage of business-related workloads in the data center. Furthermore, in the data query process, it is important to find the shortest route to the requested data. We implement a caching strategy that keeps the data collected in the most recent period in a cache for fast queries. To ensure high availability, we design a structure that lets users define different replication strategies. More than one replica of the collected data is stored on replication servers in case of emergency.
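The idea of serving queries for recently collected data from a cache can be sketched as follows. This is a minimal example under stated assumptions, not the SimMon implementation: the class RecentSampleCache, the time-window eviction rule, and the hit and miss counters are hypothetical names and choices introduced only for illustration.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import java.util.Optional;

/** Illustrative sketch (not SimMon code): cache holding only the most recent samples. */
final class RecentSampleCache {
    private static final class Sample {
        final long timestamp;
        final double value;

        Sample(long timestamp, double value) {
            this.timestamp = timestamp;
            this.value = value;
        }
    }

    private final long windowMillis;
    private final Deque<Sample> samples = new ArrayDeque<>();
    private long hits;
    private long misses;

    RecentSampleCache(long windowMillis) {
        if (windowMillis <= 0) {
            throw new IllegalArgumentException("window must be positive");
        }
        this.windowMillis = windowMillis;
    }

    /** Stores a sample and evicts samples older than the window; timestamps must not decrease. */
    void put(long timestamp, double value) {
        samples.addLast(new Sample(timestamp, value));
        while (samples.peekFirst().timestamp < timestamp - windowMillis) {
            samples.removeFirst();
        }
    }

    /** Returns values with timestamp >= from, or empty (a miss) if the range starts before the window. */
    Optional<List<Double>> query(long from) {
        if (samples.isEmpty() || from < samples.peekLast().timestamp - windowMillis) {
            misses++; // the caller would fall back to the data repository
            return Optional.empty();
        }
        hits++;
        List<Double> values = new ArrayList<>();
        for (Sample s : samples) {
            if (s.timestamp >= from) {
                values.add(s.value);
            }
        }
        return Optional.of(values);
    }

    long hits() { return hits; }

    long misses() { return misses; }
}
```

A query that returns empty stands for a cache miss, in which case a simulated monitoring system would have to fetch the data from the repository and pay the corresponding network cost.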

Modeling Data Dissemination. In a distributed monitoring system, monitoring data are collected by collection agents and disseminated to federation agents. In the dissemination layer, three main processes may consume extra resources: getting monitoring data, disseminating the data, and receiving the data. For getting monitoring data, we simulate strategies that deploy groups of sensors intelligently, and we implement algorithms for precise target selection, accurate collection-interval selection, and dynamic data preprocessing to reduce the data at the source. For disseminating the data, we reduce the number of dissemination actions through intelligent strategies for load balancing and data polling (data collected by the dispersed collection agents can be pushed to federation nodes, which receive them passively, or pulled proactively by federation nodes). For receiving the transferred data, we design two protocols: a unicast protocol and a multicast protocol. The unicast protocol can improve the accuracy of delivered data, while the multicast protocol makes data delivery more efficient.
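The difference between push and pull can be made concrete with a small event-driven count of messages. The sketch below is hypothetical and not drawn from SimMon: the class DisseminationSim, the fixed interval shared by all agents, and the cost model (one message per push, a request plus a response per pull) are assumptions chosen to keep the example short.

```java
import java.util.PriorityQueue;

/** Illustrative sketch (not SimMon code): message count for periodic push versus pull. */
final class DisseminationSim {
    enum Mode { PUSH, PULL }

    /** A scheduled data exchange between one collection agent and its federation agent. */
    private static final class Event implements Comparable<Event> {
        final long time;
        final int agent;

        Event(long time, int agent) {
            this.time = time;
            this.agent = agent;
        }

        @Override
        public int compareTo(Event other) {
            return Long.compare(time, other.time);
        }
    }

    /** Counts messages exchanged in [0, endTime) when every agent reports once per interval. */
    static long run(Mode mode, int agents, long interval, long endTime) {
        if (interval <= 0) {
            throw new IllegalArgumentException("interval must be positive");
        }
        PriorityQueue<Event> queue = new PriorityQueue<>();
        for (int agent = 0; agent < agents; agent++) {
            queue.add(new Event(interval, agent)); // first report after one interval
        }
        long messages = 0;
        while (!queue.isEmpty()) {
            Event event = queue.poll();
            if (event.time >= endTime) {
                continue; // past the horizon: drop the event without rescheduling
            }
            // Assumed cost model: a push is one message, a pull is a request plus a response.
            messages += (mode == Mode.PUSH) ? 1 : 2;
            queue.add(new Event(event.time + interval, event.agent));
        }
        return messages;
    }
}
```

For example, DisseminationSim.run(DisseminationSim.Mode.PUSH, 3, 10, 60) counts 15 messages, while the same setting in PULL mode counts 30 under the assumed cost model. A fuller model would also weigh message sizes and the staleness of the data each strategy delivers.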

Modeling Strategy Control Panel. Sensors are the source of monitoring data, and they are developed and deployed independently of monitoring systems. Monitoring systems should therefore be able to discover independent sensors and add them to management consoles. There are two main ways to discover newly installed sensors: event-based announcements from the installed sensors and periodic scans by federation agents or root agents. We implement security and privacy policies by creating subnets for particular sets of sensors. Monitoring systems are also expected to send alarms to system administrators when certain kinds of events occur. To filter out false alarms, we build an interface that lets users redesign the alarm strategy and compare the results.

3 Implementation

To build SimMon, we first use classes inherited from CloudSim [1] to construct a testbed that contains hosts, switches, virtual machines, and workloads; the testbed simulates a data center in a cloud computing environment. On top of the simulated data center, we develop a new toolkit that lets users build different monitoring mechanisms. The toolkit is implemented in Java and contains 15890 lines of code in total. The source code of SimMon is available at http://www.cmsci.net/pxlin/simmon.

4 Demonstration

In the demonstrations, we first use SimMon to simulate a cloud data center with 10000 physical servers (PSs), each of which hosts 16 virtual machines. The PSs are dispersed across 10 individual small-scale data centers, and they are connected by a tree-based underlying network. Workloads running in the cloud environment are simulated with certain distributions. The simulated cloud data center is the target that we want to monitor, so all the monitoring mechanisms are designed for it. We use three examples to demonstrate three common usage scenarios of SimMon.

  • Influence of different topologies in monitoring systems. In this demonstration, we build four monitoring systems with different monitoring topologies: star-based, tree-based, P2P-based, and hybrid. In each monitoring system, we first simulate a data polling strategy that pushes collected data to federation agents at a certain interval. We then sum the total extra cost caused by each monitoring system and compare the influence of the different topologies.

  • Comparison of data dissemination cost under different polling strategies. In this demonstration, we implement three data polling strategies: push at a certain interval, hybrid push and pull, and intelligent switching between push and pull. Using SimMon, we compare the extra cost and accuracy of these strategies.

  • Effective alarm reduction by different alarm strategies. In this demonstration, we test three different strategies for producing alarms: alarm on CPU usage, alarm on memory usage, and alarm on CPU and memory usage. Using SimMon, we compare the number of effective alarms produced by the three strategies.
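The three alarm strategies in the last demonstration can be pictured as interchangeable predicates over usage samples. The sketch below is an assumption-laden illustration rather than SimMon's interface: the names AlarmStrategies and AlarmStrategy, the 0.9 threshold, and the reading of "CPU and memory" as a conjunction are all hypothetical. It counts every alarm that fires; deciding which alarms are effective would need labeled incidents, which are not modeled here.

```java
import java.util.List;

/** Illustrative sketch (not SimMon code): pluggable alarm strategies over usage samples. */
final class AlarmStrategies {
    /** One measurement, with CPU and memory usage as fractions in [0, 1]. */
    static final class Sample {
        final double cpu;
        final double mem;

        Sample(double cpu, double mem) {
            this.cpu = cpu;
            this.mem = mem;
        }
    }

    /** A strategy decides whether a single sample raises an alarm. */
    interface AlarmStrategy {
        boolean fires(Sample sample);
    }

    // Assumed threshold; the source does not state the values used in the demonstration.
    static final double THRESHOLD = 0.9;

    static final AlarmStrategy CPU_ONLY = s -> s.cpu > THRESHOLD;
    static final AlarmStrategy MEM_ONLY = s -> s.mem > THRESHOLD;
    // Assumption: "CPU and memory" means both must exceed the threshold.
    static final AlarmStrategy CPU_AND_MEM = s -> s.cpu > THRESHOLD && s.mem > THRESHOLD;

    /** Counts how many samples raise an alarm under the given strategy. */
    static long countAlarms(List<Sample> samples, AlarmStrategy strategy) {
        long count = 0;
        for (Sample sample : samples) {
            if (strategy.fires(sample)) {
                count++;
            }
        }
        return count;
    }
}
```

Running countAlarms over the same sample list with each of the three strategies gives directly comparable alarm counts, which is the kind of side-by-side comparison the demonstration describes.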