A cloud-based triage log analysis and recovery framework

https://doi.org/10.1016/j.simpat.2017.07.003

Abstract

With the development of cloud infrastructure, more and more transaction processing systems are hosted on cloud platforms. Logs, which record the production behavior of a transaction processing system in the cloud, are widely used for triaging production failures. Log analysis of a cloud-based system faces challenges as data sizes grow, unstructured formats emerge, and untraceable failures occur more frequently. Additional requirements, such as real-time analysis and failure recovery, are also being raised. Existing solutions each have their own focus and cannot fulfill these increasing requirements. To address the main requirements and issues, this paper proposes a new log model that classifies and analyzes the interactions of services and the detailed logging information produced during workflow execution. A workflow analysis technique is used to triage production failures quickly and to assist failure recovery. The proposed log analysis solution can reconstruct a failed workflow from failures on real-time production servers. The proposed solution is simulated using a large volume of log data and compared with a traditional solution. The experimental results demonstrate the effectiveness and efficiency of the proposed triage log analysis and recovery solution.

Introduction

Originally, “triage” is a medical term meaning to separate, sort, sift, or select medical conditions. In software engineering, triage assigns a priority and severity level to failures, and it is performed periodically, such as every hour, every day, or as necessary. Triage is important for software quality assurance, especially in a cloud environment [1], [2]. Specifically, in a modern Software-as-a-Service (SaaS) system, multiple tenants may share the same SaaS platform and the same service components stored in a database. Each transaction of a tenant is composed from stored service components, and the composed transaction is compiled and deployed at runtime to execute the application. If a failure occurs, it is necessary to know its cause. However, because each transaction application consists of shared service components, engineers need to examine log data to identify the failed transaction first, and then explore the service components involved in that transaction to figure out the cause of the failure.

For a large transaction system, engineers need to examine a large number of logs to identify failures. In most current cases, this examination is performed manually, which is expensive and slow. Furthermore, in a modern cloud-based transaction system, millions of transactions may arrive at the same time, so millions of log records are produced and recorded. Due to concurrent processing, log data arrives asynchronously, which makes log analysis a challenging task.

As a batch processing framework, Hadoop provides scalability and fault tolerance [3], [4], and it supports quick retrieval and searching of log data, so it can be used for log analysis in the cloud. Hadoop breaks a large amount of log data into small blocks and distributes them to different nodes of a cluster; however, its performance is affected by frequent writing and reading of data. Spark uses in-memory computation to minimize the cost of data transfer [5], [41]. Mavridis and Karatza investigated and compared the performance of log analysis applications in Hadoop and Spark with real-world log data [6]. They developed realistic log file analysis applications in both frameworks and performed SQL-type queries on real Apache Web Server log files, focusing on execution time, resource utilization, and scalability in the performance comparison. They also proposed a power consumption model to estimate and compare the cost and power consumption of the two frameworks based on CPU, memory, disk, and network utilization.
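
To make such SQL-type queries concrete, the sketch below gives a minimal PySpark example of the kind of aggregation described, assuming a hypothetical access.log file in Apache common log format; it is not the benchmark code used in [6].

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import regexp_extract, col

    spark = SparkSession.builder.appName("apache-log-sql").getOrCreate()

    # Each line of the raw log becomes one row in the "value" column.
    logs = spark.read.text("access.log")  # hypothetical path

    # Apache common log format: host, identity, user, [time], "request", status, size.
    pattern = r'^(\S+) \S+ \S+ \[([^\]]+)\] "(\S+) (\S+)[^"]*" (\d{3}) (\S+)'
    parsed = logs.select(
        regexp_extract("value", pattern, 1).alias("host"),
        regexp_extract("value", pattern, 5).cast("int").alias("status"),
    )

    # SQL-type query: how many requests ended in each HTTP status code.
    parsed.groupBy("status").count().orderBy(col("count").desc()).show()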

Many frameworks have addressed logging issues from different aspects, such as Splunk [7], LogStash [8], Unilog [9], IBM SmartCloud Analytics [10], Fluentd [11], Google Analytics [12], XpoLog [13], and Nagios [14]. Each existing approach has its own focus. In practice, a large portion of log data has no pre-defined data model and is not organized in a pre-defined manner, and it is challenging to handle such unstructured, high-volume data mixed with untraceable failures [15]. None of the existing systems offers a complete solution. To address these challenges, this paper proposes a new logging approach that can triage production failures effectively and efficiently.

Logs in the cloud are characterized by a big volume, unstructured formats, and a large number of untraceable failures. Failures and exceptions happen frequently and unexpectedly in production as developers keep adding changes to existing or new systems. In commercial log processing, logs are usually kept for one month, after which new data overwrites data that is 30 days old. Even trustworthy cloud platforms, such as Amazon and Google, experience constant downtime according to their status reports [16], [17]. Service downtime on production systems has become frequent, yet even one minute of production downtime can result in various failures and a huge loss. Developers normally give logging the lowest priority during the coding phase, which, together with the heterogeneous deployment of different components in the cloud, later results in unstructured log formats. As log data accumulates at a rate of hundreds of millions of records per hour on a commercial production site, it becomes complicated to manage.

Most transaction processing systems use a service-oriented structure, as shown in Fig. 1. For example, a transaction system may have X services, and each transaction may involve Y services on average. Thus, not only is the transaction information unique for each transaction, but the involved software services also differ. A number of services are selected from the database to compose each transaction application in real time. Suppose X = 200 and Y = 50; then C(200, 50) combinations of transactions, a number on the order of 10^47, can be formed. Moreover, if each transaction involves at least 10 and at most 180 services, the number of possible transactions is even larger.
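
As a quick check of the combinatorics above, the short Python snippet below computes the number of possible service compositions for the example parameters X = 200 and Y = 50; it only illustrates the magnitude claimed in the preceding paragraph.

    import math

    # Distinct compositions of exactly 50 services drawn from 200 available services.
    print(math.comb(200, 50))  # roughly 4.5e47

    # If a transaction may involve anywhere from 10 to 180 services, the count of
    # possible compositions approaches 2**200 (about 1.6e60).
    print(sum(math.comb(200, k) for k in range(10, 181)))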

Although logging is done on a best-effort basis, the production of log data is not guaranteed. For example, Fig. 2 shows typical log data. Currently, most companies address this issue with on-site human examination of logs after software processing to determine manually whether any transaction has failed. This process is cumbersome, expensive, and time-consuming.

Log data can be stored in XML as semi-structured data or even in a relational database as structured data. In most cases, however, log data is not structured, because each transaction is different: for example, the buyer (name, address, contact information, ID), the seller (name, address, contact information, ID), the trading item, the transaction amount, the transaction time, and the place of the transaction all vary. Because each transaction is unique, the log data of different transactions differs. Specifically, each transaction log can vary in the following aspects (see the sketch after this list).

  • Buyer.

  • Seller.

  • Items involved in a transaction.

  • Payment methods.

  • Discount or promotion code.

  • Currency.

  • Time and place of transactions.

  • Risk.

  • Compliance.
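
As a minimal illustration of this variability (all field names and values below are hypothetical, not taken from the paper's data), the two Python records describe two transactions that share a few core fields but otherwise carry different attributes, so no single fixed schema fits both.

    # Two hypothetical transaction log entries with overlapping but unequal field sets.
    entry_a = {
        "workflow_id": "WF-1001",
        "buyer": {"id": "B-17", "name": "Alice"},
        "seller": {"id": "S-02", "name": "Bob"},
        "items": ["book"],
        "payment_method": "credit_card",
        "currency": "USD",
        "time": "2017-03-01T10:15:00Z",
    }
    entry_b = {
        "workflow_id": "WF-1002",
        "buyer": {"id": "B-99", "name": "Carol"},
        "seller": {"id": "S-02", "name": "Bob"},
        "items": ["phone", "charger"],
        "discount_code": "SPRING17",
        "risk_score": 0.82,             # risk and compliance fields appear only sometimes
        "compliance_flags": ["KYC"],
        "place": "Berlin",
    }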

The same person engaged in two different transactions may face different risk and compliance issues. There can be multiple reasons for failures, as follows:

  • 1.

    Code Exception: An exception occurs in the workflow, but the log information is still recorded.

  • 2.

    Service Outage: The service is entirely or partially down, so no transaction details are logged.

  • 3.

    Media Failure: The physical media of the server malfunctions, so either no transaction log exists or only partial transaction information is logged.

In all these cases, the resulting log loss can be categorized as follows:

  • 1.

    Partial log missing;

  • 2.

    Complete log missing.

Most of the time, if it is a pure workflow exception and the details can be logged, the failure is traceable because the interaction logs of the services are still there. However, if the log is partially or completely missing, it is difficult to perform further diagnosis. In the case of a partially missing log, the available detailed log information and the available key activities of the log entries determine whether the case can be completed.

When a service's log is completely missing, there are two possibilities:

  • The service is completely down, and no subsequent service is invoked.

  • Only the log service is down; other services are not interrupted, and subsequent service calls are still invoked.

In the latter case, the other services of the workflow still need to be traced. This approach is consistent with the data-based identification approach [18], or smart data approach.
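
The sketch below, which reuses the service names from Fig. 3 but is not the paper's algorithm, shows how a triage tool might classify a workflow's log coverage into the traceable, partially missing, and completely missing cases described above.

    def classify_log_coverage(expected_services, logged_services):
        """Classify how much of a workflow's expected log trail is present."""
        expected = set(expected_services)
        seen = set(logged_services) & expected
        if not seen:
            return "complete log missing"   # outage or media failure before any entry
        if seen == expected:
            return "traceable"              # exception occurred, but interactions were logged
        return "partial log missing"        # keep tracing the remaining services

    # Example: AuthenticationService logged nothing, so the trail is only partial.
    print(classify_log_coverage(
        ["LoginService", "AuthenticationService", "AuthDBService"],
        ["LoginService", "AuthDBService"],
    ))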

Current triage solutions face various difficulties in log analysis, such as large data sizes, large numbers of services and workflows, inconsistent codes, and inconsistent logging behaviors. For example, Fig. 3 shows two isolated files that store the logs of the same workflow: one for LoginService and the other for AuthenticationService.

According to the RTRM architecture (detailed in Section 2), workflows and the transactions they represent do not enter log data, but the participating services do. Thus, if a workflow fails, it is necessary to trace the logs entered by its participating services. The following factors may complicate this task:

  • 1.

    Each service may be used by many transactions;

  • 2.

    The system may execute many transactions at the same time, resulting in interleaved log data from these transactions;

  • 3.

    Log data may not be entered, because the transaction fails to complete;

  • 4.

    Log data uses service and workflow IDs in an inconsistent manner. For example, in the same workflow, some services use service IDs while others use the workflow ID, so a search by workflow ID produces incomplete log data. Fig. 3 illustrates two such services: LoginService uses a ServiceID, whereas AuthenticationService uses a WorkflowID (a sketch after this list illustrates the effect);

  • 5.

    Service calls may not be explicit among services, so the 36 GB log file needs to be scanned to identify the missing services. For example, LoginService calls AuthenticationService, but the requested information is not available in the log. Scanning and querying log data is expensive; as an interactive process, engineers need to identify the items to be searched first. Because multiple transactions with shared services may fail with the same failure code, care must be taken to distinguish the services involved in multiple failed transactions;

  • 6.

    Missing log data causes trouble. For example, in box No. 3 of AuthenticationService (shown in Fig. 3), the next service call from AuthenticationService to AuthDBService cannot be found in the log. Engineers then have two alternatives:

    • Assume there is no missing call, since no log entry indicates that anything is missing; however, engineers err in this case;

    • Assume there is something missing, in which case engineers need to determine what is missing.

    In this case, AuthDBService is a good candidate because it is involved in authentication. But what happens if multiple similar services are available, or if other unrelated services are the ones missing?
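
To make difficulty No. 4 concrete, the sketch below uses hypothetical log records and a hypothetical ServiceID-to-WorkflowID mapping (it is not the paper's method) to group entries by workflow even when some entries carry only a ServiceID; without such a mapping, those entries remain unresolved and the reconstructed workflow is incomplete.

    from collections import defaultdict

    # Hypothetical pre-parsed entries: one carries only a ServiceID, the other a WorkflowID.
    entries = [
        {"service": "LoginService", "service_id": "S-100", "workflow_id": None, "msg": "login ok"},
        {"service": "AuthenticationService", "service_id": "S-101", "workflow_id": "WF-1001", "msg": "token issued"},
    ]

    # Assumed ServiceID-to-WorkflowID mapping, e.g. recovered from call-site metadata.
    service_to_workflow = {"S-100": "WF-1001"}

    by_workflow = defaultdict(list)
    for e in entries:
        wf = e["workflow_id"] or service_to_workflow.get(e["service_id"])
        by_workflow[wf or "UNRESOLVED"].append(e["service"])

    print(dict(by_workflow))  # {'WF-1001': ['LoginService', 'AuthenticationService']}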

The conventional triage process shown in Fig. 4 has the following steps:

  • Migrate the log from production to off-production servers.

  • Identify related services in workflow.

    Reconstruct the workflow from log files.

    Query the database and check the code to understand the workflow; and

  • Manually generate reports for the concerned workflow.

After examining the logs, database, and code bases, engineers can identify the cause of the failure. The next step is to prioritize the issue based on its business impact, the most important measure being how many similar transactions occur. Engineers need to collect the number of impacted workflow types and the number of incidences in order to respond. Existing queries are mostly based on preset queries or big data systems such as Splunk, which only report at the failure-code level. Since one failure code can map to multiple workflow types, a fine-grained understanding of the workflow is necessary.
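
A minimal sketch of that finer-grained aggregation, using hypothetical failure codes and workflow types: counting failures per (failure code, workflow type) pair exposes the spread of business impact that a failure-code-level report hides.

    from collections import Counter

    # Hypothetical triage records: (failure_code, workflow_type) pairs extracted from logs.
    failures = [
        ("E_AUTH_TIMEOUT", "checkout"),
        ("E_AUTH_TIMEOUT", "checkout"),
        ("E_AUTH_TIMEOUT", "refund"),
        ("E_DB_CONN", "checkout"),
    ]

    per_code = Counter(code for code, _ in failures)   # failure-code-level report
    per_code_and_workflow = Counter(failures)           # finer-grained view for prioritization

    print(per_code)               # Counter({'E_AUTH_TIMEOUT': 3, 'E_DB_CONN': 1})
    print(per_code_and_workflow)  # shows that E_AUTH_TIMEOUT spans two workflow types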

This paper has three main contributions as follows:

  • 1.

    It proposes a failure detection and correction mechanism for real-time transaction management;

  • 2.

    It proposes a workflow indexing method using MapReduce (illustrated by the sketch after this list);

  • 3.

    The proposed system has been extensively simulated and evaluated using an amount of data large enough to represent a large online e-business system.
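
This excerpt does not include the indexing implementation itself; the single-process sketch below only illustrates the general map/reduce idea behind contribution 2, grouping log entries by workflow ID, and assumes a hypothetical pipe-delimited line format.

    from collections import defaultdict

    def map_phase(lines):
        # Hypothetical line format: "timestamp|workflow_id|service|message".
        for line in lines:
            _, workflow_id, service, message = line.split("|", 3)
            yield workflow_id, (service, message)

    def reduce_phase(mapped):
        # Group all entries of one workflow under its workflow ID.
        index = defaultdict(list)
        for workflow_id, entry in mapped:
            index[workflow_id].append(entry)
        return index

    lines = [
        "2017-03-01T10:15:00|WF-1001|LoginService|login ok",
        "2017-03-01T10:15:01|WF-1001|AuthenticationService|token issued",
        "2017-03-01T10:15:02|WF-1002|LoginService|login ok",
    ]
    print(dict(reduce_phase(map_phase(lines))))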

The paper is organized as follows: Section 2 describes the real-time transaction management (RTRM) structure; Section 3 discusses related work; Section 4 presents the design of the fast triage solution; Section 5 evaluates the performance of the proposed solution; and Section 6 concludes the paper.

Section snippets

RTRM architecture

A typical online transaction system has the following architectural characteristics:

  • 1.

    An RTRM system, as shown in Fig. 5, takes user inputs and processes transactions using a pool of cloud resources. An RTRM often has a number of participating subsystems, such as credit card verification, payment methods, risk analysis, fraud prevention, and so on. Some of these subsystems are not under the control of the RTRM;

  • 2.

    RTRM is usually structured in a service-oriented manner with tens of thousands of software services that

Related work

As an extension of the multi-tenancy architecture (MTA), the sub-tenancy architecture (STA) allows tenants to offer services with which subtenant developers can customize their applications in the SaaS infrastructure [19]. In an STA system, tenants can create subtenants and grant their resources (including private services and data) to those subtenants. Compared with MTA, the isolation and sharing of services and data are more complex in STA [20]. It involves many tenants with different relationships,

Logging solutions

As a common method, the grep utility in a Unix/Linux environment is used to search for related information on a box. However, it only works when one knows exactly which log file to search, and it becomes time- and resource-consuming to grep a single transaction ID across a data center in search of one set of logs. It is therefore important to model the logs first and then define the operations for triage.
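
As a rough picture of why such brute-force search is expensive, the sketch below (with hypothetical paths, not a tool from the paper) scans every log file for a single transaction ID; every byte of every file has to be read, which does not scale to logs spread across a data center.

    import glob

    def grep_transaction(transaction_id, log_glob="logs/**/*.log"):
        """Brute-force scan of all matching log files for one transaction ID."""
        hits = []
        for path in glob.glob(log_glob, recursive=True):
            with open(path, errors="ignore") as f:
                for lineno, line in enumerate(f, 1):
                    if transaction_id in line:
                        hits.append((path, lineno, line.rstrip()))
        return hits

    print(grep_transaction("WF-1001"))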

Need for real-time log analysis at transaction times: Majority of log analysis solutions, such as Splunk, LogStash,

Experiment

The experiment runs a large-scale simulation with over a million BL to demonstrate the performance of the proposed solution in retrieving logs and completing FICs for the transactions (details are described in Table 2). Different scenarios of FIC distributions and workflow-level exception rates are then discussed. Finally, the rebuilding algorithm is evaluated based on these inputs.

Fig. 6 shows an architectural view of the logs deployed on the cloud. The shopping transaction with a unique

Conclusion

The large size of data, unstructured formats, and untraceable failures are three unavoidable challenges that seriously affect the log analysis of cloud-based systems. This paper addressed these challenges for log analysis in a cloud infrastructure. A triage framework for log analysis was proposed with four main functions: real-time log analysis, failure recovery, flexible data processing, and white-box log analysis.

The proposed triage solution can be summarized as follows:

  • It presented a modeling framework

References (41)

  • D. Zou et al., Improving log-based fault diagnosis by log classification, Network and Parallel Computing (2014)
  • IBM, IBM SmartCloud Analytics - Log Analysis, ...
  • Fluentd, Open Source Data Collector, 2017, ...
  • Google, Google Analytics Official Website, 2017, ...
  • XpoLog, XpoLog Log Management - Log Analysis with Log Analytic Search, 2017, ...
  • Nagios, Nagios - The Industry Standard in IT Infrastructure Monitoring, 2017, ...
  • A. Miranskyy et al., Operational-log analysis for big data systems: challenges and solutions, IEEE Softw. (2016)
  • Amazon, AWS Service Health Dashboard, 2017, ...
  • Google, Google App Engine System Status, 2017, ...
  • T. Segaran et al., Beautiful Data: The Stories Behind Elegant Data Solutions (2009)