
1 Introduction

Technological advances have led the telecommunications industry to generate huge amounts of data, known as Big Data [11]. For example, the AT&T Company reported that an average of 300 million international calls are placed on a normal day using its services [3]. The improper use of these services to perform malicious activities, such as fraud in telecommunication services and network intrusion, makes it necessary to analyze the data in real time [12].

A system that performs this type of processing can prevent, or at least reduce, the damage caused by the execution of malicious activities. Regarding systems that process such data, there is no universally accepted reference model, as is common with technologies in their early phases of life. However, there are industrial and academic proposals that try to define the components that make up a Big Data system, including their relationships and properties [13].

In recent years, distributed systems have been one of the most widely used approaches for the efficient analysis of large volumes of data [7]. In this paper, we present a framework that allows the distributed processing of data streams in real time. Our proposal is designed to support any data-processing application that needs to be executed in a distributed way. The experimental results show how much an application can improve its efficiency using our proposal.

The rest of the paper is organized as follows. The next section describes the related work. Our proposal is presented in Sect. 3. The case study is presented in Sect. 4. Finally, the conclusions and future work are given in Sect. 5.

2 Related Work

Generally, distributed systems include a strategy to provide communication among the different system modules, facilitating data exchange [14]. Some of these strategies aim to provide reliable and flexible services for asynchronous data exchange.

These include the Java Message Service (JMS) [4], the Internet Communications Engine (ICE) [6], the Common Object Request Broker Architecture (CORBA) [8], and the Data Distribution Service (DDS) [9]. JMS relies on the Java platform and allows distributed communication using a common API for the development of message-based Java applications. ICE, CORBA, and DDS use an Object Request Broker (ORB), which consists of an object exchange interface. The most advanced of these is DDS, which was developed to standardize data distribution across different platforms.

There are other, more specific strategies that are oriented to real-time processing. For example, the Open RObot COntrol Software (OROCOS) platform [2] for application control provides tools for data exchange and event-based services. A more recent strategy is iLAND [10], which consists of a data distribution service for real-time applications that incorporates reconfiguration logic.

On the other hand, DREQUIEMI [10] is based on the Java language and on the experience obtained with iLAND. There is also an architecture presented in [1], which is based on the Java language and designed for real-time analysis.

Several of the above strategies are used to integrate distributed enterprise applications [5]. However, none of them is generic enough to accommodate an arbitrary application that is required to process data streams in real time, or very close to it.

Some frameworks have been proposed to deal with Big Data, such as Apache Hadoop, Apache Storm, Apache Spark, and Apache Flink, which are briefly described as follows. Hadoop [15] is open-source software for reliable, scalable, distributed computing. It was the first framework to enable big data processing, and it uses batch processing.

Spark [16] is a distributed computing system that runs on computer clusters. It is an open-source framework with a very active community; it is fast; it offers a convenient interactive console for developers; and it provides an API to work with big data. Despite this, Spark has some limitations: it does not support true real-time processing, since it simulates streaming through micro-batches and each iteration is scheduled and executed separately; it requires a lot of memory to run; and problems arise when working with small files.

Storm [17] is a distributed framework for real-time data processing. It is built to be scalable, extensible, efficient, easy to administer, and fault-tolerant.

Flink [18] is a general-purpose distributed framework that, like Spark, processes data efficiently. Unlike Flink, Spark cannot handle a dataset larger than the memory it has available. In addition, Spark works in batches of data, whereas Flink processes data as a continuous stream.

The previous proposals are very efficient and widely used today. However, in order to use these frameworks, many computational resources are needed, or a large amount of time has to be devoted to creating an entire infrastructure for their use, without taking into account the required programming time. In contrast, the proposal presented in this article aims to give a quick and efficient solution to a big data problem without using large computing resources.

3 Our Proposal

This section presents a framework that allows distributing data and managing several instances of the same application running in parallel. As part of the strategy presented in this paper, a rule engine is used as the application to be managed.

The rule engine and the framework are described below.

3.1 Rule Engine

A rule engine (RE) usually evaluates rules expressed with the notation “if X then Y” (\(X \Rightarrow Y\)), where X is a set of conditions of interest and Y is the action to take when X is fulfilled. Figure 1 shows the basic model of an RE for the processing of a data stream.

Fig. 1. Basic scheme of a rules engine.

The rule base shown in Fig. 1 provides persistent storage for a set of rules. When the system begins to process a data stream, the existing rules are immediately evaluated and, when the conditions of some rule are fulfilled, the corresponding action is “triggered”.
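To make this behavior concrete, the following is a minimal Python sketch of such an evaluation loop; the names (Rule, process_stream, the record fields) are ours and do not come from the system described in this paper.

from dataclasses import dataclass
from typing import Callable, Iterable

@dataclass
class Rule:
    condition: Callable[[dict], bool]  # X: a predicate over a record
    action: Callable[[dict], None]     # Y: fired when X is fulfilled

def process_stream(stream: Iterable[dict], rule_base: list[Rule]) -> None:
    # Evaluate every rule of the rule base against each incoming record
    # and "trigger" the action of each rule whose conditions hold.
    for record in stream:
        for rule in rule_base:
            if rule.condition(record):
                rule.action(record)

# Usage: alert on failed-logon records.
rules = [Rule(lambda r: r.get("event") == "failed_logon",
              lambda r: print("ALERT:", r))]
process_stream([{"event": "failed_logon", "host": "pc01"}], rules)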

3.2 Proposed Framework

Our proposal consists of a framework that, from now on, we will refer to as the Task Manager (TM). Its main function is to manage tasks that must be processed in a distributed way by several instances of the same application. Algorithm 1 shows the general process followed by our Task Manager.

Algorithm 1. General process followed by the Task Manager.

The TM initially reads the configuration file, where several parameters are defined by the analyst. Note that the TM runs as a daemon in the operating system, so it runs continuously until it is stopped by the analyst. The next step is to search the data directory for any dataset to be processed. If none exists, the TM waits for a time (defined by the analyst in the configuration file) and re-checks the directory. If a dataset exists, the TM builds the task; when multiple datasets exist, the TM selects the oldest (line 3).

After the task is generated, the TM checks whether any of the running application instances is available to process a new task. An instance is available when it is not processing a task. The application instances can run on different computers, and the location of each one is defined by the analyst in the configuration file. If no application instance is available, the TM waits for a time defined by the analyst and checks again. If an instance of the application is available, the TM assigns the selected task to that instance and registers the assignment (line 5).

Once the TM assigns a task, it begins to check its status. To do this, it checks whether the application instance to which it assigned the task has finished. An instance is considered finished when it is registered as having been assigned a task and is no longer running. If the task is still running, the TM waits for the specified time and re-checks whether any of the registered instances have finished (lines 6–8).
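Since Algorithm 1 itself is not reproduced in the text, the following Python sketch reconstructs the loop from the description above; every name (Instance, oldest_dataset, the directory and polling constants) is our own assumption about how such a daemon could be organized, not the authors' implementation.

import os
import time
from typing import Optional

DATA_DIR = "data"      # data directory (analyst-defined; assumed name)
WAIT_SECONDS = 5       # waiting time between checks (analyst-defined)

class Instance:
    """Stand-in for a remote application instance; a real TM would query
    the remote process instead of reading these flags locally."""
    def __init__(self, host: str) -> None:
        self.host = host
        self.busy = False                 # available when not busy
        self.task: Optional[str] = None   # registered task, if any

def oldest_dataset() -> Optional[str]:
    files = [os.path.join(DATA_DIR, f) for f in os.listdir(DATA_DIR)]
    return min(files, key=os.path.getmtime) if files else None

def task_manager(instances: list[Instance]) -> None:
    registered: list[Instance] = []
    while True:                            # daemon: runs until stopped
        dataset = oldest_dataset()         # line 3: oldest dataset -> task
        if dataset is None:
            time.sleep(WAIT_SECONDS)       # no dataset: wait and re-check
            continue
        idle = next((i for i in instances if not i.busy), None)
        if idle is None:
            time.sleep(WAIT_SECONDS)       # all busy: wait and re-check
            continue
        idle.busy, idle.task = True, dataset   # line 5: assign and register
        registered.append(idle)
        for inst in [i for i in registered if i.task and not i.busy]:
            print("finished:", inst.task)  # lines 6-8: report finished tasks
            inst.task = None
            registered.remove(inst)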

Figure 2 shows a scheme where we use a rule engine as the application to be distributed. In this case, a task consists of a dataset to be processed and a set of rules to be evaluated. As can be seen, there are two directories from which the Task Manager is fed. If a dataset exists, the TM searches the rules directory for the rules that share the same identifier as the selected dataset and builds the task. When a task is generated, the TM checks whether any of the running application instances is available to process a new task.

Fig. 2. Task Manager operations diagram for processing the Windows Operating System logs by a rules engine.

In this case, when an application instance finishes, the TM searches the alerts database for the alerts that correspond to the identifier of the finished task. If there are alerts, they are reported; otherwise, it is reported that no alerts were issued for that task.
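The identifier-based pairing and the alert lookup can be sketched as follows; the file-naming convention, the .rules extension, and the alerts table layout are assumptions made for illustration.

import glob
import os
import sqlite3

def build_task(dataset_path: str, rules_dir: str) -> tuple[str, list[str]]:
    # Pair the dataset with the rule files that share its identifier.
    task_id = os.path.splitext(os.path.basename(dataset_path))[0]
    return task_id, glob.glob(os.path.join(rules_dir, task_id + "*.rules"))

def report_alerts(db: sqlite3.Connection, task_id: str) -> None:
    # Once the instance has finished, look up the alerts for this task.
    rows = db.execute("SELECT message FROM alerts WHERE task_id = ?",
                      (task_id,)).fetchall()
    if rows:
        for (message,) in rows:
            print(f"[{task_id}] ALERT: {message}")
    else:
        print(f"[{task_id}] no alerts were issued for this task")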

4 Case Study

As a distributed processing platform, the application (in this case the rule engine) can be located in a cluster with an unrestricted number of processing nodes, as well as on a single computer. The experiment included five personal computers (PCs), each equipped with a quad-core processor and 4 GB of RAM. The TM was installed on one of them, and the RE was installed on the remaining four.

A test scenario was created in a laboratory with 20 PCs with the Windows 8 operating system installed, in order to show the improvement in terms of efficiency that the proposal can provide. The experiment consists in analyzing the Windows event logs generated by each PC (see Fig. 3). For this, an agent was used to capture each new log generated and create a dataset D, which it sends to the TM data directory. This agent was installed on each of the 20 PCs used in the experiment. The analyzed logs are associated with Windows application, system, and security events.

Fig. 3. Test scenario for the analysis of Windows Operating System event logs.

When the agent is executed for the first time, it sends an initial dataset with all the event logs stored by the system and then waits for new logs to send to the TM data directory. Taking this into account, it was defined that the size of each dataset to be sent does not exceed a threshold defined by the analyst (in this case, 20,000 logs). In this way, the first dataset represents a data flow of around 20,000 event logs per PC. Running all the agents at the same time generates 20 datasets to be processed. The time to process each dataset is measured from the moment the dataset is copied into the directory until the TM reports the alerts associated with the processed dataset.
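The batching behavior of the agent can be sketched as follows; we abstract the Windows event logs as an iterable of records, and the directory name and file-naming scheme are our assumptions.

import itertools
import os
import uuid
from typing import Iterable

THRESHOLD = 20_000        # analyst-defined maximum dataset size
TM_DATA_DIR = "tm_data"   # TM data directory (assumed name)

def send_in_batches(logs: Iterable[str], pc_name: str) -> None:
    it = iter(logs)
    while True:
        batch = list(itertools.islice(it, THRESHOLD))
        if not batch:
            break
        # The file name carries an identifier so the TM can pair the
        # dataset with its rules and, later, with its alerts.
        path = os.path.join(TM_DATA_DIR, f"{pc_name}-{uuid.uuid4().hex}.log")
        with open(path, "w", encoding="utf-8") as f:
            f.write("\n".join(batch))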

Five rules were designed for the experiments, which evaluate regular expressions on the different fields of the event logs. The rules were created intentionally to match some test logs, in order to validate the operation of the system.
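As an example of the kind of rule used, the following checks a regular expression against a field of an event log; the field names and the chosen pattern are illustrative (event ID 4625 is the Windows "failed logon" security event), not the actual rules of the experiment.

import re

FAILED_LOGON = re.compile(r"\b4625\b")  # Windows security event ID 4625

def rule_failed_logon(log: dict) -> bool:
    # Condition X: the event-ID field matches the regular expression.
    return bool(FAILED_LOGON.search(log.get("event_id", "")))

log = {"channel": "Security", "event_id": "4625", "message": "logon failure"}
if rule_failed_logon(log):
    print("ALERT: failed logon detected")  # action Y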

In order to evaluate the advantage of the proposed Task Manager, two experiments were designed. In the first, a single instance of the rule engine was executed without the TM; in the second, four instances of the distributed rule engine were executed on different PCs using the TM.

Figure 4 shows a comparison between the time taken by a rule engine instance, without the TM (Experiment 1), to process the generated event logs, and the time taken by the TM with four instances of the RE (Experiment 2).

Fig. 4. Comparison of the rule engine processing time with and without Task Manager.

In Experiment 1, the time taken to evaluate the five rules over the first four generated datasets (80,000 event logs) was 2.5 s. In Experiment 2, the generated datasets were analyzed in parallel, and the time taken to process the same four datasets was 1.2 s.
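Making the comparison explicit, with \(t_{1} = 2.5\) s and \(t_{2} = 1.2\) s, the speedup of the distributed configuration is \(t_{1}/t_{2} \approx 2.1\), i.e., a reduction of \(1 - t_{2}/t_{1} = 1 - 1.2/2.5 = 52\%\) in processing time.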

The results achieved show that, using the TM, the time taken to process 400,000 event logs was reduced by more than 50% with respect to a single instance of the rule engine.

Extrapolating from the above, it can be estimated that, using 8 PCs to increase the distributed processing capacity (one PC per RE), around 800,000 logs could be processed in less than 6 s.
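This estimate is consistent with the measurements under the assumption that processing time is dominated by the per-instance load: \(800{,}000/8 = 100{,}000\) logs per instance, the same load as the \(400{,}000/4 = 100{,}000\) logs per instance of Experiment 2, so a comparable total processing time is to be expected.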

The experiments performed show that our proposal is scalable and highly efficient, making it feasible to apply in scenarios where it is necessary to analyze large volumes of data in real time, or very close to it. It is important to note that, in some scenarios, "real time" depends on the requirements or demands of the user, that is, on what the user considers a real-time analysis.

5 Conclusions

In this paper, we have presented a framework, called Task Manager, for the distributed processing of data streams.

The case study showed the improvement, in terms of efficiency, of the RE application when using our proposal for large-scale filtering of Windows event logs. As can be seen from the experiments performed, the proposed system is scalable, which implies that the computing capacity is directly related to the number of processing units included. Another aspect to be highlighted is that homogeneity is not required among the computational units; that is, both conventional PCs and blade clusters can participate. Other important advantages of this tool are its flexibility and its adaptability.

Our next step is to use dedicated hardware for high-performance tasks, which could considerably increase the performance of the proposal, making it competitive with other distributed systems.