
1 Introduction

Technological advances have led the telecommunications industry to generate huge amounts of data, known as Big Data [11]. For example, the AT&T Company reported that an average of 300 million international calls are placed on a normal day using its services [3]. The improper use of these services to perform malicious activities, such as fraud in telecommunication services and network intrusion, makes it necessary to analyze the data in real time [12].

A system that performs this type of processing can prevent, or at least reduce, the damage caused by the execution of malicious activities. Regarding systems that process such data, there is no universally accepted reference model, as is common with technologies in their early phases of life. However, there are industrial and academic proposals that try to define the components that make up a Big Data system, including their relationships and properties [13].

In recent years, distributed systems have been one of the most widely used approaches for the efficient analysis of large volumes of data [7]. In this paper, we present a framework that allows the distributed processing of data streams in real time. Our proposal is designed to support any data-processing application that needs to be executed in a distributed way. The experimental results show how much an application can improve its efficiency using our proposal.

The rest of the paper is organized as follows. The next section describes the related work. Our proposal is presented in Sect. 3. The case study is presented in Sect. 4. Finally, the conclusions and future work are given in Sect. 5.

2 Related Work

Generally, distributed systems include a strategy to provide communication among the different system modules, facilitating data exchange [14]. Some of these strategies aim to provide reliable and flexible services for asynchronous data exchange.

These include the Java Message Service (JMS) [4], the Internet Communications Engine (ICE) [6], the Common Object Request Broker Architecture (CORBA) [8], and the Data Distribution Service (DDS) [9]. JMS relies on the Java platform and allows distributed communication using a common API for the development of message-based Java applications. ICE, CORBA, and DDS use an Object Request Broker (ORB), which consists of an object exchange interface. The most advanced of these is DDS, which was developed to standardize data distribution across different platforms.

There are other, more specific strategies that are oriented to real-time processing. For example, the Open RObot COntrol Software (OROCOS) platform [2] for application control provides tools for data exchange and event-based services. A more recent strategy is iLAND [10], which consists of a data distribution service for real-time applications that incorporates reconfiguration logic.

On the other hand, DREQUIEMI [10] is based on the Java language and on the experience obtained with iLAND. There is also an architecture presented in [1], which is based on the Java language and designed for real-time analysis.

Several of the above strategies are used to integrate distributed enterprise applications [5]. However, none of them is generic enough to accommodate an arbitrary application that is required to process data streams in real time, or very close to it.

Some frameworks have been proposed to deal with Big Data, such as Apache Hadoop, Apache Storm, Apache Spark, and Apache Flink, which are briefly described as follows. Hadoop [15] is open-source software for reliable, scalable, distributed computing. It was the first framework to enable big data processing, and it uses batch processing.

Spark [16] is a distributed computing system that runs on computer clusters. It is an open-source framework with a very active community; it is fast; it offers a convenient interactive console for developers; and it provides an API to work with big data. Despite this, Spark has some limitations: it does not support true real-time processing, since it simulates streaming through micro-batches and each iteration is scheduled and executed separately; it requires a lot of memory to run; and problems arise when working with small files.

Storm [17] is a distributed framework for real-time data processing. It is built to be scalable, extensible, efficient, easy to administer, and fault-tolerant.

Flink [18] is a general-purpose distributed framework that, like Spark, processes data efficiently. Unlike Flink, Spark cannot handle a dataset larger than the memory it has available. In addition, Spark works in batches of data, whereas Flink processes data as a continuous stream.

The previous proposals are very efficient and widely used today. However, in order to use these frameworks, many computational resources are needed, or a large amount of time has to be devoted to creating an entire infrastructure for their use, without taking into account the required programming time. In contrast, the proposal presented in this article aims to give a quick and efficient solution to a big data problem without using large computing resources.

3 Our Proposal

This section presents a framework that allows distributing data and managing several instances of the same application running in parallel. As part of the strategy presented in this paper, a rule engine is used as the application to be managed.

The rule engine and the framework are described below.

3.1 Rule Engine

A rule engine (RE) usually evaluates rules expressed with the notation “if X then Y” (\(X \Rightarrow Y\)), where X is a set of conditions of interest and Y is the action to take when X is fulfilled. Figure 1 shows the basic model of an RE for the processing of a data stream.

Fig. 1. Basic scheme of a rules engine.

The rule base shown in Fig. 1 provides persistent storage for a set of rules. When the system begins to process a data stream, the existing rules are immediately evaluated and, when the conditions of some rule are fulfilled, the corresponding action is “triggered”.
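To make this behavior concrete, the following is a minimal Python sketch of such an evaluation loop; the names (Rule, process_stream, the record fields) are ours and do not come from the system described in this paper.

from dataclasses import dataclass
from typing import Callable, Iterable

@dataclass
class Rule:
    condition: Callable[[dict], bool]  # X: a predicate over a record
    action: Callable[[dict], None]     # Y: fired when X is fulfilled

def process_stream(stream: Iterable[dict], rule_base: list[Rule]) -> None:
    # Evaluate every rule of the rule base against each incoming record
    # and "trigger" the action of each rule whose conditions hold.
    for record in stream:
        for rule in rule_base:
            if rule.condition(record):
                rule.action(record)

# Usage: alert on failed-logon records.
rules = [Rule(lambda r: r.get("event") == "failed_logon",
              lambda r: print("ALERT:", r))]
process_stream([{"event": "failed_logon", "host": "pc01"}], rules)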

3.2 Proposed Framework

Our proposal consists of a framework that, from now on, we will refer to as the Task Manager (TM). Its main function is to manage tasks that must be processed in a distributed way by several instances of the same application. Algorithm 1 shows the general process followed by our Task Manager.

Algorithm 1. General process followed by the Task Manager.

The TM initially reads the configuration file, where several parameters are defined by the analyst. Note that the TM runs as a daemon in the operating system, so it runs continuously until it is stopped by the analyst. The next step is to search the data directory for any dataset to be processed. If none exists, the TM waits for a time (defined by the analyst in the configuration file) and re-checks the directory. If a dataset exists, the TM builds the task; when multiple datasets exist, the TM selects the oldest (line 3).

After the task is generated, the TM checks whether any of the running application instances is available to process a new task. An instance is available when it is not processing a task. The application instances can run on different computers, and the location of each one is defined by the analyst in the configuration file. If no application instance is available, the TM waits for a time defined by the analyst and checks again. If an instance of the application is available, the TM assigns the selected task to that instance and registers the assignment (line 5).

Once the TM assigns a task, it begins to check its status. To do this, it checks whether the application instance to which it assigned the task has finished. An instance is considered finished when it is registered as having been assigned a task and is no longer running. If the task is still running, the TM waits for the specified time and re-checks whether any of the registered instances have finished (lines 6–8).
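Since Algorithm 1 itself is not reproduced in the text, the following Python sketch reconstructs the loop from the description above; every name (Instance, oldest_dataset, the directory and polling constants) is our own assumption about how such a daemon could be organized, not the authors' implementation.

import os
import time
from typing import Optional

DATA_DIR = "data"      # data directory (analyst-defined; assumed name)
WAIT_SECONDS = 5       # waiting time between checks (analyst-defined)

class Instance:
    """Stand-in for a remote application instance; a real TM would query
    the remote process instead of reading these flags locally."""
    def __init__(self, host: str) -> None:
        self.host = host
        self.busy = False                 # available when not busy
        self.task: Optional[str] = None   # registered task, if any

def oldest_dataset() -> Optional[str]:
    files = [os.path.join(DATA_DIR, f) for f in os.listdir(DATA_DIR)]
    return min(files, key=os.path.getmtime) if files else None

def task_manager(instances: list[Instance]) -> None:
    registered: list[Instance] = []
    while True:                            # daemon: runs until stopped
        dataset = oldest_dataset()         # line 3: oldest dataset -> task
        if dataset is None:
            time.sleep(WAIT_SECONDS)       # no dataset: wait and re-check
            continue
        idle = next((i for i in instances if not i.busy), None)
        if idle is None:
            time.sleep(WAIT_SECONDS)       # all busy: wait and re-check
            continue
        idle.busy, idle.task = True, dataset   # line 5: assign and register
        registered.append(idle)
        for inst in [i for i in registered if i.task and not i.busy]:
            print("finished:", inst.task)  # lines 6-8: report finished tasks
            inst.task = None
            registered.remove(inst)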

Figure 2 shows a scheme where we use a rule engine as the application to be distributed. In this case, a task consists of a dataset to be processed and a set of rules to be evaluated. As can be seen, there are two directories from which the Task Manager is fed. If a dataset exists, the TM searches the rules directory for the rules that share the same identifier as the selected dataset and builds the task. When a task is generated, the TM checks whether any of the running application instances is available to process a new task.

Fig. 2. Task Manager operations diagram for processing the Windows Operating System logs by a rules engine.

In this case, when an application instance finishes, the TM searches the alerts database for the alerts that correspond to the identifier of the finished task. If there are alerts, they are reported; otherwise, it is reported that no alerts were issued for that task.
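The identifier-based pairing and the alert lookup can be sketched as follows; the file-naming convention, the .rules extension, and the alerts table layout are assumptions made for illustration.

import glob
import os
import sqlite3

def build_task(dataset_path: str, rules_dir: str) -> tuple[str, list[str]]:
    # Pair the dataset with the rule files that share its identifier.
    task_id = os.path.splitext(os.path.basename(dataset_path))[0]
    return task_id, glob.glob(os.path.join(rules_dir, task_id + "*.rules"))

def report_alerts(db: sqlite3.Connection, task_id: str) -> None:
    # Once the instance has finished, look up the alerts for this task.
    rows = db.execute("SELECT message FROM alerts WHERE task_id = ?",
                      (task_id,)).fetchall()
    if rows:
        for (message,) in rows:
            print(f"[{task_id}] ALERT: {message}")
    else:
        print(f"[{task_id}] no alerts were issued for this task")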

4 Case Study

As a distributed processing platform, the application (in this case the rule engine) can be located in a cluster with an unrestricted number of processing nodes, as well as on a single computer. The experiment included five personal computers (PCs), each equipped with a quad-core processor and 4 GB of RAM. The TM was installed on one of them, and the RE was installed on the remaining four.

A test scenario was created in a laboratory with 20 PCs with the Windows 8 operating system installed, in order to show the improvement in terms of efficiency that the proposal can provide. The experiment consists in analyzing the Windows event logs generated by each PC (see Fig. 3). For this, an agent was used to capture each new log generated and create a dataset D, which it sends to the TM data directory. This agent was installed on each of the 20 PCs used in the experiment. The analyzed logs are associated with Windows application, system, and security events.

Fig. 3. Test scenario for the analysis of Windows Operating System event logs.

When the agent is executed for the first time, it sends an initial dataset with all the event logs stored by the system and then waits for new logs to send to the TM data directory. Taking this into account, it was defined that the size of each dataset to be sent does not exceed a threshold defined by the analyst (in this case, 20,000 logs). In this way, the first dataset represents a data flow of around 20,000 event logs per PC. Running all the agents at the same time generates 20 datasets to be processed. The time to process each dataset is measured from the moment the dataset is copied into the directory until the TM reports the alerts associated with the processed dataset.
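The batching behavior of the agent can be sketched as follows; we abstract the Windows event logs as an iterable of records, and the directory name and file-naming scheme are our assumptions.

import itertools
import os
import uuid
from typing import Iterable

THRESHOLD = 20_000        # analyst-defined maximum dataset size
TM_DATA_DIR = "tm_data"   # TM data directory (assumed name)

def send_in_batches(logs: Iterable[str], pc_name: str) -> None:
    it = iter(logs)
    while True:
        batch = list(itertools.islice(it, THRESHOLD))
        if not batch:
            break
        # The file name carries an identifier so the TM can pair the
        # dataset with its rules and, later, with its alerts.
        path = os.path.join(TM_DATA_DIR, f"{pc_name}-{uuid.uuid4().hex}.log")
        with open(path, "w", encoding="utf-8") as f:
            f.write("\n".join(batch))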

Five rules were designed for the experiments, which evaluate regular expressions on the different fields of the event logs. The rules were created intentionally to match some test logs, in order to validate the operation of the system.
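As an example of the kind of rule used, the following checks a regular expression against a field of an event log; the field names and the chosen pattern are illustrative (event ID 4625 is the Windows "failed logon" security event), not the actual rules of the experiment.

import re

FAILED_LOGON = re.compile(r"\b4625\b")  # Windows security event ID 4625

def rule_failed_logon(log: dict) -> bool:
    # Condition X: the event-ID field matches the regular expression.
    return bool(FAILED_LOGON.search(log.get("event_id", "")))

log = {"channel": "Security", "event_id": "4625", "message": "logon failure"}
if rule_failed_logon(log):
    print("ALERT: failed logon detected")  # action Y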

In order to evaluate the advantage of the proposed Task Manager, two experiments were designed. In the first, a single instance of the rule engine was executed without the TM; in the second, four instances of the distributed rule engine were executed on different PCs using the TM.

Figure 4 shows a comparison between the time taken by a rule engine instance, without the TM (Experiment 1), to process the generated event logs, and the time taken by the TM with four instances of the RE (Experiment 2).

Fig. 4. Comparison of the rule engine processing time with and without Task Manager.

In Experiment 1, the time taken to evaluate the five rules over the first four generated datasets (80,000 event logs) was 2.5 s. In Experiment 2, the generated datasets were analyzed in parallel, and the time taken to process the same four datasets was 1.2 s.
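Making the comparison explicit, with \(t_{1} = 2.5\) s and \(t_{2} = 1.2\) s, the speedup of the distributed configuration is \(t_{1}/t_{2} \approx 2.1\), i.e., a reduction of \(1 - t_{2}/t_{1} = 1 - 1.2/2.5 = 52\%\) in processing time.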

The results achieved show that, using the TM, the time taken to process 400,000 event logs was reduced by more than 50% with respect to a single instance of the rule engine.

Extrapolating from the above, it can be estimated that, using 8 PCs to increase the distributed processing capacity (one PC per RE), around 800,000 logs could be processed in less than 6 s.
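This estimate is consistent with the measurements under the assumption that processing time is dominated by the per-instance load: \(800{,}000/8 = 100{,}000\) logs per instance, the same load as the \(400{,}000/4 = 100{,}000\) logs per instance of Experiment 2, so a comparable total processing time is to be expected.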

The experiments performed show that our proposal is scalable and highly efficient, making it feasible to apply in scenarios where it is necessary to analyze large volumes of data in real time, or very close to it. It is important to note that, in some scenarios, "real time" depends on the requirements or demands of the user, that is, on what the user considers a real-time analysis.

5 Conclusions

In this paper, we have presented a framework, called Task Manager, for the distributed processing of data streams.

The case study showed the improvement, in terms of efficiency, of the RE application when using our proposal for large-scale filtering of Windows event logs. As can be seen from the experiments performed, the proposed system is scalable, which implies that the computing capacity is directly related to the number of processing units included. Another aspect to be highlighted is that homogeneity is not required among the computational units; that is, both conventional PCs and blade clusters can participate. Other important advantages of this tool are its flexibility and its adaptability.

Our next step is to use dedicated hardware for high-performance tasks, which could considerably increase the performance of the proposal, making it competitive with other distributed systems.