
1 Introduction

In recent years, techniques for data storage and processing (e.g., Big Data [1] and Cloud Computing [2]) have become well suited to sensor streaming; as a result, data accumulate quickly and dynamically. In this situation, determining how to make precise predictions and evaluations by taking historical data as a reference in such a dynamic, uncertain, and complicated environment is a huge challenge. In addition, existing prediction models mainly focus on processing static data; they are not appropriate for dynamic environments and cannot feed real-time information back into the model. Moreover, many sensors are distributed in the sensing environment, and the sensor streams may be highly correlated [3], i.e., have relationships with each other. The relationships among these sensor streams may change over time: a data stream that is highly dependent on one stream may, as time passes, become correlated with another stream instead. We can recognize this situation as a Concept Drift-like [4] phenomenon; such correlation changes raise the error rate of a model, making its prediction results unusable.

Hence, in this study we present a new solution that not only can dynamically detect and react to concept drift between data streams as time changes, but is also suitable for a real-time dynamic environment. Given the problems discussed above, the objective is to propose an improved solution that can dynamically analyze concept drift and make precise predictions in a dynamic environment. Dynamic data driven application systems (DDDAS) [5] form a new paradigm in which real-time online or archival data are dynamically integrated with other engineering models, with instant feedback into a running model. The DDDAS concept offers a dynamic data processing and feedback architecture, enhancing the efficiency and effectiveness of both the data processing and the model architecture. However, determining how to detect concept drift between sensor streams in a dynamic environment remains challenging; if this issue is not solved, the model's error rate will rise as time passes. The dynamic weighted majority (DWM) [6] approach may help to solve the concept drift problem. The DWM algorithm is an ensemble method with a dynamic weighted voting mechanism, so we can use multiple combinations of data streams and real-time prediction approaches to determine which data stream's prediction has the highest weight at each time step; that stream also has the highest correlation.

In the next section, we describe the concepts of dynamic data driven application systems, dynamic weighted majority, and concept drift. In Sect. 3, the modified dynamic weighted majority algorithm based on distributed dynamic data is proposed. In Sect. 4, we show the experimental design, evaluation, and results. Finally, the conclusions are presented in Sect. 5.

2 Related Works

2.1 Dynamic Data Driven Application System

A dynamic data driven application system (DDDAS) is a real-time feedback control system. It is a new paradigm proposed by the National Science Foundation (NSF) to address the shortcomings of traditional simulations, predictions, and measurements. DDDAS provides a model with a more reliable outcome, a stable data process, and accurate prediction or analysis results; it also allows a model to dynamically receive and respond to data [7]. The NSF proposed the DDDAS concept on the basis of two main events. First, on January 24–25, 2000, a hurricane struck the major American city of New Orleans; unfortunately, the relevant authorities could not make the prediction precisely and in real time, which led to a huge disaster. Second, scientists could not simulate the real wildfire that broke out beside the Los Alamos National Laboratory. After these disasters, researchers found that most existing models were unable to capture the instantaneous reactions of a real environment, and that most model parameters were fixed and unchangeable. Such models therefore produce unexpected errors and are unfit for handling dynamically changing situations [7].

The DDDAS concept describes the dynamic capability of system processing and control. DDDAS coordinates real-time data in a runtime system; hence, it not only provides more timely statistics but also offers a feedback mechanism for dynamic model enhancement and experiment improvement [8–11]. In recent years, improvements in computing (e.g., Cloud Computing and Grid Computing) and in experimentation (e.g., data storage techniques) have sped up the adoption of the DDDAS architecture. The DDDAS architecture mainly includes the user controller, a dynamic visualization interface, dynamic computation modules, and real-time dynamic data gathering modules [12]. These components integrate automatic feedback, measurement, simulation, and a control mechanism, and they work in a dynamic way. Users interact with the other components of a DDDAS concept model via the dynamic visualization module; the real-time dynamic data gathering modules collect responses and timely data from measurement instruments (e.g., sensors and databases); and the dynamic computation modules handle the mathematical models and prediction computation. Owing to this architecture, DDDAS provides efficient and stable ways to handle real-time situations in the real world.

In the past, when facing weather, agriculture, and contaminant-tracking problems, we used historical data as a prediction system's input. However, a model that relies only on historical data cannot reflect the real situation or provide a real-time feedback mechanism. DDDAS offers a real-time feedback mechanism that transfers data to a computing model, thereby enhancing the model's accuracy. In environmental science and agriculture (e.g., greenhouse gas emission and river pollution monitoring), DDDAS allows parameters to be adjusted and changed, which makes the model more scalable [13]. Moreover, it can be applied to hurricanes [14]: rather than relying on numerical data alone, DDDAS can also use graphics and sensor data for real-time computing to raise the prediction success rate [15]. Frederica Darema, who proposed the DDDAS concept, pointed out that the vision of DDDAS encompasses more than real-time control. However, some challenges require future work; one is the uncertainty of computing, which causes prediction error. To deal with this problem, establishing the correlations among data is a key to success; in this study we try to solve the correlation problem and raise the efficiency.

2.2 Dynamic Weighted Majority (DWM)

The Weighted Majority Algorithm (WMA) comes from machine learning. WMA maintains a pool of prediction algorithms (e.g., a group of classifiers, or a group of the same or different approaches) without any prior knowledge. Unlike common prediction methods, WMA makes decisions by group voting; because of this broader approach to decisions, it makes fewer mistakes than a single prediction approach. The algorithm proceeds as follows in each trial: the same instance is fed to every prediction algorithm in the pool; each algorithm makes its prediction, and WMA collects the results; WMA then makes the final prediction by selecting the majority of the results. Running this algorithm not only yields a majority-voting predictor but also reveals which prediction method best fits the situation. The Dynamic Weighted Majority (DWM) algorithm is based on WMA. As discussed above, WMA performs weighted voting based on the ensemble method: it combines a group of prediction approaches, treats each approach as an 'expert' with its own weight, and compares the result to that of a single prediction approach. The strength of this algorithm is its use of group decisions, a mechanism that provides more stable and accurate output. Building on this foundation, DWM extends the advantages of WMA and makes them dynamic: DWM adds a threshold (θ) to the algorithm, and an expert's weight is reduced by the multiplicative constant β (0 < β < 1) whenever the expert errs.
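As a minimal illustration of the weighted-voting idea (a sketch under our own assumptions, not the original WMA implementation), the following Python fragment takes the weighted vote of a pool of experts and multiplies the weight of each mistaken expert by β:

```python
# Minimal sketch of weighted-majority voting. 'experts' is a list of
# hypothetical callables returning a label; beta in (0, 1) penalizes
# experts that predicted incorrectly.
def weighted_majority_step(experts, weights, x, y_true, beta=0.5):
    predictions = [expert(x) for expert in experts]
    votes = {}
    for y_hat, w in zip(predictions, weights):
        votes[y_hat] = votes.get(y_hat, 0.0) + w   # accumulate weighted votes
    y_global = max(votes, key=votes.get)           # weighted majority decision
    weights = [w * beta if y_hat != y_true else w  # demote mistaken experts
               for w, y_hat in zip(weights, predictions)]
    return y_global, weights
```

After many trials, the expert that retains the largest weight indicates which prediction method best fits the situation.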

2.3 Concept Drift

Concept drift is a phenomenon that occurs when a prediction model makes predictions but, as time passes, the characteristics of or correlations within the data change in unforeseen ways, causing the predictions to become less accurate [4]. Domains in which such changing concepts often occur include customer preferences and weather prediction. Take rainfall prediction as an example: if a model performs well in summer, when the season changes to autumn the model's parameters will be unable to detect the change, and using the old parameters to predict autumn rainfall will often produce wrong predictions. Concept drift can be distinguished as either sudden or gradual. In this study, we focus on sudden concept drift and apply it in the experiments. The concept drift problem has become a popular issue in the data mining and weather domains, and it happens even more often in dynamic environments.
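For intuition, a toy stream with a sudden drift can be constructed as below; this is our own illustrative example, not the paper's data, and the drift point and slopes are arbitrary:

```python
import random

# Toy stream with a sudden concept drift at t = 100: the relationship
# between x and y flips sign, so a model fitted before the drift
# starts making systematic errors afterwards.
def drifting_stream(n=200, drift_at=100):
    for t in range(n):
        x = random.gauss(0.0, 1.0)
        slope = 2.0 if t < drift_at else -2.0   # sudden change of concept
        y = slope * x + random.gauss(0.0, 0.1)
        yield t, x, y
```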

3 A Dynamic Data Driven Application System-Based Dynamic Weighted Majority Algorithm

In a dynamic environment the running model receives data from sensors or a database dynamically, but traditional prediction algorithms lack the capability of dealing with real-time data. As discussed before, data sets in a dynamic environment are usually highly correlated, and at each time step the relationships among them can change. If we use some data sets to support the prediction of a main target data set, and over time the relationship between the target data set and the support data sets changes without the algorithm detecting it, the prediction error rate will increase. This situation is also recognized as a concept drift problem. In the following sections we propose a dynamic algorithm that fits this environment and solves the concept drift problem, as well as determining the correlations between the target data set and each support data set.

3.1 Dynamic Data Driven Application System-Based Dynamic Weighted Majority Model

This study proposes a dynamic weighted majority prediction algorithm based on DDDAS, which comprises three parts. First, the modified DWM algorithm process is introduced; it provides a dynamic framework that controls the whole system. Second, a dynamic weight voting, adjusting, and real-time feedback mechanism is proposed to choose the best support data stream at each time step. The third part is the dynamic data stream prediction component; it provides a dynamic way to import historical or real-time data stream nodes into the model. Each sensor node has two prediction approaches: the data stream node provides dynamic data from the real world in a dynamic data driven manner, and the prediction approaches compute on the combination of the historical and real-time data streams.

3.2 Modified Dynamic Weighted Majority Algorithm

The dynamic weighted majority provides a weight-based selection and feedback mechanism. In this section, we present a modified DWM that fits our assumptions and the dynamic environment. The modified DWM takes every combination of a data stream and a prediction approach and treats that combination as an 'expert' making its own predictions. There are four main steps in the modified DWM procedure.

  • Step 1: At the initial stage, the algorithm averages all of the experts' weights and makes a prediction.

  • Step 2: The algorithm gathers each expert's result and takes the result of the highest-weight expert as the global answer.

  • Step 3: The modified DWM compares the global result to the next time step's real result. In the first round of the algorithm, because the weights of all experts are equal, the algorithm compares each expert's result to the real result at that time.

  • Step 4: Lastly, after comparing the results, if an expert's result is close to the real data, the algorithm increases that expert's weight by formula (1); conversely, if the result is a mistake, the algorithm decreases the expert's weight by formula (2). Here, \( {\text{Weight}}_{t} \) represents the weight of the expert that produced the final result, α represents the speed of raising the weight, and β represents the speed of decreasing the weight. A code sketch of these four steps follows formulas (1) and (2).

$$ {\text{Weight}}_{t + 1} = {\text{Weight}}_{t} + {\text{Weight}}_{t} *\upalpha $$
(1)
$$ {\text{Weight}}_{t + 1} = {\text{Weight}}_{t} - {\text{Weight}}_{t} *\upbeta $$
(2)
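One possible reading of Steps 1–4 with the updates (1) and (2) is sketched below; the tolerance used to decide whether a prediction is "close to" the real data is our own hypothetical parameter:

```python
def modified_dwm_step(experts, weights, x, y_true, alpha=0.01, beta=0.1, tol=1.0):
    """One time step of the modified DWM (illustrative sketch).

    experts : callables, each a (data stream, approach) combination
    weights : current expert weights; the global answer follows the
              highest-weight expert (Step 2)
    tol     : assumed closeness tolerance for numeric predictions (Step 4)
    """
    predictions = [expert(x) for expert in experts]          # gather each expert's result
    best = max(range(len(experts)), key=lambda i: weights[i])
    y_global = predictions[best]                             # answer of the highest-weight expert
    new_weights = []
    for w, y_hat in zip(weights, predictions):               # Steps 3-4: compare to the real result
        if abs(y_hat - y_true) <= tol:
            new_weights.append(w + w * alpha)                # formula (1): raise the weight
        else:
            new_weights.append(w - w * beta)                 # formula (2): lower the weight
    return y_global, new_weights
```

In the first round all weights are equal, so the comparison effectively applies to every expert individually, as Step 3 describes.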

3.3 Threshold and Feedback Mechanism

In addition to the DWM mechanism explained in this section, the proposed model provides a feedback mechanism suited to the dynamic environment. In a real-time situation there are numerous correlations between data streams; in order to detect these phenomena, we take each stream as an individual expert and give it a weight. Figure 1 shows the feedback process. Here θ represents a threshold: if an expert's weight falls below this value, we consider the expert useless in the current time period and make it sleep for a period; after the sleep period, it returns to the prediction pool with the average weight. Through this weight adjustment, the model recognizes the weight changes from the previous loop and feeds them back into the model. In this way, the model learns the correlation between the chosen data and the target data.

Fig. 1. Feedback process
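A sketch of the threshold-and-sleep rule in Fig. 1 is given below; the sleep length and the bookkeeping structure are our own assumptions:

```python
SLEEP_STEPS = 10   # assumed sleep length; not specified in the text

def apply_threshold(weights, asleep, theta):
    """Experts whose weight falls below theta go to sleep; experts whose
    sleep period has ended rejoin the pool with the average weight of the
    currently active experts (illustrative sketch)."""
    active = [w for i, w in enumerate(weights) if asleep.get(i, 0) == 0]
    avg = sum(active) / len(active) if active else 1.0
    for i in range(len(weights)):
        if asleep.get(i, 0) > 0:
            asleep[i] -= 1
            if asleep[i] == 0:
                weights[i] = avg          # return with an average weight
        elif weights[i] < theta:
            asleep[i] = SLEEP_STEPS       # considered useless in this period
    return weights, asleep
```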

3.4 Dynamic Data Stream Prediction Component

As Fig. 2 shows, each data stream can be seen as an independent component that combines the input data stream with two approaches: an autoregressive approach and a neural network approach. Each data stream thus does two things: it makes its own local prediction and it carries a weight. The component works in a dynamic data driven way: when the model needs a particular data stream and prediction approach, it triggers that specific combination according to its weight and makes the prediction. Through this data driven approach, the model conserves computing resources and time. In this study, we take each stream as a location; each location has its own approach for making a prediction and its own weight. With three locations, there are three such components; one is set as the target, and the others are set as support components. A sketch of one such component follows Fig. 2.

Fig. 2. Dynamic data stream prediction component
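The per-stream component might be organized as below. The two local predictors stand in for the autoregressive and neural network approaches of Fig. 2; the lag order and the simple trend stand-ins are our own assumptions, not the paper's models:

```python
import numpy as np

class StreamComponent:
    """Sketch of one dynamic data stream prediction component."""

    def __init__(self, history, lag=3):
        self.history = list(history)   # historical plus real-time observations
        self.lag = lag
        self.weight = 1.0              # the component's own weight

    def append(self, value):
        self.history.append(value)     # real-time data driven into the component

    def ar_predict(self):
        """Autoregressive stand-in: mean of the last `lag` observations."""
        return float(np.mean(self.history[-self.lag:]))

    def nn_predict(self):
        """Neural-network stand-in: linear trend extrapolation."""
        y = np.array(self.history[-self.lag:], dtype=float)
        slope = (y[-1] - y[0]) / max(self.lag - 1, 1)
        return float(y[-1] + slope)
```

The model triggers `ar_predict` or `nn_predict` on a specific component according to its weight, so only the needed combination is computed at each time step.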

3.5 Summary

The advantage of the proposed method lies in its capability to dynamically drive specific data into the running model and in its ability to recognize the correlations between the support data sets and the target data set. A traditional prediction model only provides static data processing; placed in a dynamic environment, it might suffer critical errors at runtime. Moreover, as time passes, the data relationships may change; if a prediction model relies on fixed specific data to support the target data prediction, its error rate will increase. It is therefore necessary to find the links between data at each timestamp in a dynamic environment. In this study, we propose a novel dynamic weighted majority model based on a dynamic data driven application system to discover the data correlations and provide a solution for dynamic environments. The proposed model not only provides real-time data processing, but also detects data relationships dynamically.

4 Simulation Analysis

The proposed model is implemented in a real-time, dynamic environment that follows a time-flow concept. In this environment, new data arrive at each time step and the model makes predictions with these new data rather than with historical data alone. In addition, we assume that the concept drift phenomenon can occur at any timestamp; in other words, the relationship between the support data sets and the target data set can change at each timeslot.

4.1 The Simulation Data Sets

A group of data sets is generated according to specific rules. The data sets are designed to eliminate uncertainties and to test whether the model works. We generate a rainfall data set to validate the model. The basic rule was introduced by M. G. Lawrence [23]; it provides a simple formula (3) for the relative humidity, where T represents temperature and \( {\text{T}}_{\text{d}} \) represents the dew point. The dew point reflects the water vapor in the air, and we can estimate the probability of rainfall from this benchmark. We then extend this formula into a novel formula (4) that simulates the rainfall per hour. Here X is the timestamp, increasing in an arithmetic progression, C signifies the clouds in the sky, and W represents a random seed term used to make the data irregular.

$$ {\text{RH }} = \, 100 \, - \, 5 \, ({\text{T}} - T_{d} ) $$
(3)
$$ {\text{Rainfall}}\left( {\text{R}} \right) \, = 100 - 5*({\text{T}} - T_{d} )*{\text{ X}} + {\text{C}} + {\text{W}} $$
(4)

Based on this rule, we generate a data stream with 250 timestamps as the target data stream. Other support data sets are then created so that each correlates with the target data set in a specific time period. In these generated data sets, we place four correlation-transfer sections in the time periods 40–60, 90–110, 140–160, and 190–210; the generated target data show sudden value changes in these periods. We then generate the four support data sets by the same formula (4) rule, and we create time-shifted correlations between the target data stream and the four support streams. For example, in Fig. 3, support stream 1 is designed to have a strong relationship with the target data set in the 40–60 time period; 30 timestamps later, the strong relationship shifts from support stream 1 to support stream 2 in the 90–110 time period. In these simulation data sets, the last two relationship changes likewise happen after 30-timestamp gaps.

Fig. 3. Support data set 1 and target data set
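A sketch of the target-stream generator built from formula (4) is shown below; the ranges of T, T_d, C, the noise W, and the timestamp scaling are our own assumptions:

```python
import random

def rainfall(x, T, Td, C, W):
    return 100 - 5 * (T - Td) * x + C + W   # formula (4)

random.seed(0)
target = []
for t in range(250):                        # 250 timestamps, as in the text
    T = 25 + random.uniform(-2, 2)          # temperature (assumed range)
    Td = 20 + random.uniform(-2, 2)         # dew point (assumed range)
    C = random.uniform(0, 5)                # cloud term (assumed range)
    W = random.uniform(-3, 3)               # random term for irregularity
    target.append(rainfall(0.01 * t, T, Td, C, W))
# The four support streams follow the same rule, but each is forced to
# track the target only inside its own window (40-60, 90-110, 140-160,
# 190-210), producing the time-shifted correlations described above.
```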

4.2 Simulation Results

There are two metrics for evaluating the prediction performance, defined as follows:

$$ {\text{Prediction Correct Rate }}\left( {\text{CR}} \right):\frac{\text{Total weighted correct times}}{\text{Total prediction times}} $$
$$ {\text{Total Concept Drift Detect Rate }}\left( {\text{CDR}} \right):\frac{\text{Total relationship changes detected}}{\text{Total relationship changes}} $$
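The two metrics reduce to simple ratios; the counters below are assumed to be collected during the simulation run:

```python
def correct_rate(correct_predictions, total_predictions):
    """Prediction Correct Rate (CR)."""
    return correct_predictions / total_predictions

def drift_detect_rate(detected_changes, total_changes):
    """Total Concept Drift Detect Rate (CDR)."""
    return detected_changes / total_changes

print(correct_rate(195, 220))    # 25 mistakes in 220 predictions -> ~0.886
print(drift_detect_rate(4, 4))   # all four relationship changes found -> 1.0
```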

In this case, we set the parameters α = 0.01 and β = 0.1. Figure 4 plots the weight changes: the target data set's relationship passes to support streams 2–4 consecutively. Because the relationships change over time, support stream 1's weight rises first, since its data are generated by the rule and it makes no error during its correlation period, so its weight rises above those of the other support data sets. As previously mentioned, there are four relationship changes, and the experiment shows the corresponding weight changes of the four support streams. This means the algorithm switches the support set used to back the target data set's prediction and finds the concept shift in each time period. The algorithm makes 25 mistakes over 220 hours of predictions; as a result, the CR reaches 89 %, and all of the concept drift is detected, so the CDR is 100 %.

Fig. 4. Weight of each support data set

5 Conclusion

In this study, we proposed a Distributed Dynamic Data Driven Application System-based Dynamic Weighted Majority that uses dynamic data sets to construct correlations and find concept drift. The algorithm is based on DDDAS, giving it the capability to face a dynamic environment and to drive specific data at each time step so as to reduce computation. The DWM algorithm then provides a dynamic voting scheme that gives each data set a dynamically adjusted weight, in order to find the correlations among the data sets and to support the prediction of the target data set. Moreover, this study presents simulation experiments in which one target data set and four support data sets support the data prediction, with concept drifts designed between the target data set and the support data sets. This study proposed two metrics to measure the algorithm: the Prediction Correct Rate (CR) and the Total Concept Drift Detect Rate (CDR). The results show that CR = 89 % and that all of the concept drifts were detected. This demonstrates the algorithm's capability for dynamic data handling and concept drift detection, and provides a solution for prediction and concept drift detection in dynamic environments.