
1 Introduction

The analysis of huge volumes of fast-arriving data has recently been the focus of intense research, because such methods can give a company a competitive advantage. One useful approach is data stream classification, which is employed to solve problems such as discovering changes in client preferences, spam filtering, fraud detection, and medical diagnosis, to enumerate only a few. Basically, there are two classifier design approaches:

  • “build and use”, which first focuses on training a model; the trained classifier then makes decisions,

  • “build, use, and improve”, which first tries to build a model quickly, then uses the trained classifier to make decisions while continuously tuning the model parameters.

The first approach is traditional and may be used only under the assumptions that the training data and the data to be recognized come from the same distribution, and that the number of training examples ensures the model is trained well (i.e., it is not undertrained). For many practical tasks such assumptions can be accepted; nevertheless, many contemporary problems do not allow them and require us to take into consideration that the statistical dependencies describing the classification task, such as the prior probabilities and the conditional probability density functions, may change. Additionally, we should respect the fact that data may arrive very fast, which makes it impossible for a human expert to label the arriving examples manually, so each object should be labeled by a classifier. The first problem is called concept drift [1], and efficient methods able to deal with it are still the focus of intense research, because the appearance of this phenomenon may cause a significant accuracy deterioration of the exploited classifier. Basically, the following approaches may be considered to deal with this problem:

  • detecting the drift and retraining the classifier,

  • rebuilding the model frequently, and

  • adapting the classification model to changes.

In this work, we focus on data streams without drift, or where the changes are very slow, smooth, and not significant. The proposed method employs the “build, use, and improve” model, i.e., it can be used to classify incoming examples even when the classifier’s model still requires training. Thus, we concentrate on the family of online learning algorithms that continuously update the classifier parameters while processing the incoming data. Not all types of classifiers can act as online learners. Basically, they have to meet some basic requirements [2]: (i) each object is processed only once in the course of training; (ii) the memory and processing time are limited; and (iii) the training process can be paused at any time, and the classifier’s accuracy should not be lower than that of a classifier trained on the batch of data collected up to the given time. The contract implied by these requirements is sketched below.
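The following minimal Python sketch illustrates this contract; it is our illustration of requirements (i)–(iii), not part of the original method, and the names `OnlineLearner`, `update`, and `predict` are our own.

```python
# A minimal sketch of the contract implied by requirements (i)-(iii);
# the class and method names are illustrative assumptions.
from abc import ABC, abstractmethod
import numpy as np

class OnlineLearner(ABC):
    @abstractmethod
    def update(self, x: np.ndarray, label: int) -> None:
        """Process a single example exactly once (i), using bounded
        memory and processing time (ii)."""

    @abstractmethod
    def predict(self, x: np.ndarray) -> int:
        """Usable at any moment: training may be paused and the current
        model queried without a large accuracy penalty (iii)."""
```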

In this research we adopt online learning for data stream classification, but the proposed model also takes into consideration additional information about the incoming objects. We assume that the decisions are not independent: each may also depend on the previous classification. This situation is typical for medical diagnosis, especially the diagnosis of chronic diseases, such as hypertension diagnosis and therapy planning or, as considered in this work, human acid-base state recognition. For such diagnostic tasks, not only the recent observations of the patient are taken into consideration when a decision is made; the doctor should also consider the patient’s history, such as previously ordered therapies and diagnoses.

There are several works related to recognition with context. The first was published by Raviv in 1967 [3], who applied the Markov chain model to character recognition. “Classification with context” has been promoted especially by Toussaint [4] and Haralick [5], and it has been widely applied in several domains, such as image processing and medical diagnosis [6]. The main contribution of this work is the proposition of a novel data stream classifier with context, which makes a decision on the basis of the recent observations of the incoming object and additionally takes the previous classifications into consideration. According to the restrictions of data stream analytics tools, it uses limited memory and computational time.

2 Probabilistic Model of On-line Data Stream Classification

As we mentioned before, in many pattern classification tasks there exist dependencies among the patterns to be classified, e.g., character recognition (especially if we have the prior knowledge that the incoming characters form words of a given language, because then the probability of a character’s appearance depends on the previously recognized sign), image classification, and chronic disease diagnosis (where historical data plays a crucial role in a high-quality medical assessment), to name only a few.

The formalization of the classification task requires that the successively classified objects be related to one another, but it could also take into consideration the occurrence of external factors changing the character of these relationships. Let us illustrate this task with an example of medical diagnosis. The aim is to classify the successive states of a patient. Each diagnosis can be the basis for a certain therapeutic procedure. Thus, we obtain a closed-loop system, in which the object under observation can simultaneously be subjected to a control (treatment) that depends on the classification.

In this research we do not take the control into consideration, although it could be recognized as a cause of concept drift. In contrast to the traditional concept drift model, where drift appears randomly, in this case the drift has a deterministic nature. In the future, we are going to extend the model by applying methods related to so-called recurrent concept drift [7].

Let us present the mathematical model of the online classifier with context [4]. Online classification consists in classifying a sequence of observations of the incoming objects. Let \(x_n \in \mathcal {X}\subseteq \mathbb {R}^d\) denote the feature vector characterizing the nth object and \(j_n \in \mathcal {M}=\{1,...,M\}\) be its label. The probabilistic approach implies the assumption that \(x_n\) and \(j_n\) are observed values of the pair of random variables \(\mathbf {X}_n\) and \(\mathbf {J}_n\). Let us model the interdependence between successive classifications as a first-order time-homogeneous Markov chain. Its probabilistic characteristics, i.e., the transition probabilities and the initial probabilities, are given by the following formulas, which describe the probability that a given object belongs to class \(j_n\) if the previous decision was \(j_{n-1}\).

$$\begin{aligned} p_{j_n,j_{n-1}}&=P(\mathbf {J}_n=j_n|\mathbf {J}_{n-1}=j_{n-1},\mathbf {J}_{n-2}=j_{n-2},...,\mathbf {J}_1=j_1) \nonumber \\&\qquad \qquad \qquad =P(\mathbf {J}_n=j_n|\mathbf {J}_{n-1}=j_{n-1}), \;\;\;\; j_n,j_{n-1},j_{n-2},...,j_1 \in \mathcal {M} \end{aligned}$$
(1)

The transition probabilities form the transition matrix

$$\begin{aligned} \mathbf {P}= \begin{bmatrix} p_{1,1}&p_{1,2}&\cdots&p_{1,M} \\ p_{2,1}&p_{2,2}&\cdots&p_{2,M} \\ \vdots&\vdots&\ddots&\vdots \\ p_{M,1}&p_{M,2}&\cdots&p_{M,M} \\ \end{bmatrix} \end{aligned}$$
(2)

and initial probabilities

$$\begin{aligned} p_{j_1}=P(\mathbf {J}_1=j_1), \;\; j_1 \in \mathcal {M} \end{aligned}$$
(3)

form the vector of initial probabilities

$$\begin{aligned} \mathbf {p}= \begin{bmatrix} p_1&p_2&\cdots&p_M \\ \end{bmatrix}^T \end{aligned}$$
(4)

We also assume that the probability distribution of the random variable \(\mathbf {X}_n\) given \(\mathbf {J}_n = i\) exists and is characterized by the conditional probability density functions (CPDFs)

$$\begin{aligned} f_i(x_n), \;\; i \in \mathcal {M}, \; x_n \in \mathcal {X} \end{aligned}$$
(5)

Additionally, the probability density functions \(f_n(\overline{x}_n|\overline{j}_n)\) exist and the observed stream (sequence of observations) is conditionally independent, i.e.,

$$\begin{aligned} f_n(\overline{x}_n|\overline{j}_n)=\prod _{k=1}^n f(x_k|j_k)=\prod _{k=1}^n f_{j_k}(x_k) \end{aligned}$$
(6)

where \(\overline{x}_n=(x_1,x_2,...,x_n)\) and \(\overline{j}_n=(j_1,j_2,...,j_n)\).

The optimal classifier \(\varPsi ^*\) makes a decision using the following formula

$$\begin{aligned} \varPsi ^*(\overline{x}_n)=\mathop {\arg \max }\limits _{k \in \mathcal {M}} p(k|\overline{x}_n) \end{aligned}$$
(7)

where the posterior probability

$$\begin{aligned} p(j_n|\overline{x}_n)=\frac{p_{j_n}f_{j_n}(\overline{x}_n)}{\mathop {\sum \limits _{k=1}^M}p_{k}f_{k}(\overline{x}_n)} \end{aligned}$$
(8)

could be calculated recursively

$$\begin{aligned} p_{j_n}f_{j_n}(\overline{x}_n)=f_{j_n}(x_n)\mathop {\sum \limits _{j_{n-1}=1}^M}p_{j_n,j_{n-1}}\,p_{j_{n-1}}f_{j_{n-1}}(\overline{x}_{n-1}) \end{aligned}$$
(9)

with the initial condition

$$\begin{aligned} p(j_1) f(x_1|j_1) = p_{j_1} f_{j_1}(x_1) \end{aligned}$$
(10)
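Equations (7)–(10) together form a simple forward recursion over the class supports. A minimal numpy sketch, assuming the class-conditional density estimates \(f_j(x_n)\) are available as a vector, might look as follows; renormalizing at each step is our addition to avoid numerical underflow and does not change the argmax in (7).

```python
import numpy as np

def init_support(p_init, density_x1):
    """Initial supports p_{j_1} f_{j_1}(x_1) from Eq. (10);
    density_x1[j] holds the CPDF estimate f_j(x_1)."""
    s = p_init * density_x1
    return s / s.sum()

def step_support(support, P, density_xn):
    """One step of recursion (9):
    s_n(j) = f_j(x_n) * sum_k P[j, k] * s_{n-1}(k),
    with P[j, k] = P(J_n = j | J_{n-1} = k) as in Eq. (2)."""
    s = density_xn * (P @ support)
    return s / s.sum()  # rescaling leaves the decision (7) unchanged

def classify(support):
    """Decision rule (7): the posterior (8) equals the support up to a
    common normalizing constant, so argmax over supports suffices."""
    return int(np.argmax(support))
```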

3 Proposed Algorithm

Let us notice that the posterior probabilities used as support functions can be calculated recursively, but this calculation requires knowledge of the CPDFs. We propose a hybrid approach based mostly on non-parametric estimation, which can be used in the chosen “build, use, and improve” model; however, as new examples arrive they must be stored, which may break the assumption about the memory limit. Because we have limited memory at our disposal, we store only a fixed number of the training examples used for CPDF estimation, and we update the probabilistic characteristics (transition and initial probabilities) continuously. Thus, the assumption about memory limitation is fulfilled. To ensure that the classification model is updated, we add a new incoming example to the training set only if its label is available, which is consistent with the assumption that an expert is not always available. In the future we can control which of the objects should be labeled, e.g., using the active learning paradigm [9] as proposed in the author’s previous works [10], where the decision about labeling an example is made on the basis of its distance from the classifier’s decision boundary. Whenever a new example is stored, the oldest example in the dataset is removed to keep the training set at a fixed size. Such a procedure allows us to continuously update the training set and keep in memory the examples that are most relevant to the recent concept.

The detailed description of the method is presented in Algorithm 1. To calculate the support functions we use a procedure inspired by the \(k_n\)-Nearest Neighbor CPDF estimator [11], whose pseudocode is presented in Algorithm 2 (a sketch of this scheme is given below). Such a method is not very computationally efficient, but for a fixed size of the training set its processing time is predictable, which fulfills the condition related to the processing time limitation.

[Algorithm 1: pseudocode of the proposed online context-based classifier]
[Algorithm 2: pseudocode of the \(k_n\)-Nearest Neighbor CPDF estimation procedure]
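Since the original pseudocode is not reproduced here, the following sketch shows one way the two algorithms fit together: a fixed-size sliding window of labeled examples (the memory model of Algorithm 1) feeding a \(k_n\)-NN-style CPDF estimate (the role of Algorithm 2). It is our interpretation under stated assumptions, not the authors’ exact pseudocode; the window size and k are illustrative parameters.

```python
from collections import deque
import numpy as np

class KnnCpdfEstimator:
    """Sliding-window k-NN-style class-conditional density estimates:
    an interpretation of Algorithms 1-2, not the original pseudocode."""

    def __init__(self, window_size, k=5):
        # A deque with maxlen drops the oldest example when a new
        # labeled one is stored, keeping the training set a fixed size.
        self.window = deque(maxlen=window_size)
        self.k = k

    def add(self, x, label):
        self.window.append((x, label))

    def densities(self, x, M):
        """Estimate f_j(x) for each class j as k / (n_j * r^d), where r
        is the distance to the k-th nearest neighbor of x within class
        j; the unit-ball volume constant is omitted because it is
        common to all classes and cancels in the decision rule (7)."""
        out = np.zeros(M)
        d = x.shape[0]
        for j in range(M):
            pts = np.array([xi for xi, yi in self.window if yi == j])
            if len(pts) < self.k:
                continue  # too few stored examples of class j
            dists = np.sort(np.linalg.norm(pts - x, axis=1))
            r = dists[self.k - 1] + 1e-12  # avoid division by zero
            out[j] = self.k / (len(pts) * r ** d)
        return out
```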

4 Experiment

In order to study the performance of the proposed algorithm and evaluate its usefulness for computer-aided medical diagnosis, we carried out computer experiments on a real dataset containing observations of patients suffering from acid-base disorders. The human acid-base state (ABS) diagnosis task concerns disorders in the production and elimination of \(H^+\) and \(CO_2\) by the organism. During the course of treatment, recognition of the ABS is very important because the pH stability of physiological fluids is required. The ABS disorder has a dynamic character, and the current patient’s state depends on the symptoms and the previously applied treatment and classification. The diagnosis of ABS as a pattern recognition task includes five classes: respiratory acidosis, metabolic acidosis, respiratory alkalosis, metabolic alkalosis, and the normal state. The features contain the values of the gasometric examination of blood, i.e., \(pCO_2\), the concentration of \(HCO_3^-\), and pH, as well as the applied treatment (respiration treatment, pharmacological treatment, or no treatment). The dataset consists of sequences of observations of 78 patients (each sequence includes from 7 to 25 successive records). The whole dataset includes 1170 observations, which came from the Medical Academy of Wroclaw. The results of the experimental investigation are presented in Fig. 1.

Fig. 1. Dependencies between the length of the data stream and the accuracy for the online classifier which does not take context into consideration (denoted OLC-WC) and the online context-based classifiers with initial training set sizes 50 and 70 (denoted OLC-w50 and OLC-w70, respectively).

The accuracy was evaluated not on the basis of a separate validation set; instead, we used the schema called “test-and-train” [12]. Basically, a fixed number of examples is first used to build the first prototype of the classifier. Then each incoming object n is used to improve the model, which is evaluated on the next incoming example \(n+1\). A sketch of this protocol is given below.
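For completeness, here is a sketch of this evaluation loop, assuming the hypothetical `OnlineLearner` interface from the Introduction; `n_init` plays the role of the initial training set size (50 or 70 in Fig. 1).

```python
def test_and_train(stream, model, n_init):
    """Test-and-train evaluation sketch: the first n_init labeled
    examples build the initial model; every later example is first
    used for testing, then for improving the model."""
    correct, seen, curve = 0, 0, []
    for i, (x, y) in enumerate(stream):
        if i < n_init:
            model.update(x, y)        # build the first prototype
            continue
        correct += int(model.predict(x) == y)  # test first...
        seen += 1
        model.update(x, y)            # ...then train on the example
        curve.append(correct / seen)  # running accuracy, as in Fig. 1
    return curve
```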

As we can see, the methods which take context into consideration outperform the classifier that recognizes objects independently. We can also observe that the accuracy curves have an asymptotic characteristic. The accuracy of the proposed method is quite similar to the performance of the methods previously developed and applied to the problem of ABS diagnosis [13], but we have to underline that the proposed approach does not require as much memory, because only part of the data is stored. Of course, we realize that the scope of the experiments was limited, but the preliminary results encourage us to continue the work on the proposed methods in the future.

5 Final Remarks

A novel method of online classification with context was presented in this work. Its performance seems very promising; therefore, we would like to continue the work on the presented method. In the future we would like to include a mechanism able to take into consideration the model parameter changes caused by the applied control. We are going to extend the model by applying methods related to so-called recurrent concept drift. Additionally, we consider retraining the classifier in batch mode, i.e., the model will be improved not object by object but on the basis of collected data chunks, which could decrease the computational load of the algorithm. Finally, we are going to apply the classifier ensemble approach to the proposed method, e.g., by training a new individual classifier on each incoming data chunk or by improving the models of only selected individuals in the classifier pool.