1 Introduction

Time series play a pivotal role in the analysis, description, and classification of temporal phenomena. In a nutshell, a time series is a sequence of real numbers reported at discrete time moments. There is a broad spectrum of methods aimed at their analysis, interpretation, and control, yielding numeric models of time series.

The algorithms coming from the area of Computational Intelligence (CI) (Bezdek 1992, 1994; Pedrycz 1997) have found a visible position in the literature and have yielded a great deal of applications. CI has established itself firmly as a sound discipline given the flexibility, adaptive properties, and learning capabilities offered by this technology. The corresponding methods are well developed, subsequently resulting in a variety of successful applications. For instance, one can refer here to fuzzy automata and time series (Pedrycz and Gacek 2001), fractal analysis (Kamijo and Yamanouchi 2007), learning from time series (Perkins and Hallett 2010), time series prediction (Castillo and Melin 2007), and various applications to numerous areas such as ECG signal classification (Woei et al. 2007), gesture recognition (Juang and Ksuan-Chun 2005), and underwater protection systems (Tambouratzis and Pazsit 2009). Interestingly enough, these methods are neither transparent to the user nor do they provide an easily understood description of the signal. In contrast, the human-centricity of algorithms offers the highly needed feature of delivering descriptions of vast numeric data (time series) in a readable, easily interpretable way. For instance, transparent descriptions of signals or their classification come in the following form:

  • over a *long* time, the signal exhibits *high* positive values with a *moderately low* decrease

  • a *rapid* decline of the stock market

  • if a *significantly elevated* temperature is observed and associated with *rapid* changes of pressure, it is *very likely* that there is a *highly abnormal* behavior of the system

  • a *broad* range of frequencies is noted in the seismic signal, with *strongly emphasized* high frequencies of the entire spectrum.

In spite of the existing diversity of the descriptions presented above, all of them exhibit some commonality: the descriptions involve granular constructs (shown in italics) that are helpful in the description of time series and their further easy comprehension by end users. Likewise, when considering classification problems, the outcomes of classification can be articulated at the level of information granules quantifying the strength of belongingness (membership) to the classes identified in the problem at hand. In the above statements, the granular terms are helpful in realizing mechanisms of signal analysis, interpretation, and classification.

The main objective of this study is to propose a new way of representing and classifying time series within the framework of Granular Computing and to quantify its performance. We show that information granules are not only semantically meaningful building blocks of the models of time series (granular time series), but that they also offer a great deal of design possibilities and flexibility. These are made available because of the inherently nonlinear characteristics of fuzzy sets (used as information granules in the specific design constructs), which can be exploited to endow the resulting granular feature space with high discriminatory capabilities to be effectively used by classifiers. While incorporating some of the representation schemes commonly used in the description and modeling of time series, we move forward by introducing the concepts of the granular time series and the granular time series classifier. In the case of classifiers, we investigate the use of a relational classifier whose topology is helpful in interpreting classification outcomes and revealing a collection of classification rules.

The paper is organized as follows: We start with a brief discussion of the general flow of granular classification (Sect. 2), where we highlight the role of information granules in the overall processing scheme. In particular, we distinguish among several key topologies of the schemes by highlighting a variety of feature spaces and their granular representations, which give rise to the successive classification alternatives. In Sect. 3, we recall the main ways of forming feature spaces for time series, as these representation schemes are reported in the literature. The formation of information granules with the use of fuzzy clustering (Sect. 4) is an essential design step, and we show how the granules lead to a granular feature space. In the sequel, granular classifiers exploiting fuzzy relational calculus are introduced (Sect. 5). Here we also elaborate on the design of the fuzzy relations and, in this regard, stress the need for methods of population-based optimization. Sect. 6 reports on a comprehensive suite of experiments, and Sect. 7 concludes the paper. When it comes to the algorithmic setup, in this study we use the logic operators commonly encountered in fuzzy sets, namely \(t\)-norms and \(t\)-conorms.

2 A general flow of granular classification

Granular classification schemes comprise several functional modules. While there are some similarities with the commonly encountered schemes of pattern classification, we encounter here some significant differences. The essence of the overall scheme proposed in this study can be captured schematically as follows

$$\begin{aligned} \hbox {data }&\rightarrow \hbox {feature space}\rightarrow \hbox {granular feature space}\\&\rightarrow \hbox {interpretation}\rightarrow \hbox {classifier} \end{aligned}$$

and visualized in Fig. 1. First, let us briefly elaborate on the underlying functionalities of the modules present there. We highlight the phases where information granularity plays a pivotal role and facilitates interaction with the user.

Fig. 1 Granular time series description and classification: an overview of processing

Let us briefly elaborate on the essence of the successive phases of the overall processing scheme. This will help us stress the systematic and coherent development process proposed in this study, as well as highlight its novel facets along with the role of, and motivation behind, the information technologies exploited in the process.

2.1 Representation of time series

There is a remarkable diversity of representation schemes of time series, including those benefiting from the techniques present in CI. One can allude to time series representations completed in the temporal domain (say, AR and ARMA models, neural networks, wavelet-based neural networks, etc.) as well as those carried out in the frequency domain (e.g., spectral descriptors of time series). A number of recent studies reported in the literature serve as a testimony to these developments. The representations formed there are predominantly of numeric character: one returns a vector of numeric descriptors (features) characterizing the time series, which is then used in the consecutive phases of analysis, description, and classification.

2.2 Granular representation of time series

In this phase of the overall process, we form a collection of information granules over the feature space built with the already selected numeric representation scheme. Information granules can be designed in various ways; however, clustering or fuzzy clustering, along with their variants accommodating some mechanisms of supervision (as labeling of time series could be available), is commonly considered here.

2.3 Granular description of temporal data

Information granules (clusters) play an essential role here: they can be regarded as more abstract and interpretable entities forming a new granular feature space. Not much research has been completed so far, and the entire area is open for vigorous investigation. Here we may encounter a great deal of originality, as a number of fundamental questions about the process and the assessment of the resulting information granules have not been posed and need to be carefully addressed. In the formation of information granules and the overall granular representation space, we have to look both at interpretability and at the discriminatory properties delivered by the representation space (Al-Hmouz et al. 2013). Let us note that owing to the nonlinearity furnished by information granules (say, in terms of nonlinear membership functions of fuzzy sets), the performance of classification schemes (classifiers) can be enhanced.

Figure 2 depicts a diversity of granular classifiers. While we elaborate in more detail on the topology in later sections, here we stress the fact that there is a diversity of feature spaces involved in the architecture and a variety of ensuing granular feature spaces contributing to the formation of the classifier.

In what follows, we proceed with a detailed description of the successive phases of the general classification scheme.

Fig. 2 Selected alternatives in the representation of time series: (a) the same feature space associated with various granular feature spaces, (b) different feature spaces associated with granular feature spaces, (c) different feature spaces, each associated with a collection of granular feature spaces

3 Formation of feature space for time series

There is a plethora of approaches to represent time series (Aznarte and Benitez 2010; Abonyi et al. 2005; Fu 2011; Kasabov and Song 2002), as this has been an area of intensive research in time series analysis, modeling (Minyoung 2014), and classification (Frank et al. 2013). There are a number of comprehensive review studies as well as comparative analyses. The representation of time series ranges from schemes that are symbolic to those that result in purely numeric representations. A general taxonomy of the representation methods, see Lin et al. (2004), delivers a general view of the area structured into two main categories: data adaptive and non-adaptive. In the first class of representation schemes, we distinguish several main categories, including piecewise polynomial, symbolic, trees, singular value decomposition, sorted coefficients, etc. When it comes to the non-adaptive schemes, one can refer here to constructs such as wavelets, random mappings, spectral representations, and piecewise aggregate approximation. The reader may refer to the variety of existing approaches (Fu 2011; Chen 2011; Das et al. 1998; Fu et al. 2006; Hirano and Tsumoto 2002; Jiang et al. 2007).

In terms of a general view of the interpretation of time series, it is beneficial to distinguish between the representations that dwell on temporal data (and their transformations, including differences of any order) and those that bring the representation of time series into the spectral domain. This perspective is useful, as the temporal representations of the data are usually easier for humans to comprehend (and this may directly impact the interpretation of the granular classifiers).

As in the general scheme discussed in Sect. 2, there are a number of phases put together to form the overall architecture of the classification model. It must be stressed that the selection of the representation scheme leading to the formation of the feature space is not unique, and this step has to be synchronized with other design activities. In terms of efficiency, it is always a sound development strategy to explore several commonly utilized alternatives prior to developing a new, highly specialized feature space.

4 Construction of information granules: a realization of granular feature space through fuzzy clustering

Irrespective of the way one decides to represent time series, the result of this representation comes as a collection of "\(N\)" time series that can be viewed as a family of \(n\)-dimensional vectors \({\mathbf{z}}_1, {\mathbf{z}}_2, \ldots , {\mathbf{z}}_N\), where \({\mathbf{z}}_k \in {\mathbf{R}}^n\). We form a collection of information granules (clusters) in \({\mathbf{R}}^n\) by running a clustering algorithm. Fuzzy clustering, such as Fuzzy \(C\)-Means (FCM) (Bezdek 1981; Pedrycz and Gomide 2007), is one of the commonly used techniques here. The algorithm is well established, carefully analyzed with respect to its optimization capabilities, and comes with a wealth of applications in various areas.

Let us recall that the FCM clustering is realized through the minimization of the following objective function

$$\begin{aligned} Q=\sum \limits _{i=1}^c {\sum \limits _{k=1}^N {u_{ik}^m } } \vert \vert \mathbf{z}_k - \mathbf{v}_i \vert \vert ^2 \end{aligned}$$
(1)

Here the \({\mathbf{v}}_i\) are the \(n\)-dimensional prototypes of the clusters, \(i = 1, 2, \ldots , c\), and \(U = [u_{ik}]\) stands for a partition matrix expressing the way the data are allocated to the corresponding clusters; \(u_{ik}\) is the membership degree of the datum \({\mathbf{z}}_k\) in the \(i\)-th cluster. The distance between the datum \({\mathbf{z}}_k\) and the prototype \({\mathbf{v}}_i\) is denoted by \(\vert \vert \cdot \vert \vert \). The fuzzification coefficient "\(m\)" (assuming values greater than 1.0) expresses the impact of the membership grades on the individual clusters and implies a certain geometry of the produced fuzzy sets. A partition matrix satisfies two important and intuitively appealing properties

  1. (a)
    $$\begin{aligned} 0<\sum \limits _{k=1}^N {u_{ik} <N} ,\,\quad i=1,2,\ldots ,\,c \end{aligned}$$
  2. (b)
    $$\begin{aligned} \sum \limits _{i=1}^{c} {{u}_{ik} =1} ,\;\;\;{k}=1,2,\ldots ,{N} \end{aligned}$$
    (2)

Let us denote by U the family of matrices satisfying (a) and (b). The first requirement states that each cluster has to be nonempty and different from the entire set. The second requirement states that the membership grades of each datum have to sum to 1.

The minimization of \(Q\) is completed with respect to \(U \in {\mathbf{U}}\) and the prototypes \({\mathbf{V}}=\{ \varvec{v}_1, \varvec{v}_2, \ldots , \varvec{v}_c \}\) of the clusters. More explicitly, we write this down as follows

$$\begin{aligned} \min \;Q\;\hbox {with respect to U}\in {\mathbf{U}}, \varvec{v}_1, \varvec{v}_2, \ldots , \varvec{v}_c \in {\mathbf{R}}^n \end{aligned}$$
(3)

From the optimization standpoint, there are two individual optimization tasks, carried out separately for the partition matrix and for the prototypes. The first one concerns the minimization subject to the constraints (2), which hold for each data point \({\mathbf{z}}_k\). The use of Lagrange multipliers converts the problem into its constraint-free version. Once solved, the partition matrix is computed as follows

$$\begin{aligned} u_{st} =\frac{1}{\sum \nolimits _{j=1}^c {( {\frac{\vert \vert \mathbf{z}_t - \mathbf{v}_s \vert \vert }{\vert \vert \mathbf{z}_t - \mathbf{v}_j \vert \vert }})^{\frac{2}{m-1}}}} \end{aligned}$$
(4)

\(s=1,2,\ldots ,c;t=1,2,\ldots ,N.\)

The optimization of the prototypes \({\mathbf{v}}_i\) is carried out assuming the Euclidean distance between the data and the prototypes, that is, \(\vert \vert \mathbf{z}_k -\mathbf{v}_i \vert \vert ^2=\sum \nolimits _{j=1}^n {(z_{kj} -v_{ij} )^2}\). The objective function now reads as \(Q=\sum \nolimits _{i=1}^c \sum \nolimits _{k=1}^N {u_{ik}^m} \sum \nolimits _{j=1}^n {(z_{kj} -v_{ij} )^2}\), and its gradient with respect to \({\mathbf{v}}_i\), namely \(\nabla _{\mathbf{v}_i} Q\), set to zero yields the following expression

$$\begin{aligned} v_{st} =\frac{\sum \nolimits _{k=1}^N {u_{sk}^m z_{kt}}}{\sum \nolimits _{k=1}^N {u_{sk}^m}} \end{aligned}$$
(5)

Overall, the FCM clustering is completed through a sequence of iterations: we start from a random allocation of the data (a randomly initialized partition matrix) and alternately update the values of the partition matrix and the prototypes. The iterative process continues until a certain termination criterion is satisfied. Typically, the termination condition is quantified by looking at the changes in the membership values of the successive partition matrices. Denote by U(\(t\)) and U(\(t+1\)) the two partition matrices produced in two consecutive iterations of the algorithm. If the distance \(\vert \vert \)U(\(t+1\))\(-\)U\((t)\vert \vert \) is less than a small predefined threshold \(\varepsilon \) (say, \(\varepsilon = 10^{-5}\) or \(10^{-6}\)), the algorithm is terminated. Typically, one considers the Chebyshev distance between the partition matrices, meaning that the termination criterion reads as follows

$$\begin{aligned} {\max \nolimits _{i,k}} \vert u_{ik} (t+1)-u_{ik} (t)\vert \le \varepsilon \end{aligned}$$
(6)

The fuzzification coefficient exhibits a direct impact on the geometry of the fuzzy sets generated by the algorithm. Typically, the value of "\(m\)" is assumed to be equal to 2.0. Lower values of \(m\) (closer to 1) yield membership functions that start resembling the characteristic functions of sets; most of the membership values become localized around 1 or 0. Increasing the fuzzification coefficient (\(m\) = 3, 4, etc.) produces "spiky" membership functions, with membership grades equal to 1 at the prototypes and a fast decline of the values when moving away from them.
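
To make the iterative scheme concrete, the following minimal sketch implements the FCM loop described by (4), (5), and the Chebyshev-based termination criterion (6). It is an illustrative Python/NumPy rendering under our own naming conventions (the function `fcm` and its arguments are not part of any referenced library), not the exact code used in the experiments.

```python
import numpy as np

def fcm(Z, c, m=2.0, eps=1e-5, max_iter=100, seed=None):
    """Sketch of Fuzzy C-Means: minimize Q in (1) by alternating (4) and (5).
    Z: (N, n) data matrix; c: number of clusters; m: fuzzification coefficient (> 1)."""
    rng = np.random.default_rng(seed)
    N, _ = Z.shape
    U = rng.random((c, N))
    U /= U.sum(axis=0)                       # enforce the column-sum constraint (2b)
    for _ in range(max_iter):
        Um = U ** m
        V = (Um @ Z) / Um.sum(axis=1, keepdims=True)             # prototypes, Eq. (5)
        D = ((Z[None, :, :] - V[:, None, :]) ** 2).sum(axis=2)   # squared distances
        W = np.fmax(D, 1e-12) ** (-1.0 / (m - 1.0))
        U_new = W / W.sum(axis=0)                                # partition matrix, Eq. (4)
        if np.abs(U_new - U).max() <= eps:                       # Chebyshev criterion, Eq. (6)
            return U_new, V
        U = U_new
    return U, V
```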

The quality of clustering and the choice of the number of clusters can be assessed in many ways. Cluster validity indexes are quite commonly encountered here (refer to Bezdek 1981). A reconstruction criterion (Pedrycz et al. 2008) can serve as another indicator quantifying the quality of the clusters used in the processes of information granulation and degranulation.

As a result of clustering, we obtain \(c\) clusters fully described by the corresponding prototypes \(\varvec{v}_1, \varvec{v}_2, \ldots , \varvec{v}_c\), or equivalently by fuzzy sets \(A_1, A_2, \ldots , A_c\) whose membership functions are computed as follows

$$\begin{aligned} A_{i} ({\mathbf{z}}) = \frac{1}{\sum \nolimits _{j=1}^{c} {\left( {\frac{\vert \vert {\mathbf{z}}-{\mathbf{v}}_{i} \vert \vert }{\vert \vert {\mathbf{z}}-{\mathbf{v}}_{j} \vert \vert }}\right) ^{2/(m-1)}} } \end{aligned}$$
(7)

Any vector z representing a time series in a certain representation space (feature space) results in a collection of membership degrees \({\mathbf{x}}\in \left[ {0,1} \right] ^{c}\), \({\mathbf{x}}= \left[ {A}_1 ({\mathbf{z}}), {A}_2 ({\mathbf{z}}), \ldots , {A}_c ({\mathbf{z}})\right] \), where the membership degrees are computed by looking at the closeness of z to the prototypes.

The quality of the granulation–degranulation process realized with the aid of the clusters can be evaluated as well (Pedrycz et al. 2008).

Let us stress that the granulation of the initial (numeric) representation space exhibits two interesting aspects. First, granulation could reduce the dimensionality of the problem: the space formed here is of dimensionality "\(c\)", where typically \(c<n\). Furthermore, the process results in a nonlinear transformation of the original feature space (observe that (7) provides a nonlinear transformation of the numeric representation of signals), which could enhance the discriminatory capabilities of the ensuing classifier.
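
As a small illustration of this nonlinear mapping, the following hedged sketch evaluates (7) for a numeric feature vector z given the prototypes returned by the clustering; the function name `granulate` is ours, and the snippet builds on the NumPy import of the previous sketch.

```python
def granulate(z, V, m=2.0):
    """Map a numeric feature vector z into the granular feature space via Eq. (7):
    returns x = [A_1(z), ..., A_c(z)], a membership vector in [0, 1]^c."""
    d = np.fmax(((z - V) ** 2).sum(axis=1), 1e-12)   # ||z - v_i||^2 for every prototype
    w = d ** (-1.0 / (m - 1.0))
    return w / w.sum()                               # memberships sum to 1
```

Note that the output is of dimensionality \(c\), typically lower than \(n\), and that the mapping is nonlinear in z.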

There are two main approaches to the formation of the information granules in the granular feature space:

  1. (a)

As fuzzy clustering is a method of unsupervised learning, one can look at the data (representations of the series) as not carrying any class labels and form the clusters over all of them. Obviously, the number of clusters (\(c\)) has to be equal to or greater than the number of classes.

  2. (b)

Clustering can be realized separately for the time series belonging to the individual classes; here only the time series belonging to a given class are clustered together.

5 Granular classifiers

In this section, we present the design process of the granular classifier, starting with a discussion of its architecture (where several alternatives are investigated, their properties analyzed, and the classifiers provided with their interpretation). We then present various ways in which the parameters of the classifier are determined.

5.1 Architecture of the classifier

Given the nature of the granular feature space, in which we delineate a collection of information granules as the visible building blocks, we consider here a relational category of classifiers. They are realized in the form of relational mappings between information granules and class assignments. Relational classifiers deliver the interesting capability of revealing and capturing the logic relationships that are dominant for the corresponding classes. Assuming that the information granules (fuzzy sets) forming the two granular feature spaces, see Fig. 2b, are denoted as \(A_i\) and \(B_j\), the relational dependency between the class membership and the activation (compatibility) levels of the corresponding information granules in the input space can be described in the form

$$\begin{aligned} \omega = ({\mathbf{A}}_i \times {\mathbf{B}}_j )\hbox { op}\;{\mathbf{R}} \end{aligned}$$
(8)

where \(\omega \) is a \(p\)-dimensional vector of class membership (note that by virtue of using fuzzy sets, we may encounter degrees of membership to individual classes rather than a Boolean "yes–no" binary quantification of the class assignment). Here the symbol "op" denotes a certain relational composition operator, used to compose the fuzzy relation of the classifier R with the current granular descriptors of a given time series to be classified, while \(\times \) stands for the Cartesian product of the coordinates of the granular feature space. Let us look at (8) in more detail by identifying the individual variables. A given time series is described in a certain feature space, giving rise to the corresponding feature vector z. In the sequel, the resulting vector positioned in the granular feature space is determined on the basis of the existing prototypes by computing the activation (compatibility) levels \({A}_1 ({\mathbf{z}}), {A}_2 ({\mathbf{z}}), \ldots , {A}_{c_1} ({\mathbf{z}})\) and \({B}_1 ({\mathbf{z}}), {B}_2 ({\mathbf{z}}), \ldots , {B}_{c_2} ({\mathbf{z}})\); see also (7). The details of the scheme are portrayed in Fig. 3.

Fig. 3 Realization of the relational classifier in the case of two granular feature spaces formed for a certain feature space

It is convenient to introduce the concise vector notation \({\mathbf{x}}=\left[ {A}_1 ({\mathbf{z}}), {A}_2 ({\mathbf{z}}), \ldots , {A}_{c_1} ({\mathbf{z}}) \right] \) and \({\mathbf{y}}=\left[ {B}_1 ({\mathbf{z}}), {B}_2 ({\mathbf{z}}), \ldots , {B}_{c_2} ({\mathbf{z}}) \right] \) describing the representation of the pattern z in the two granular feature spaces. The vector of class membership \(\omega \) is \(p\)-dimensional, with the "\(p\)" coordinates describing degrees of membership to the corresponding classes. This helps us rewrite (8) in the following form

$$\begin{aligned} \omega = ({\mathbf{x}}\times {\mathbf{y}})\hbox { op}\;{\mathbf{R}} \end{aligned}$$
(9)

We rewrite (9) in terms of the individual coordinates of the components. Note that the Cartesian product is modeled by a certain \(t\)-norm (the minimum operation, in particular), namely min(\(x_{i}\), \(y_{j}\)) or t(\(x_{i}\), \(y_{j}\)). Overall, we have

$$\begin{aligned} \omega _l =\hbox {op}_{i,j} \left[ {t( {x_i ,y_j }),r_{ijl} } \right] \end{aligned}$$
(10)

where \(l=1, 2,{\ldots }, p\), and the aggregation operator realizes a convolution of the components of the Cartesian product with the corresponding entries of the fuzzy relation (matrix) R.

In the above topology of the classifier, we have considered two granular feature spaces (see Fig. 2) to fully illustrate the underlying processing. This description of the granular classifier can easily be scaled down to a single feature space (as presented in Fig. 1) or extended to granular feature spaces of higher dimensionality (with a larger number of descriptors).

Several alternatives with regard to the composition operator (op), which come with well-articulated logic interpretations, will be investigated. In particular, one can look at the two well-known logic-based compositions encountered in fuzzy sets:

  1. (a)

    s–t or max–min composition of x \(\times \) y and R

    $$\begin{aligned} \omega = ({\mathbf{x}}\times {\mathbf{y}})\circ {\mathbf{R}} \end{aligned}$$
    (11)

    which in terms of the individual elements of the vectors reads as follows

    $$\begin{aligned} \omega _l = {\max \nolimits _{i,j}} \left[ {\min ( {\min ({x_i, y_j}),r_{ijl}})} \right] \end{aligned}$$
    (12)

    (the min and max operators can be generalized to any \(t\)-norm and \(t\)-conorm).

  2. (b)

    t–s or min–max composition of the complement of x \(\times \) y and the fuzzy relation R

    $$\begin{aligned} \omega =\overline{{\mathbf{x}}\times {\mathbf{y}}} \cdot {\mathbf{R}} \end{aligned}$$
    (13)

    where the overbar stands for the complement operation. Again, we express the above relationship in terms of the membership grades as follows

    $$\begin{aligned} {\omega _l} = {\min \nolimits _{i,j}}\left[ {\max (1 - \min ({x_i},{y_j}),{r_{ijl}})} \right] \end{aligned}$$
    (14)

As an illustrative example, let us discuss a two-class problem with a single input x (i.e., we confine ourselves to a single granular feature space) in which the relation R has the entries [1.0 0.8 0.2 0.1 0.0] for class \(\omega _1 \) and [0.2 0.3 0.8 0.9 1.0] for class \(\omega _2 \).

Consider the input x taking on the form [0.1 1.0 0.3 0.2 0.0]. Carrying out the max–min composition, we obtain the class membership vector \(\omega \) = [0.8 0.3], whereas with the min–max composition (note that here we involve the complement of x, viz. [0.9 0.0 0.7 0.8 1.0]) one obtains the class membership vector [0.7 0.3]. Altogether, the obtained class membership vectors show that x belongs to the first class, with a membership grade positioned in between 0.7 and 0.8.
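
The two compositions in this example can be reproduced with a few lines of NumPy; the sketch below merely mechanizes (12) and (14) for the single-feature-space case above.

```python
import numpy as np

R = np.array([[1.0, 0.8, 0.2, 0.1, 0.0],    # entries for class omega_1
              [0.2, 0.3, 0.8, 0.9, 1.0]])   # entries for class omega_2
x = np.array([0.1, 1.0, 0.3, 0.2, 0.0])

# s-t (max-min) composition, Eq. (12)
omega_st = np.max(np.minimum(x, R), axis=1)        # -> [0.8, 0.3]

# t-s (min-max) composition of the complement of x with R, Eq. (14)
omega_ts = np.min(np.maximum(1.0 - x, R), axis=1)  # -> [0.7, 0.3]
```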

5.2 Construction of the fuzzy relation of the classifier

In this section, we discuss two main ways of developing the fuzzy relation of the classifier.

5.2.1 Gradient-based learning scheme

The gradient-based scheme operates in a supervised mode in the presence of input–output data, where the inputs are the granular representations \({\mathbf{x}}_k\) of the time series and the outputs are binary vectors of class membership, \(\mathbf{target}_k\), \(k = 1, 2, \ldots , N\). A performance index \(Q\) quantifies the distance between \(\mathbf{target}_k\) and the vector \(\omega _k\) produced by the granular classifier. Typically, a sum of squares is considered

$$\begin{aligned} Q=\sum \limits _{k= {1}}^{N} {(\omega _{k} } -{\mathbf{target}}_{k})^{T} (\omega _{k} -{\mathbf{target}}_{k}) \end{aligned}$$
(15)

The update formula for the gradient-based learning is described concisely as

$$\begin{aligned} {\mathbf{R}}( {\mathrm{iter}+1}) ={\mathbf{R}}( \mathrm{iter}) -\alpha \nabla _{\mathbf{R}} Q \end{aligned}$$
(16)

where \(\nabla _{\mathbf{R}} Q\) is the gradient of \(Q\) computed with respect to R, \(\alpha \) stands for a positive learning rate, and the iteration index (iter) runs over 0, 1, 2, .... The initial fuzzy relation R(0) accumulates the existing experimental evidence in the form of the union of the Cartesian products of the input–output data, namely

$$\begin{aligned} {\mathbf{R}}(0) =\bigcup \limits _{k={1}}^N {({\mathbf{x}}_k \times {\mathbf{y}}_k \times {\varvec{\omega }}_k )} \end{aligned}$$
(17)
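
A didactic sketch of this learning scheme is given below. We use the max–min (s–t) composition and, for transparency, a finite-difference approximation of \(\nabla _{\mathbf{R}} Q\) rather than the analytic gradient; all names (`classify`, `init_R`, `grad_step`) are ours, and the code illustrates (15)–(17) rather than reproducing the exact implementation.

```python
import numpy as np

def classify(x, y, R):
    """s-t (max-min) composition, Eqs. (11)-(12); R has shape (c1*c2, p)."""
    xy = np.minimum.outer(x, y).ravel()               # Cartesian product min(x_i, y_j)
    return np.max(np.minimum(xy[:, None], R), axis=0)

def init_R(X, Y, T):
    """Eq. (17): union (elementwise max) of Cartesian products of input-output data."""
    return np.max([np.minimum(np.minimum.outer(x, y).ravel()[:, None], t[None, :])
                   for x, y, t in zip(X, Y, T)], axis=0)

def grad_step(X, Y, T, R, alpha=0.05, h=1e-4):
    """One update of Eq. (16) using a numerical gradient of Q in Eq. (15)."""
    def Q(R_):
        return sum(((classify(x, y, R_) - t) ** 2).sum() for x, y, t in zip(X, Y, T))
    base = Q(R)
    G = np.zeros_like(R)
    for idx in np.ndindex(*R.shape):                  # finite-difference gradient
        Rp = R.copy()
        Rp[idx] += h
        G[idx] = (Q(Rp) - base) / h
    return np.clip(R - alpha * G, 0.0, 1.0)           # keep the entries in [0, 1]
```

Here X and Y are lists of the granular representations of the training series and T the corresponding binary target vectors.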

5.2.2 Evolutionary optimization of the classifier

In the design of the classifier, the gradient-based learning minimizes the performance index (15), which, however, is not fully reflective of the performance of the classifier, viz. the classification error (to be minimized) or the classification accuracy (to be maximized). In other words, when minimizing \(Q\), there is no guarantee that the classification error, or any other measure typical for the assessment of classification schemes, is minimized.

Having this in mind, it is advantageous to look at more advanced methods that focus directly on the minimization of the classification error (maximization of the classification rate). In this regard, the use of evolutionary or population-based optimization methods is advisable, given the flexibility of the fitness function supported by the nature of the optimization process.

To make the fitness function fully reflective of the performance of the classifier (so that the classifier can be effectively optimized), we determine the maximal entry of the vector \(\omega _{k}\) obtained for a given Cartesian product \(\mathbf{x}_{k} \times \mathbf{y}_{k}\), along with its location in the vector of class membership, and form a binary vector \(\varvec{b}_{k}\) with a single entry set to 1 (the others set to zero) positioned at the \(j_{0}\)-th coordinate, where

$$\begin{aligned} j_0 = \arg \max \nolimits _{j=1,2,\ldots ,p} \omega _{kj}. \end{aligned}$$
(18)

The \(\mathbf{target}_k\) is a binary vector with a single entry set to 1. A nonzero distance of this vector from \({\mathbf{b}}_k\) indicates that the \(k\)-th pattern has been misclassified. The sum of these distances (with the summation completed over all patterns), denoted by V, is regarded as the fitness function; its minimization is equivalent to the minimization of the classification error. In other words, V is the classification error. In the experimental studies we use Particle Swarm Optimization (PSO); its choice is motivated by the simplicity, relatively low computing overhead, and high effectiveness of the algorithm.
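
For completeness, a hedged sketch of the fitness evaluation is shown below; it counts the misclassified patterns by comparing the position of the maximal entry (18) with the position of the 1 in \(\mathbf{target}_k\), which is equivalent to summing the distances between the binary vectors \({\mathbf{b}}_k\) and \(\mathbf{target}_k\). The `classify` function is the one sketched in Sect. 5.2.1.

```python
def fitness(R, X, Y, targets):
    """Classification-error fitness V minimized by the PSO (illustrative sketch)."""
    errors = 0
    for x, y, t in zip(X, Y, targets):
        omega = classify(x, y, R)          # class membership vector, Eq. (9)
        j0 = int(np.argmax(omega))         # winning class, Eq. (18)
        errors += int(j0 != int(np.argmax(t)))
    return errors                          # V: number of misclassified patterns
```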

6 Experiments

The experimental studies reported in this section offer a comprehensive view of the essence of the overall classification process and help highlight the main functionalities of the architecture of the classifier.

Proceeding with the realization of the overall scheme whose general structure has been discussed in the previous sections, some implementation aspects have to be decided upon. Among the plethora of existing representations of time series, the SAX method, or more precisely its initial phase, the so-called Piecewise Aggregate Approximation (PAA) (Lin et al. 2004), is used to represent the original temporal data. The underlying idea is simple and computationally inexpensive. Given an original time series {z\(_{k}\)} of length \(N\), we form time intervals of length T and, for each of these intervals, compute the average of the samples of the time series, thus producing a reduced sequence {z\(_{l}\)}, \(l=1, 2,{\ldots },L\), with L being the rounded-off ratio \(N\)/\(T\). Through the PAA method, the original data are reduced, with the reduction rate determined by the value of \(T\). In addition, for the reduced sequence we form the differences \(\Delta z_l =z_l -z_{l-1}\), which capture the changes in the reduced time series.
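
A minimal sketch of this representation step is shown below; it truncates the series to a whole number of windows (one reasonable reading of the rounded-off ratio \(N/T\)) and returns both the window averages and their first differences. The function name `paa` is ours.

```python
import numpy as np

def paa(z, T):
    """Piecewise Aggregate Approximation plus differences, as used in this section."""
    z = np.asarray(z, dtype=float)
    L = len(z) // T                            # number of complete windows of length T
    zl = z[:L * T].reshape(L, T).mean(axis=1)  # window averages {z_l}
    dzl = np.diff(zl)                          # differences dz_l = z_l - z_{l-1}
    return zl, dzl
```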

Alluding to the structure of the classification scheme discussed in Sect. 2, we concentrate on the scenario illustrated in Fig. 4. The PAA is applied to the series {z\(_{k}\)}, producing the reduced (compressed) sequences of averages over the time windows, {z\(_{l}\)}, along with the differences \(\left\{ {\Delta z_l } \right\} \). For these two sets, we form the corresponding granular feature spaces by running the FCM with the numbers of clusters set to c\(_{1}\) (amplitudes) and c\(_{2}\) (changes of amplitude). The Fuzzy \(C\)-Means (FCM) is run with the weighted Euclidean distance and the fuzzification coefficient (\(m\)) set to 2. The number of iterations was set to 100; we found experimentally that this number was sufficient to achieve the convergence of the method.

Fig. 4 Classification scheme used in the experimental studies

Let us recall that the three parameters influencing the classification mechanism are the length of the time window T and the numbers of clusters \(c_{1}\) and \(c_{2}\) forming the corresponding granular feature spaces. All of them are investigated in the experiments. In the suite of experiments reported in this section, we use publicly available time series (Keogh et al. 2011); in total, 20 data sets have been used. The results presented here are compared with results reported in the literature and produced by some existing classifiers. Let us note that the experiments are carried out for data sets that have already been split into training and testing sets. The details about the time series are covered in Table 1. In particular, when describing the data, we include their length and the number of classes, as these two values give a better insight into the nature of the underlying time series and the ensuing difficulties in the classification process.

As far as the Particle Swarm Optimization (PSO) is concerned, the size of the population was set to 25 and the algorithm was run for 100 generations. The values of the cognitive and social acceleration coefficients were set to 2.8 and 1.3, respectively; these two values are commonly recommended in the literature, see Carlisle and Dozier (2001). The relational classifiers were realized with the use of the \(s\)–\(t\) and \(t\)–\(s\) compositions, with the \(t\)-norm realized as the algebraic product and the \(t\)-conorm specified as the probabilistic sum.

Table 1 Time series data used in the experiments: a summary

Let us discuss the details of the design when considering one of the data sets in Table 1, namely ECG200. It represents electrocardiogram measurements of cardiac electrical activity as recorded from electrodes positioned at various locations on the body. In the training and testing parts of the ECG200 data set, the signals are labeled as class 1 and class 2 (normal and abnormal, respectively). Plots of the original time series coming from the two classes are shown in Fig. 5.

Fig. 5 Examples of ECG time series

For illustration, setting the value \(T\) = 17 (that is, \(L\) = 6), the plots of the corresponding PAA representations of the signals are shown in Fig. 6.

Fig. 6 ECG200 time series \(z_l\) and differences \(\Delta z_l\): (a) class 1 of ECG200, (b) \(z_l\), (c) \(\Delta z_l\)

The prototypes obtained using the FCM are shown in Fig. 7. As noted before, \(T\), \(c_{1}\), and \(c_{2}\) are the three essential design parameters impacting the performance of the classifier. We carried out a series of experiments to visualize their impact on the classification accuracy. Furthermore, we contrast the results obtained before and after the PSO optimization. To visualize the effectiveness of the PSO, the values of the fitness function for the training set are shown in Fig. 8.

The fuzzy relational classifier produced the classification accuracies reported in Table 2. Apparently, the performance of the classifier depends on the selected values of \(T\), \(c_{1}\), and \(c_{2}\). It can be noticed that the highest classification rates for the training and test data sets without PSO optimization (no PSO) occur at \(T=5\), regardless of the values of \(c_{1}\) and \(c_{2}\). Furthermore, when comparing the classification rates for the specific values of \(c_{1}\) and \(c_{2}\) in each table, the highest values of these rates (in case no PSO optimization has been carried out) were in most cases reported for \(c_{1} = 9\) or \(c_{2} = 9\). The results obtained after the optimization of the fuzzy relations R and G coincide with this trend, in which the maximum classification rate for the training data set when using the PSO occurs at \(T = 5\), \(c_{1} = 9\), and \(c_{2} = 9\).

Fig. 7 Prototypes of ECG200 time series (class 1): (a) \(c_{1}=5\), (b) \(c_{2}=7\); \(T=17\)

Fig. 8 Values of the fitness function reported in successive generations of the PSO (ECG200): (a) \(s\)–\(t\) composition, (b) \(t\)–\(s\) composition; \(T = 17\), \(c_{1} = 5\), \(c_{2} = 7\)

Table 2 Classification results (training and testing data) obtained for different numbers of clusters (\(s\)–\(t\) and \(t\)–\(s\) compositions)

The optimized fuzzy relation R of the classifier (with the entries above 0.6 marked in bold), obtained for \(c_{1}\) = 9, \(c_{2}\) = 9, and \(T\) = 5, is reported below:

R (class \(\omega _1\)) =

|    | \(\delta\)A1 | \(\delta\)A2 | \(\delta\)A3 | \(\delta\)A4 | \(\delta\)A5 | \(\delta\)A6 | \(\delta\)A7 | \(\delta\)A8 | \(\delta\)A9 |
|----|-------|-------|-------|-------|-------|-------|-------|-------|-------|
| A1 | 0.061 | 0.162 | 0.056 | 0.134 | 0.032 | 0.140 | 0.293 | 0.282 | 0.291 |
| A2 | 0.058 | 0.022 | 0.047 | 0.100 | 0.525 | 0.003 | 0.229 | 0.042 | 0.067 |
| A3 | 0.051 | 0.151 | **0.790** | 0.453 | 0.011 | 0.012 | 0.090 | 0.020 | 0.078 |
| A4 | 0.221 | 0.084 | 0.056 | 0.172 | 0.036 | 0.226 | 0.026 | 0.042 | 0.009 |
| A5 | 0.049 | 0.044 | 0.012 | 0.014 | 0.187 | 0.109 | 0.004 | 0.080 | 0.202 |
| A6 | 0.026 | **0.846** | 0.139 | 0.084 | 0.292 | 0.143 | 0.099 | 0.013 | 0.261 |
| A7 | 0.044 | 0.018 | 0.315 | 0.038 | 0.278 | 0.218 | 0.257 | 0.071 | 0.136 |
| A8 | 0.082 | 0.081 | 0.001 | 0.084 | 0.027 | 0.106 | 0.035 | 0.004 | 0.314 |
| A9 | 0.108 | 0.001 | 0.026 | 0.596 | 0.354 | 0.596 | 0.011 | 0.078 | 0.384 |

R (class \(\omega _2\)) =

|    | \(\delta\)A1 | \(\delta\)A2 | \(\delta\)A3 | \(\delta\)A4 | \(\delta\)A5 | \(\delta\)A6 | \(\delta\)A7 | \(\delta\)A8 | \(\delta\)A9 |
|----|-------|-------|-------|-------|-------|-------|-------|-------|-------|
| A1 | 0.122 | 0.051 | 0.057 | 0.357 | 0.080 | 0.014 | 0.271 | 0.013 | 0.100 |
| A2 | 0.410 | 0.224 | 0.318 | 0.329 | 0.220 | 0.185 | 0.233 | 0.571 | 0.036 |
| A3 | 0.164 | 0.003 | 0.565 | 0.008 | 0.061 | 0.185 | 0.115 | 0.148 | 0.086 |
| A4 | 0.189 | 0.067 | 0.424 | 0.056 | 0.068 | 0.295 | 0.057 | 0.013 | 0.000 |
| A5 | 0.013 | 0.506 | 0.419 | 0.101 | 0.028 | 0.424 | 0.036 | 0.124 | 0.012 |
| A6 | 0.358 | 0.194 | 0.086 | 0.428 | 0.133 | 0.059 | 0.510 | 0.199 | 0.063 |
| A7 | 0.137 | 0.249 | 0.382 | 0.253 | 0.026 | 0.100 | 0.507 | 0.215 | 0.078 |
| A8 | 0.090 | 0.062 | 0.022 | 0.512 | 0.002 | 0.176 | **0.652** | 0.393 | 0.191 |
| A9 | 0.081 | 0.489 | 0.023 | 0.241 | 0.145 | 0.254 | 0.174 | 0.230 | 0.029 |

The most significant rules inferred from this relational classifier (identified with the highest entries of the fuzzy relation R) are listed below. Note that each rule comes with the associated entry of the fuzzy relation.

Table 3 Classification accuracy values obtained for the time series data

class \(\omega _{1}\):

  • If A\(_{6}\) and \(\delta \)A\(_{2}\) then class \(\omega _{1}\) (0.846)

  • If A\(_{3}\) and \(\delta \)A\(_{3}\) then class \(\omega _{1}\) (0.790)

  • If A\(_{9}\) and \(\delta \)A\(_{4}\) then class \(\omega _{1}\) (0.596)

  • If A\(_{9}\) and \(\delta \)A\(_{6}\) then class \(\omega _{1}\) (0.596)

class \(\omega _{2}\):

  • If A\(_{8}\) and \(\delta \)A\(_{7}\) then class \(\omega _{2}\) (0.652)

  • If A\(_{2}\) and \(\delta \)A\(_{8}\) then class \(\omega _{2}\) (0.571)

  • If A\(_{3}\) and \(\delta \)A\(_{3}\) then class \(\omega _{2}\) (0.565)

  • If A\(_{6}\) and \(\delta \)A\(_{7}\) then class \(\omega _{2}\) (0.510)

G (class \(\omega _1\)) =

|    | \(\delta\)A1 | \(\delta\)A2 | \(\delta\)A3 | \(\delta\)A4 | \(\delta\)A5 | \(\delta\)A6 | \(\delta\)A7 | \(\delta\)A8 | \(\delta\)A9 |
|----|-------|-------|-------|-------|-------|-------|-------|-------|-------|
| A1 | 0.990 | 0.879 | 0.412 | 0.946 | 0.816 | 0.668 | 0.760 | 0.576 | 0.712 |
| A2 | 0.461 | 0.878 | 0.704 | 0.934 | 0.990 | 0.959 | 0.315 | 0.824 | 0.699 |
| A3 | 0.445 | 0.877 | 0.700 | 0.876 | 0.868 | 0.342 | 0.855 | 0.685 | 0.740 |
| A4 | 0.829 | 0.930 | 0.555 | 0.748 | 0.817 | 0.589 | 0.813 | 0.845 | 0.623 |
| A5 | 0.955 | 0.474 | **0.187** | 0.324 | 0.885 | 0.811 | 0.478 | 0.948 | 0.856 |
| A6 | 0.990 | 0.640 | 0.484 | 0.609 | 0.910 | 0.906 | 0.502 | 0.705 | 0.826 |
| A7 | 0.600 | 0.853 | 0.658 | **0.130** | 0.820 | 0.829 | 0.787 | 0.468 | 0.995 |
| A8 | 0.790 | 0.979 | 0.494 | **0.167** | 0.873 | 0.836 | 0.534 | 0.447 | 0.474 |
| A9 | 0.643 | 0.488 | 0.915 | 0.622 | 0.808 | 0.972 | 0.632 | 0.937 | 0.810 |

G (class \(\omega _2\)) =

|    | \(\delta\)A1 | \(\delta\)A2 | \(\delta\)A3 | \(\delta\)A4 | \(\delta\)A5 | \(\delta\)A6 | \(\delta\)A7 | \(\delta\)A8 | \(\delta\)A9 |
|----|-------|-------|-------|-------|-------|-------|-------|-------|-------|
| A1 | 0.480 | 0.749 | 0.846 | 0.501 | 0.969 | 0.789 | 0.766 | 0.986 | 0.696 |
| A2 | 0.894 | 0.623 | 0.819 | 0.708 | 0.903 | 0.942 | 0.857 | 0.816 | 0.334 |
| A3 | 0.983 | 0.899 | 0.533 | 0.487 | 0.841 | 0.922 | 0.238 | **0.002** | 0.753 |
| A4 | 0.910 | 0.687 | 0.866 | 0.903 | 0.962 | 0.770 | 0.934 | 0.500 | 0.997 |
| A5 | 0.948 | 0.946 | 0.934 | 0.324 | 0.992 | 0.974 | 0.767 | 0.948 | 0.762 |
| A6 | 0.832 | 0.967 | 0.972 | 0.734 | 0.922 | **0.128** | 0.585 | 0.699 | 0.793 |
| A7 | 0.631 | 0.812 | 1.000 | 0.934 | 0.837 | 1.000 | 0.460 | 0.930 | 0.704 |
| A8 | 0.978 | **0.159** | 0.950 | 0.949 | 0.585 | 0.855 | 0.723 | 0.819 | 0.833 |
| A9 | 0.479 | 0.660 | 0.820 | 0.547 | 0.710 | 0.833 | 0.900 | **0.029** | 0.640 |

The lowest entries of G (below 0.2) are shown in boldface. The highest classification rate for the test set (0.94) is obtained for the parameters \(c_{1}\) = 9, \(c_{2}\) = 7, \(T\) = 5, which differ from the parameters produced by optimizing R and G (\(c_{1}\) = 9, \(c_{2}\) = 9, \(T\) = 5). This can be attributed to the low number of time series signals (100) used in the training and test sets.

Table 3 summarizes the accuracy values of the fuzzy classifier (using the relations R and G). These results are compared with the classification outcomes produced by other commonly used classifiers for the publicly available data sets (Keogh et al. 2011); the results shown for these classifiers were reported in Keogh (2006). The numbers of clusters \(c_{1}\) and \(c_{2}\) and the length of the time window \(T\) were selected individually for each data set so as to lead to the highest values of the classification accuracy. In all experiments, the ranges of these parameters were set as follows: \(c_{1}\) in [2, 9], \(c_{2}\) in [2, 9], and \(T\) in [2, 50].

The italicized scores concern the accuracies of the fuzzy classifier in the cases where it outperformed the other state-of-the-art classifiers. The entries of the table marked in bold show the cases where the relational classifiers perform quite competitively vis-à-vis the classifiers with the highest accuracies. Overall, we stress that data representation plays a crucial role in achieving high classification rates, especially when dealing with classification data with a high number of classes. In such cases, the variables z\(_{l}\) and \(\Delta \)z\(_{l}\) might not be fully representative, as is visible for data sets such as Adiac, 50words, and FaceAll.

7 Conclusions

The proposed architecture of granular classifiers, which in a synergistic manner exploits the technologies of Computational Intelligence, opens an avenue for the comprehensive analysis and design of temporal data. First, the way in which the feature spaces and their granular versions are formed is beneficial in producing spaces of high discriminative properties. The granulation of information, realized both in time (through the formation of time slices, i.e., temporal windows) and in space (in the feature spaces), becomes important here. The relational format of the classifier is an interesting alternative given its underlying logic nature: the transparent, easy-to-interpret format of the mapping from the collection of information granules to class labels becomes a design asset, forming a collection of rules.

While the proposal has been implemented in a certain specific setting in terms of the feature spaces and the form of the classifiers, it offers a genuine wealth of design possibilities (in terms of building feature spaces, their granulation, and the construction of the classifiers), which are definitely worth exploring. There is also the interesting issue of building temporal information granules (instead of the uniform granulation of time series encountered in most of the existing studies).