1 Introduction

In the era of big data, processing complex information that belongs to a high-dimensional space is a common challenge for the research community (Angelov and Kasabov 2005; Trevisan et al. 2010; Kelly et al. 2010; Angelov et al. 2010; Angelov and Yager 2013; Trevisan et al. 2014). Novel artificial intelligence techniques have been developed that can learn from complex big data and produce accurate results. These techniques process the available information using the same mechanisms involved in memory formation and synaptic plasticity in the human brain and are called spiking neural network (SNN) techniques (Maass 1997; Gerstner 1995, 2001; Gerstner et al. 2012). They have already been applied to data modelling, pattern recognition and predictive systems of complex temporal or spatio-temporal data (e.g. Ghosh-Dastidar and Adeli 2007; Kasabov et al. 2016; Capecci et al. 2015; Espinosa-Ramos et al. 2017; Marks 2017), demonstrating that they are an effective alternative to traditional machine learning techniques, and their application in bioinformatics and systems biology research is rapidly gaining interest.

Bioinformatics data, such as gene expression data, possess a high level of complexity in shape and size: thousands of genes measured for a small number of samples. The problems posed by such high-dimensional, scarce data sets are numerous and mainly connected with high variance: over-fitting becomes difficult to avoid, outliers are critical and noise in the data generates real issues. To tackle these problems, we need to consider both data augmentation and feature selection approaches: data augmentation addresses sample-size planning, while feature selection identifies a relevant subset of genes that carry the information needed to study a particular event.

The interest around gene expression data, beyond their modelling complications, arises from their ability to induce protein synthesis and their implication in several pathologies, when external and/or genetic factors alter their composition and behaviour (Kumar and Mann 2009; Pertea and Salzberg 2010; Ezkurdia et al. 2014). Identifying which genes are turned on in a particular cell, in which amount and when, can help us uncover the processes behind the cell's life and understand how diseases work (Shen et al. 2012). In the era of systems biology, the most popular techniques used to quantify transcriptome data are DNA hybridization-based microarrays and next-generation sequencing (RNA-Seq) technologies (Panda et al. 2003; Wang et al. 2008). Even though RNA-Seq has gained increasing popularity for measuring the transcriptome of organisms (Mortazavi et al. 2008), DNA microarrays are still widely used for the identification of already known common allele variants, and public databases, such as the National Center for Biotechnology Information (NCBI), make this data available to the entire community. The transcriptome data provided can be considered representative of the entire community, as genome sequence similarity over individuals is very high (about 99.9%), with the same genes in the same locations (Feuk et al. 2006). People are unique by phenotype; however, every human being shares the same genetic blueprint (Shen et al. 2012).

While the measurement and analysis of steady-state microarray data are routine, the analysis of time-series gene expression profiles is of growing interest, as they can shed some light on the complex relationships between genes and on how they work together over time to determine adaptive phenotypes and the transcription factors associated with certain diseases.

Here is where SNN techniques can play a pivotal role, as they are able to identify and process only relevant changes encoded in the raw data, and learn how to reliably tell these changes apart. By extracting meaningful patterns from raw time-series data, we aim at understanding the mechanisms involved in the regulatory expression of genes.

In our study, we used time-series microarray data measured while studying the elicitation of eczema in the human skin. After modelling the data to obtain a desirable sample size and a relevant subset of genes, we classify and analyse the data using an evolving SNN model. To evaluate the significance of the classification results, we compare them with those of traditional machine learning methods. In addition, we derive the interaction between genes over time by analysing the intensity of the information transmitted between clusters of genes, represented as a gene interaction network (GIN). This is a novel approach that could lead to biologically realistic models and to a better understanding of the phenomenon under study. More importantly, by applying this methodology, we aim at identifying the clinical implications behind the regulatory processes involved in the expression of certain genes over time.

The rest of the manuscript is organized as follows: Sect. 2 gives a general description of the proposed methodology, Sect. 3 presents the methodology in detail and illustrates it on a case study with experimental results, and Sect. 4 closes with conclusions and future work.

2 A general description of the proposed methodology

The methodology is graphically represented in Fig. 1. In this section, we explain the general procedures of this methodology and in the following Sect. 3 we present the methodology in its computational and algorithmic details and demonstrate it on a case study example.

2.1 Dimensionality reduction

Gene expression profiling includes the expression of thousands of genes; however, some of these show only minor variation in their profile, which implies that not all genes are differentially expressed over time. Since we are interested only in those that exhibit temporal interdependency, we should remove unregulated genes from the original data. This can be done by filtering out the data that show low variance (Sebastiani et al. 2003; Kohane et al. 2002). The dynamic phenomena behind temporal gene expression data can also be captured by permutation entropy (Sun et al. 2010), as low entropy characterises genes with small fluctuation across time points; thus, genes whose entropy falls below a certain percentile can be filtered out as well (Kohane et al. 2002). The underlying problem, coined the curse of dimensionality (Verleysen and François 2005), is well known in machine learning: a large number of dimensions/features degrades the performance of algorithms, makes the target function unnecessarily complex and may lead to an over-fitted model, since the data set often lies on a lower-dimensional manifold.
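To make the entropy criterion concrete, the following minimal sketch (our illustration, not part of the original pipeline) computes a normalised permutation entropy for a single gene profile; genes scoring below a chosen percentile would then be removed. The function name and the pattern order are illustrative assumptions.

```python
import math
import numpy as np

def permutation_entropy(profile, order=3):
    """Normalised permutation entropy of a 1-D expression profile:
    the Shannon entropy of the distribution of ordinal patterns of
    length `order`. Values near 0 flag genes with little temporal
    fluctuation across time points."""
    x = np.asarray(profile, dtype=float)
    patterns = [tuple(np.argsort(x[i:i + order]))
                for i in range(len(x) - order + 1)]
    probs = np.array([patterns.count(p) for p in set(patterns)], dtype=float)
    probs /= probs.sum()
    return float(-(probs * np.log2(probs)).sum() / math.log2(math.factorial(order)))
```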

2.2 Data augmentation

In order to properly train and test a classifier, the size of both the training and the testing samples needs to be taken into account. A small sample size affects the ability of the model to find a solution to ill-posed multi-class problems.

Dealing with sample-size planning is a central part of model design, especially when statistically independent samples are limited and expensive to obtain (Beleites et al. 2013). The problem is particularly acute for high-throughput sequencing, where the shape and size of gene expression data, consisting of hundreds of genes measured for a small number of samples, represent complex phenomena such as the molecular mechanisms involved in gene expression and the subsequent synthesis of proteins.

This issue has already been tackled using empirically fitted learning curves (Mukherjee et al. 2003) and parametric probability models (Dobbin and Simon 2006; Dobbin et al. 2008) to estimate the training set size needed to classify gene expression data. Learning curves of trained data have also been used for training and testing size planning by taking into account the confidence interval for an expected sensitivity (Beleites et al. 2013). The latter study found that a reasonable precision in the validation of a classifier can be achieved with 75–100 samples.

Sample size is therefore an important aspect of the experiments that needs to be considered. Simulated data can be produced from the real data set using a neural network, such as a simple autoencoder, which generates a compact representation of the empirical data. Autoencoder approaches have been demonstrated to be straightforward and more efficient when compared to other data augmentation techniques (Ferreira et al. 2017). Autoencoders are shallow neural networks that consist of an encoder and a decoder: the input data are fed into the encoder, where parameterized non-linear transformations convert the data into a new representation; the decoder then reconstructs the original input through non-linear transformations (DeVries and Taylor 2017). This makes autoencoders able to predict new data in a compact way (Møller 1993; Olshausen and Field 1997; The MathWorks Inc. 2018a).
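The idea can be sketched in a few lines (a minimal illustration, not the MATLAB implementation used later in this paper; the hidden-layer size and noise scale are assumptions):

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

def augment_with_autoencoder(X, n_new, hidden=16, noise_scale=0.05, seed=0):
    """Train an autoencoder (input -> compact hidden layer -> input) and
    produce `n_new` synthetic samples as reconstructions of noisy copies
    of real samples."""
    rng = np.random.default_rng(seed)
    ae = MLPRegressor(hidden_layer_sizes=(hidden,), activation="tanh",
                      max_iter=2000, random_state=seed)
    ae.fit(X, X)  # learn to reconstruct the data through a bottleneck
    idx = rng.integers(0, len(X), size=n_new)
    noisy = X[idx] + noise_scale * rng.standard_normal((n_new, X.shape[1]))
    return ae.predict(noisy)  # decoded samples serve as augmented data
```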

2.3 Feature selection

Each microarray is a high-dimensional space that describes the expression of thousands of genes at the same time. The dimensionality of the feature space (the genes) directly influences the amount of training data needed to ensure the correct functioning of the classifier. To solve this issue, we need to apply feature selection techniques to establish which genes play a pivotal role in the regulation of the process being studied. This will in turn adequately describe the distribution of the data and avoid over-fitting of the model.

Feature selection techniques for static bioinformatics data use filter, wrapper or embedded methods combined with artificial neural network (ANN) or support vector machine (SVM) techniques to rank and score a set of features (McLachlan et al. 2005a; Causton et al. 2009; Roffo et al. 2017; Huang et al. 2018). Filter-based techniques rank features by extracting inherent properties from the data and are independent of the classifier used. A popular filter algorithm is the Laplacian score (LS-FS) feature selection (He et al. 2006), which uses a nearest-neighbour graph to evaluate the relevance of a feature according to its power of preserving locality. Recently, another graph-based feature selection algorithm, called the eigenvector centrality feature selection (EC-FS) (Roffo et al. 2017), has been proposed. Wrapper and embedded methods depend on the training of an SVM classifier to rank a set of features; a popular embedded method is feature selection via concave minimization (FSV) (Bradley and Mangasarian 1998).

With the increase of temporal bioinformatics data, other works have attempted to engineer features that take the temporality of gene expression into consideration (Argyriou et al. 2007; Tomašev et al. 2015; Radovic et al. 2017). Radovic et al. (2017) developed the temporal minimum redundancy–maximum relevance (TMRMR) feature selection algorithm for temporal gene expression data. TMRMR takes the temporal component of the features into account while measuring their relevance and redundancy by means of a greedy search based on a mutual information quotient (MIQ) criterion (Tapia and Perez 2013). In Radovic et al. (2017), TMRMR is compared with popular static feature selection approaches, and the C variant of the TMRMR algorithm (TMRMR-C) outperformed most of the other methods. It has also already been applied as a temporal feature selection approach for classifying transcriptome data with SNN (Koefoed et al. 2018).

2.4 Evolving SNN model: the NeuCube architecture

To model and analyse the dynamic behaviour of a complex biological process, such as the interaction between thousands of genes over time, we have designed a novel methodology based on the NeuCube SNN architecture. NeuCube is an SNN computational modelling framework that constitutes a set of tools for spatio-temporal data modelling and understanding. It was developed at the Knowledge Engineering and Discovery Research Institute of the Auckland University of Technology (Kasabov 2012; Chen et al. 2013; Kasabov 2014; Tu et al. 2014; Kasabov et al. 2016).

The NeuCube general framework comprises a spike-time encoding module, to encode continuous-value input information into spike trains; a recurrent 3D SNN module; and an evolving SNN output classification module (Kasabov 2014). Compared with traditional statistical and machine learning methods, NeuCube has demonstrated improved classification results (Kasabov and Capecci 2015), a better analysis of the spatial and temporal activity modelled in the network (Marks 2017), and a better understanding of the data and the processes measured through it, including the genetic processes that regulate the data (Capecci et al. 2015; Espinosa-Ramos et al. 2017).

3 A detailed presentation of the proposed methodology and a case study illustration

3.1 SNN architecture

Figure 1 shows a graphical representation of the novel SNN methodology designed for time-series gene expression data analysis.

Fig. 1

The figure shows the gene expression data modelling procedure and the proposed SNN methodology for time-series gene expression data analysis. The SNN methodology can be divided into four main modules: an input data encoding module; a 3D SNN cube module; an output module for classification; and a module for knowledge extraction from the data by means of GIN analysis

This SNN system comprises four major modules:

  • Gene expression data modelling.

  • Input module, where the transcriptomics data is first modelled and then encoded into trains of spikes, each spike representing a change in the gene expression over time.

  • 3-D SNN cube (SNNc) module (the Cube), where the encoded time-series data is entered and learned.

  • Output module, where data is classified and new knowledge discovered by means of a GIN.

In the following section, this procedure is explained in detail.

3.2 Transcriptomics data description

As a case study for our problem, gene expression data collected during the elicitation of allergic contact dermatitis (ACD) over time has been used. Data was obtained from the publicly available Gene Expression Omnibus (GEO) repository of functional genomics data [NCBI GEO (Edgar et al. 2002; Barrett et al. 2012) accession GSE6281; (Pedersen et al. 2007)]. Data was collected from a control group and a group of people that had shown an inflammatory response to a nickel patch test. Expression profiling was performed by hybridization of high-density oligonucleotide arrays on skin biopsies of 7 nickel-allergic female subjects and 5 non-allergic female controls. Biopsies were taken at four time points: 0 h, 7 h, 48 h and 96 h. The control group did not show eczema at any time point, while the nickel-allergic patients reacted with eczema at 48 and 96 h only. Samples were analysed using microarray technology that measured transcript levels in the biopsies.

3.3 Gene expression data modelling

To extract the relevant number of genes that interact with each other over time and to develop a computational model of the GIN, we need to evaluate how relevant the original features of the model are, using classification accuracy as the objective function for the problem. To achieve this, we first filtered the original data to remove genes with low variance and low entropy; secondly, we applied feature selection to both the static and the temporal component of the data.

3.3.1 Data filtering

Firstly, the data was \(\log _2\)-transformed, so as to measure only proportional changes of gene expression over time, as these are considered biologically more relevant (McLachlan et al. 2005a). In particular, \(\log _2\) transformations are widely recommended in the bioinformatics literature, as they translate microarray-detectable expression levels into a binary representation of fold changes. The normalised data was filtered by removing genes with low variance and with entropy lower than the 30th percentile. Figure 2 shows the expression profile variance before and after the filters were applied to the original data; a minimal sketch of the filtering step follows the figure.

Fig. 2

Dimensionality reduction of gene expression data. Top: expression profile variance of the original data. Middle: expression profile after filtering genes with low variance. Bottom: expression profile after filtering the data with entropy level below the 30th percentile
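A minimal sketch of this step (our illustration; the percentile used for the variance filter is an assumption, as the text specifies the 30th percentile only for the entropy filter):

```python
import numpy as np

def hist_entropy(x, bins=10):
    """Shannon entropy of a gene's expression histogram."""
    counts, _ = np.histogram(x, bins=bins)
    p = counts[counts > 0] / counts.sum()
    return float(-(p * np.log2(p)).sum())

def filter_genes(expr, var_pct=30, ent_pct=30):
    """expr: (genes x time-points) intensity matrix. Log2-transform,
    then keep genes above the given variance and entropy percentiles."""
    logged = np.log2(expr + 1.0)  # +1 guards against log2(0)
    var = logged.var(axis=1)
    ent = np.apply_along_axis(hist_entropy, 1, logged)
    keep = (var >= np.percentile(var, var_pct)) & \
           (ent >= np.percentile(ent, ent_pct))
    return logged[keep], keep
```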

3.3.2 Data augmentation

In our experiments, data was augmented twice using an autoencoder NN:

  • firstly, we increased and balanced the number of samples from the original data to achieve a reasonable precision in the classifier validation (Beleites et al. 2013);

  • secondly, the time points of data collection were augmented so that enough spikes could be encoded to stimulate the SNN model.

Data was divided into four subsets, according to the time of collection (i.e. 0, 7, 48 and 96 h of nickel exposure). An autoencoder, implemented in Matlab's Statistics and Machine Learning Toolbox (2018a), was used to augment and balance the data to obtain a size of 50 samples per class and per time point from the original 7–10 available. This means that we generated a total of 100 samples per time point by combining real and simulated data. This size is necessary to achieve a reasonable precision in the validation of the model (Beleites et al. 2013). The autoencoder was trained using scaled conjugate gradient back-propagation.

The performance of the neural network was measured by calculating the mean squared error with L2 and sparsity regularizers, to verify that the minimum gradient was reached by the model at a given epoch (Møller 1993; Olshausen and Field 1997).

Figure 3 shows an example of the application of the autoencoder method to generate synthetic data to increase the number of samples.

Fig. 3

Augmentation of the static component of the data. A neural network, an autoencoder, is applied to the original data to generate a compact representation of the data. Top: in red is plotted the log2 of the relative expression level of nickel allergic subjects with no nickel exposure. Bottom: in green is plotted the synthetic data generated. (Color figure online)

The temporal component of the data was augmented from the original four data points (0 h, 7 h, 48 h and 96 h) up to 20 time-points by using an autoencoder, as explained in the previous section (Fig. 4).

Fig. 4

Augmentation of the temporal component of the data. Top: plotted in red is the log2 of the relative expression level of a sample belonging to a nickel allergic subject with no nickel exposure. Bottom: in green is plotted the data augmented by its temporal component from 4 time points to 19 time points. (Color figure online)

3.3.3 Feature selection

Features were selected at two different stages of the experimental process:

  • firstly, a combination of three popular feature selection methods [the LS-FS (He et al. 2006), EC-FS (Roffo et al. 2017) and FSV (Bradley and Mangasarian 1998) algorithms] was used to extract the most relevant features from the static component of the data;

  • secondly, a recently proposed temporal feature selection technique [the TMRMR-C approach (Radovic et al. 2017)] was used to select the subset of genes responsible for the temporal regulation of the molecular mechanisms involved in ACD.

Popular feature selection approaches based on filters, wrappers, and embedded methods were applied to the static component of the augmented data, which was originally divided into four subsets, according to the time of collection (i.e. 0, 7, 48 and 96 of hours nickel exposure).

Experiments were carried out using the publicly available feature selection code library (FSLib) (Roffo 2018; Roffo et al. 2015, 2017). For each subset, we reduced the number of features from 344,34 to 200 by applying the LS-FS, EC-FS and FSV algorithms; the original paper (Roffo et al. 2017) reports that, when applied to microarray data, these methods generated the best accuracy results with this number of features. The quality of each subset of features was estimated by means of the classification accuracy of a linear-SVM classifier trained on 80% of randomly partitioned data and tested on the remaining 20%. Only these three approaches were used, as they scored the highest accuracy for all four subsets of data (i.e. time 0–96 h). The 200 highest-ranked features returned by the LS-FS, EC-FS and FSV approaches were then intersected to assess the validity of the selected genes (a sketch of this step is given below), which yielded a total of 151 unique features shared across the subsets. Finally, the data was sorted according to the time of collection, to study the regulation of gene expression patterns over time. It has been found that about 30–50 genes are the optimum number for studying a molecular process (Beer et al. 2002); several authors have observed that classifier accuracy improves around these values and, similarly to what we aim to achieve, have used subsets of this size as a reference to draw network graphs of features interacting with each other (Radovic et al. 2017; Roffo et al. 2017; Koefoed et al. 2018). We therefore applied the TMRMR-C approach to identify, out of the selected features, not only those that influenced gene expression at every occurrence of data collection, but also the genes that are temporally interdependent. Once the temporal component was augmented, the TMRMR-C method selected 45 genes out of the 151 available. These genes were: MICU3, RPGRIP1L, PLGLB1, ZEB1, MGP, SCN3A, GNG11, SLITRK6, PAK3, AW590588, TCEAL7, NCOA2, ANKRD29, MMRN1, LOC101930404, PDE4DIP, EIF4E3, COL28A1, DSEL, ENPP5, PWAR6, LRRC2, MPPED2, CNTN4, AI222435, LOC653602, PDCD4, WISP1, CLIC5, ABCC9, FMO3, TMEM255A, SNX16, EBF1, RFTN2, C1QTNF7, SUSD5, EPHA3, FIGN, AU156772, RBMS3, LEPROT, BF224430, LOC100507376 and SIM1.
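The intersection across ranking methods reduces to a few lines (a minimal sketch, assuming each method returns its features sorted by decreasing relevance):

```python
def intersect_rankings(rankings, top_k=200):
    """rankings: dict mapping a method name ('LS-FS', 'EC-FS', 'FSV') to a
    list of feature identifiers sorted by decreasing relevance. Returns
    the features appearing in the top_k of every method."""
    top_sets = [set(r[:top_k]) for r in rankings.values()]
    return sorted(set.intersection(*top_sets))
```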

The subsequent experiments were performed on the final data set of \(20\times 45\times 100\) (time-points/features/samples).

3.4 Encoding and mapping of data into the SNN cube for unsupervised learning

3.4.1 Encoding algorithm

The gene expression data was first arranged as a sequence of vectors, which was fed to the SNN cube encoding module. Every vector was converted into spike trains by applying a binarization algorithm; several algorithms can be used, depending on the data and the scope of the study. In this case, we have chosen Ben's Spiker Algorithm (BSA) (Schrauwen and Van Campenhout 2003; Nuntalid et al. 2011), which produces unipolar spike trains with two states (firing or non-firing). BSA encodes only positive spikes, to emulate the excitatory processes observed in glutamatergic synapses. The algorithm applies a finite impulse response (FIR) reconstruction filter and, for every time \(\tau\), calculates two error metrics as follows:

$$\begin{aligned} \sum _{k=0}^{M} \left| s(k+\tau )-h(k)\right| \quad \text {and}\quad \sum _{k=0}^{M} \left| s(k+\tau )\right| . \end{aligned}$$
(1)

If the first error is smaller than the second minus a threshold, a spike is fired and the filter is subtracted from the input; otherwise, no spike is emitted. BSA therefore depends on a threshold, which needs to be optimised (as explained in Sect. 3.5.2). Figure 5 shows an example of data encoded into spikes using the BSA binarization procedure; a minimal sketch of the algorithm follows the figure.

Fig. 5

Encoding of the signal generated by one of the features/genes into a spike train by the BSA algorithm. In this example, only sample one of the MICU3 gene is shown
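A minimal re-implementation of the encoding loop described above (ours; the filter h and the threshold are assumed given, the latter obtained by optimisation):

```python
import numpy as np

def bsa_encode(signal, fir, threshold):
    """Ben's Spiker Algorithm: encode a 1-D signal into a unipolar spike
    train using an FIR reconstruction filter `fir` of length M."""
    s = np.asarray(signal, dtype=float).copy()
    h = np.asarray(fir, dtype=float)
    M = len(h)
    spikes = np.zeros(len(s), dtype=int)
    for t in range(len(s) - M + 1):
        err1 = np.abs(s[t:t + M] - h).sum()  # error if a spike is fired
        err2 = np.abs(s[t:t + M]).sum()      # error if no spike is fired
        if err1 < err2 - threshold:          # fire and subtract the filter
            spikes[t] = 1
            s[t:t + M] -= h
    return spikes
```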

3.4.2 Spiking neurons mapping

The encoded spike trains are fed into a cube of, e.g., \(10\times 10\times 10\) leaky integrate-and-fire (LIF) neurons. Input features are mapped as proposed in Tu et al. (2014, 2017), as this method optimises the mapping of the 45 input variables into the SNN cube using a graph matching algorithm. The graph matching algorithm is implemented by constructing two weighted graphs (i.e. graphs whose branches/edges have numerical weights assigned) as follows:

  • the neuron similarity graph (NSG), where the input neurons with coordinates \(V_{NSG} = \{(x_i, y_i, z_i)\,|\,i=1,\ldots ,v\}\) are connected to their k nearest input neurons, with edges weighted by the multiplicative inverse of the Euclidean distance determined by their three coordinates (x, y, z).


  • the time-series similarity graph (SSG), where the set of spike trains of all features is \(V_{SSG} = \{(s_i)\,|\,i=1,\ldots ,v\}\) with

    $$\begin{aligned} s_i = \sum _{k=1}^{N_i} \delta (t-t_k) \end{aligned}$$
    (2)

    where \(s_i\) is the spike train of temporal variable i, v is the number of temporal variables, t represents the observed time length of each sample and \(\delta\) is the Dirac delta function. In this graph, every spike density function is connected to its k most highly correlated neighbours, and the graph branches are weighted according to their similarity. The similarity between spike trains is measured using either spike density correlation or maximum spike coincidence.

  • After constructing the two weighted graphs, NSG and SSG, the graph matching technique is used to determine the optimal mapping between them. The vertex and edge similarities are defined as:

    $$\begin{aligned}&exp(-|\widetilde{d}(i_{NSG})-\widetilde{c}(i_{SSG})|^2/2\sigma ^2_n); \nonumber \\&\quad i_{NSG},i_{SSG} =1,\ldots ,v \end{aligned}$$
    (3)

    where \(i_{NSG}\) and \(i_{SSG}\) represent the vertices of the NSG and SSG graphs, respectively, while \(\widetilde{d}(i_{NSG})\) and \(\widetilde{c}(i_{SSG})\) represent their weights to all other vertices.

    $$\begin{aligned}&exp(-|a^{NSG}_{ij}-a^{SSG}_{kl}|^2/2\sigma ^2_e);\nonumber \\&\quad i,j,k,l =1,\ldots ,v \end{aligned}$$
    (4)

    where \(a^{NSG}_{ij}\) and \(a^{SSG}_{kl}\) represent the graph edge weights for the NSG and SSG, and \(\sigma _n\) and \(\sigma _e\) are parameters that control the affinity between neurons and edges, respectively.

  • The factor graph matching (FGM) algorithm (Zhou and De la Torre 2012) is used to find the permutation matrix P that minimizes the following objective function:

    $$\begin{aligned} \min _{P} \left\| A_n - P A_s P^T \right\| ^{2}_{F} \end{aligned}$$
    (5)

    where \(A_n\) and \(A_s\) are the adjacency matrices of the NSG and SSG graphs and \(\Vert \cdot \Vert _F\) denotes the Frobenius matrix norm.

Optimising this process improved the results greatly, as the mapping influences the learning and, therefore, the understanding of the time-series patterns created by the data. The spiking LIF neurons of the Cube are connected using a small-world (SW) connectivity rule: connection weights are computed as the multiplication of a random number in \([- \ 0.1,+ \ 0.1]\) and the reciprocal of the Euclidean distance d(i, j) between the pre-synaptic (i) and post-synaptic (j) neurons, calculated according to their coordinates. Another parameter, the distance threshold \(D_{thr}\), constrains d(i, j); it is calculated as follows:

$$\begin{aligned} D_{thr} = max~(d(i,j))~p \end{aligned}$$
(6)

where p is the parameter of the SW connectivity, so that only neuron pairs with \(d(i,j) < D_{thr}\) are connected. Twenty percent of these connection weights are selected to be negative (inhibitory), as in the mammalian brain the proportion of inhibitory neurons is found to be about 20–30% (Capano et al. 2015), while 80% are positive/excitatory connection weights. A minimal sketch of this initialisation is given below.
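The initialisation can be sketched as follows (our illustration; the value of p and the handling of the random sign are assumptions consistent with the description above):

```python
import numpy as np

def init_small_world(coords, p=0.15, inhibitory_frac=0.2, seed=0):
    """Initialise the Cube's connection weights with the SW rule:
    w(i,j) = rand / d(i,j) for pairs closer than D_thr = max(d) * p,
    with ~20% of the connections made negative (inhibitory).
    coords: (n, 3) array of neuron coordinates."""
    rng = np.random.default_rng(seed)
    d = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    d_thr = d.max() * p                               # Eq. (6)
    W = np.zeros_like(d)
    mask = (d > 0) & (d < d_thr)                      # no self-connections
    w = rng.uniform(0.0, 0.1, mask.sum()) / d[mask]   # weight magnitudes
    sign = np.where(rng.random(mask.sum()) < inhibitory_frac, -1.0, 1.0)
    W[mask] = w * sign
    return W
```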

Finding an appropriate initial structural connectivity in a network of spiking neurons is important, as it allows the SNN model to accurately learn from the data and capture patterns of functional connectivity from it. This preserves the spatio-temporal relationships within the data, which are a significant source of information generally overlooked by other techniques. Unsupervised learning in the Cube is then performed to modify the initially set connection weights. The SNNc learns to activate the same groups of spiking neurons when similar input stimuli are presented, also known as a polychronization effect (Izhikevich 2006). Hebbian learning rules (Hebb 1949) are applied at this stage, implementing spike-timing-dependent plasticity (STDP)-like algorithms (Song et al. 2000).

3.4.3 Unsupervised learning

The STDP protocol (a Hebbian-like rule) is used in the Cube for unsupervised learning. Under this protocol, the connection between pre- and post-synaptic neurons becomes stronger the longer and the more often their joint activation repeats in time. This is computed as follows:

$$\begin{aligned} w_j(t) = \left\{ \begin{array}{l l} w_j(t-1) \pm \alpha /\varDelta t &{} \quad {\text {t}}_{\text {j}} \ne {\text {t}}_{\text {i}}\\ w_j(t-1) &{} \quad {\text {t}}_{\text {j}} = {\text {t}}_{\text {i}} \end{array} \right. \end{aligned}$$
(7)

where \(\alpha\) is the rate of the STDP learning protocol and \(\varDelta t\) indicates the time since the last spike was emitted by the post-synaptic neuron j. If a pre-synaptic neuron i fires before a post-synaptic neuron j, then their connection weight \(w_{j,i}\) increases; otherwise, it decreases. In this way, neurons develop the ability to form new connections in the network that can then be studied. Moving from data towards knowledge, the proposed SNN system can form an associative type of memory that can be analysed, which cannot be achieved with purely statistical or mathematical techniques. A minimal sketch of the update rule follows.
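For one synapse, Eq. (7) reads as below (our sketch; the sign convention follows the text, with potentiation when the pre-synaptic neuron fires first):

```python
def stdp_update(w, t_pre, t_post, alpha=0.01):
    """STDP-like weight update (Eq. 7): strengthen the synapse when the
    pre-synaptic spike precedes the post-synaptic one, weaken it
    otherwise; simultaneous spikes leave the weight unchanged."""
    dt = abs(t_post - t_pre)
    if dt == 0:
        return w                  # t_j == t_i: no change
    if t_pre < t_post:
        return w + alpha / dt     # pre fires first: potentiation
    return w - alpha / dt         # post fires first: depression
```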

3.5 Data classification with supervised learning

3.5.1 Supervised learning

During supervised training, the same data used for the unsupervised learning is propagated again through the trained Cube, and output neurons are created and trained to classify the data. To learn and classify the spiking activity of the Cube, we use the dynamic evolving spiking neural network (deSNN) classifier, which allows simple class-based discrimination (Kasabov et al. 2013). The spiking activity generated throughout the learning process can be visualised and used as feedback. The deSNN combines the rank-order (RO) learning rule (Thorpe and Gautrais 1998) and the STDP protocol: after training, each sample is associated with an output neuron that is connected to every neuron of the Cube. First of all, the connection weights \(w_{i,j}\) between an input neuron i and the output neuron j are all set to zero. Then, they are computed according to the RO learning rule:

$$\begin{aligned} w_{i,j} = mod^{order(i,j)} \end{aligned}$$
(8)

where mod indicates a modulation factor and order(i, j) the order of arrival of the first incoming spike. The subsequent connection weights are then modified according to the spike-driven synaptic plasticity (SDSP) learning rule, implemented as follows:

$$\begin{aligned} w_{i,j}(t)=\left\{ \begin{array}{l l} w_{i,j}(t-1)+drift &{} \quad S_i(t)=1 \\ w_{i,j}(t-1)-drift &{} \quad S_i(t)=0 \end{array} \right. \end{aligned}$$
(9)

where drift indicates a parameter used to modify the weights and \(S_i(t)\) indicates whether a spike arrives from neuron i at a time t after the first one was emitted.
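The two rules combine as in the following sketch (ours; it assumes binary input spike trains, one row per connection to the output neuron):

```python
import numpy as np

def desnn_weights(spike_trains, mod=0.9, drift=0.005):
    """deSNN weights for one output neuron: the RO rule (Eq. 8) sets the
    initial weight from the rank of each input's first spike; the SDSP
    drift (Eq. 9) then adjusts it at every later time step.
    spike_trains: (n_inputs, T) binary array."""
    n_inputs, T = spike_trains.shape
    first = np.array([int(s.argmax()) if s.any() else T for s in spike_trains])
    order = first.argsort().argsort()        # spike-arrival rank per input
    w = np.zeros(n_inputs)
    for i in range(n_inputs):
        if first[i] == T:                    # this input never spiked
            continue
        w[i] = mod ** order[i]               # rank-order initial weight
        for t in range(first[i] + 1, T):     # drift after the first spike
            w[i] += drift if spike_trains[i, t] else -drift
    return w
```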

According to the literature, both mod and the drift parameter values need to be optimised. In the next section, we report the procedure followed to optimise these and the other parameters of the NeuCube system.

3.5.2 Classification

To investigate whether the system was able to correctly discriminate between the two groups of people, we classified the data into two classes: class one, patients that showed ACD, and class two, controls. In total, we had 100 samples (50 per class), 20 time steps of data collection and 45 features/genes. Optimisation of the numerous parameters of the SNN system was performed via a grid search, which appraised the combinations between a minimum and a maximum value assigned to each parameter to be optimised; the combination with the highest classification accuracy was then selected. This process was executed as a loop, each iteration generating a new random cube evaluated using fivefold cross-validation, as sketched below.
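The search loop itself is generic (a minimal sketch; `evaluate` stands for building a new random cube and returning its fivefold cross-validation accuracy, and the parameter ranges shown are illustrative, not the values used here):

```python
from itertools import product

def grid_search(evaluate, param_grid):
    """Exhaustive search over all parameter combinations; `evaluate`
    returns a cross-validated accuracy for one parameter setting."""
    best_acc, best_params = -1.0, None
    names = list(param_grid)
    for values in product(*(param_grid[n] for n in names)):
        params = dict(zip(names, values))
        acc = evaluate(**params)      # fivefold CV on a fresh random cube
        if acc > best_acc:
            best_acc, best_params = acc, params
    return best_params, best_acc

# Illustrative ranges only (hypothetical values):
grid = {"mod": [0.6, 0.8, 0.9], "drift": [0.002, 0.005, 0.01],
        "bsa_threshold": [0.5, 0.7, 0.9]}
```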

Having set the parameter values to use in the experiments, the classification accuracy of the model was assessed by running experiments on randomly selected data, with 75% used for training and 25% for testing. The optimised parameter values and the results obtained with them are reported in Table 1.

Table 1 Classification accuracy percentage obtained with the SNN model

These results evince the ability of the SNN system to discriminate between the two classes, which is a good indication that the proposed method can be used for classifying time-series gene expression data. To validate these results, we ran the same experiments using a classical machine learning method for gene data analysis: an SVM classifier.

The SVM classifier is a popular method for the classification of gene expression data that has been widely proposed in the literature (McLachlan et al. 2005b; Wit and McClure 2004; Causton et al. 2009). It is based on finding the hyperplane that best divides a dataset into two classes. It is accurate when classifying small, clean datasets, but less suited to larger and noisier data sets, which biological data can be.

Here, we have utilised the error-correcting output codes (ECOC) model (The MathWorks Inc. 2018b) with a one-versus-one coding design for multiclass SVM classifiers, as this model is able to improve classification accuracy when compared to other multiclass models (Fürnkranz 2002). The parameters of this model were optimised using automatic hyperparameter optimization (Fürnkranz 2002; The MathWorks Inc. 2018b), which minimises the fivefold cross-validation loss. The optimised SVM classified the data with the same accuracy as the SNN model (99.9% correctly classified observations), which confirmed the validity of our experiments. In addition, however, the proposed methodology offers a better understanding of the data and of the processes measured through it by means of network analysis, which is not possible with other methods, such as SVM. This enables new information and knowledge discovery through meaningful interpretation of the models and of the processes that regulate gene expression over time by means of GIN analysis.
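For reference, an analogous baseline can be set up in Python as below (a sketch with placeholder data; the baseline reported above used MATLAB's ECOC model with automatic hyperparameter optimisation instead):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Placeholder data standing in for the 100-sample, 45-gene matrix.
X, y = make_classification(n_samples=100, n_features=45, random_state=0)

clf = SVC(kernel="linear")  # SVC uses one-vs-one for multiclass problems
print(cross_val_score(clf, X, y, cv=5).mean())
```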

3.6 Analysis of the results and knowledge discovery

The learned spiking activity and connectivity generated by the time-series gene expression data can be analysed by means of a GIN. The input genes mapped into the Cube are used as sources of information to define the centres of clusters of neurons. Each of the remaining unlabelled neurons of the Cube is assigned to one of these clusters according to the amount of information exchanged with each feature during unsupervised learning. The more spikes transmitted between clusters, the greater the spatio-temporal interaction between genes over time. The obtained GIN is a connectivity graph, where nodes represent gene variables, and the lines and their thickness define the amount of temporal information exchanged between genes. In other words, the interaction between input variables is captured in the GIN in terms of their changes in time. The information shown in the GIN is used to analyse the complex temporal patterns “hidden” in the data, which can be used for the development of new methodologies of gene expression data modelling and understanding.

3.6.1 Information clustering for knowledge discovery

As detailed in Tu et al. (2014, 2017) and extended in Doborjeh et al. (2017), since the input features are the sources of information in the SNN cube, each of them is used to define the centre, number and label of a cluster, to which the other unlabelled neurons of the Cube are assigned by computing the following:

  • Firstly, following network theory (Chung 1997), we compute the normalised adjacency graph G, whose vertices V are the neurons and whose edges E are weighted either by the spiking activity \(s_{ij}\) or by the change in connection weight \(w_{ij}\) between pre-synaptic (i) and post-synaptic (j) neurons of the Cube, depending on which kind of information transmission we want to cluster. A stronger \(E_{ij}\) represents a stronger interaction and more information exchanged between the neurons.

  • Secondly, we compute the probability \(P_v\) that a vertex \(v \in V\) of the graph G is reached after n steps of a random walk. The initial distribution \(p^0_v\) of a random walk starting at input vertex \(v_i\) is defined as:

    $$\begin{aligned} p^0_v = \left\{ \begin{array}{l l} 1 &{} \quad \text {if } v = v_i\\ 0 &{} \quad \text {otherwise} \end{array} \right. \end{aligned}$$
    (10)

    and subsequently the probability is calculated as:

    $$\begin{aligned} \forall v \in V : P^{n+1}_v =\sum _{(u,v)\in E} \frac{E_{uv}}{d(u)} P^n_u \end{aligned}$$
    (11)

    where d(u) is the weighted degree of vertex u (i.e. the sum of the weights of its neighbouring edges). \(P^n_v\) is iterated until it converges, \(P^{n+1}_v = P^n_v\); each neuron \(v \in V\) is then assigned to the starting input vertex for which it has the highest membership \(p_v\) (see the sketch after this list).
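A minimal sketch of this clustering (ours; it runs the walk for a fixed number of steps rather than testing convergence explicitly):

```python
import numpy as np

def random_walk_clusters(E, input_ids, n_steps=200):
    """Assign every neuron to the input neuron whose random walk reaches
    it with the highest probability (Eqs. 10-11).
    E: (n, n) non-negative, symmetric edge-weight matrix (spiking activity
    or connection-weight changes); input_ids: input-neuron indices."""
    input_ids = np.asarray(input_ids)
    d = E.sum(axis=1)                               # weighted degree d(u)
    T = (E / np.where(d > 0, d, 1.0)[:, None]).T    # T[v, u] = E[u, v] / d(u)
    P = np.zeros((len(E), len(input_ids)))
    P[input_ids, np.arange(len(input_ids))] = 1.0   # initial distributions
    for _ in range(n_steps):
        P = T @ P                                   # one random-walk step
    return input_ids[P.argmax(axis=1)]              # cluster label per neuron
```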

3.6.2 GIN discovery and analysis

After the optimisation procedure, we retained the best SNN model parameters, the initial connectivity and the temporal mapping of the input features. Network analysis was carried out by training on the entire time-series data available for each class separately. This process was iterated 100 times per class, to allow the system to learn from the short time-series data. First, the input features were clustered with the other neurons of the cube according to their connection weight values. The neural activity generated per class revealed 45 clusters of genes (Fig. 6, left). The proportion of activity generated in the entire 3-D cube is illustrated in the pie chart (Fig. 6, right).

Fig. 6

Top: the 3D SNN cube shows the clusters formed by the input gene features with the other neurons of the Cube according to their firing rate. Bottom: the pie chart illustrates the numerical proportion of the clustered activity expressed as a percentage with respect to the entire cube: (left) class one—people affected by ACD; (right) class two—controls. Every gene is shown in a different colour, as indicated in the legend. (Color figure online)

As shown in Fig. 6, there are three to four major clusters per class. The major component for class one is given by the cluster of gene BF224430 (13% of the total spiking activity generated during learning), followed by LOC100507376 and WISP1 (6%).

On the other hand, the major component for class two is given by the cluster of gene COL28A1 (10% of the total spiking activity generated during learning), followed by ENPP5 (7%), and MGP and AU156772 (both with 6%). Information about these genes is summarised in Table 2.

Table 2 The table reports information about the relevant genes that show higher spiking activity after learning

The total interaction between the 45 clusters was calculated per class and visualised as a GIN (Fig. 7; left: class one, right: class two). Each cluster of genes is indicated by a node with a different colour. The stronger the interaction between clusters, the thicker the edge that connects them. The edges are undirected and have a weight (a numerical value) associated with them; in this case, this value represents the spiking activity exchanged between clusters of genes. It is used to depict the quantitative change in temporal expression that one gene induces over another and/or how closely related two genes are in terms of expression stability and regulation.

Fig. 7

GINs obtained per class. The nodes represent the 45 clusters formed by the gene input features. (Left) class one—people affected by ACD; (right) class two—controls. The 45 genes/clusters are indicated in different colours corresponding to each of the respective clusters. The stronger the interaction between genes, the thicker the line that connects the nodes. (Color figure online)

Each GIN is derived from an adjacency matrix, where rows and columns are assigned to the nodes of the network (the clusters formed by the selected genes) and the presence of an edge is symbolised by a numerical value. The weights, rendered as the thickness of the edges, express more complex relationships between clusters of genes, such as the presence or absence of spiking activity between them and the amount of activity exchanged. This corresponds in turn to the temporal interaction or regulation that the clusters of genes have produced over time; a minimal construction sketch is given below.
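Building the graph from such a matrix is straightforward (our sketch, using the networkx library; the thresholding rule is an assumption):

```python
import networkx as nx

def build_gin(adj, gene_names):
    """Build a GIN from a symmetric cluster-interaction matrix: nodes are
    gene clusters, edge weights are the spikes exchanged between them."""
    G = nx.Graph()
    G.add_nodes_from(gene_names)
    for i in range(len(gene_names)):
        for j in range(i + 1, len(gene_names)):
            if adj[i][j] > 0:  # draw an edge only where activity was exchanged
                G.add_edge(gene_names[i], gene_names[j], weight=adj[i][j])
    return G
```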

This study uses the information available from transcriptome data to classify and analyse the interaction of different genes over time as a GIN. For our experimental case study, the raw gene expression data was first modelled and encoded into trains of spikes. The patterns of temporal activity generated during learning were then classified, to establish the effectiveness of the SNN methodology in separating different time-series gene expression data into different classes. Next, the patterns of temporal activity generated in the Cube after learning were analysed in terms of spiking activity and connection weight changes, and new information was extracted as a GIN to study the interaction of the genes expressed over time. The two GINs obtained show significant differences between subjects sensitised to nickel and control subjects: in comparison with the GIN of class one, the GIN of class two shows higher spiking activity exchanged between clusters. One of the main actors is gene PDCD4, which for class one is strongly connected with LOC100507376, RPGRIP1L and SLITRK6, while for class two it is strongly interconnected with SUSD5. This gene is notably involved in inhibiting events important in driving invasion and may play a role in apoptosis, which is consistent with the skin cells' reaction to nickel.

4 Conclusions and future work

The analysis of patterns generated by time-series gene expression data constitutes a major goal for bioinformatics and systems biology. In this study, we have modelled short-time transcriptome data and revealed meaningful patterns and new knowledge about the genes expressed over time by means of an SNN methodology and a GIN. For the case study data, we found that the exchange of temporal activity was dominated by the BF224430 and COL28A1 clusters. We also found that the PDCD4 gene strongly interacts with several other genes, especially in the class one GIN. According to the information available regarding the nature of these and the other genes, it seems plausible that patients affected by ACD express genes that are revealed in skin and that these genes are also related to cell proliferation, cellular transmission and inhibition, immune response and apoptosis. All of this appears to be the consequence of the assimilation of nickel in the body and the increase of eczematous reactions over time. We can conclude that SNNs are a remarkable method of choice when dealing with big transcriptomics data, as they extract only the relevant patterns of temporal activity from the gene expression data. Moreover, the clinical focus of this study has demonstrated the model's ability to constitute a valuable tool that can support experts in the areas of bioinformatics and systems biology in understanding and studying the interaction between genes expressed over time. As future work, we would like to highlight the following research lines:

  • Implementation of different feature selection methods for temporal gene expression data, such as the one proposed in Radovic et al. (2017).

  • Studying the possible implication of directionality between temporal interaction of genes in the network.

  • Comparative analysis of the results obtained with the proposed method with the ones obtained by Pedersen et al. (2007) on the same data set.

  • Applying the proposed model with the neuroreceptor dependent plasticity learning rule proposed in Espinosa-Ramos et al. (2017).

  • Developing an automated pipeline for the whole data preprocessing procedure, as a platform embedded in the main model.