1 Introduction

Analysis of negative stain transmission electron microscopy (TEM) images has long been used for the discovery and detection of virus particles. The visualization of virus structures became possible only after the development of TEM. The study of virus structures through TEM images allows organism-specific reagents for the recognition of pathogenic agents, without any a priori knowledge of the pathogens present in a sample [3]. This enables broader examination of viruses, making TEM an indispensable method for virus diagnosis. Negative staining is a preferred technique for observing small particles, such as viruses, in fluids. However, the high maintenance costs, the level of expertise and time required for manual inspection of TEM images, and advances in automated image acquisition call for the development of automated analysis of TEM images for virus detection.

TEM images provide important information regarding the shape and size of the virus particles, which can be used to group the particles into various classes. For example, viruses such as Adeno, Astro and Rota form icosahedral structures; Cowpox, Dengue and Lassa viruses show regularity in their structures; whereas Ebola, Influenza and Marburg viruses are highly irregular in shape. Early approaches for virus identification were mostly dependent on the morphology of the particles [1, 2]. However, the morphological information obtained from TEM images has not proved sufficient for classifying the viruses.

Apart from morphology, different viruses exhibit different surface textures when imaged using TEM. This information can be efficiently utilized using various texture analysis techniques for automatic classification of the virus particles from TEM images. Matuszewski et al. [12] have transformed the spatial information of the images into frequency domain using discrete Fourier transform and then, extracted features from the multiple spectral rings formed at the magnitude spectrum of TEM images to differentiate the four icosahedral structured viruses, namely, Adeno, Astro, Rota and Calici. In [17], a method has been proposed based on higher order spectral features, to capture the contour and texture information from the same four icosahedral virus images for the diagnosis. The radial density profile (RDP) has been used in [19] to discriminate the intensity variation between three maturation stages of human cytomegalovirus capsids present in the cell sections of TEM images. The RDP has also been computed based on Fourier magnitude spectrum (FRDP) in [9], which can be interpreted as a generalization of the spectral rings considered in [17]. Moreover, Harandi et al. [6] have mapped the original data to a high-dimensional Hilbert space and then computed the Covariance Descriptors (CovDs) in infinite-dimensional spaces, to address the virus classification problem. Finally, several Bregman divergences have been used to compute the dissimilarities between the resulting CovDs in Hilbert space, among which Jeffreys and Stein divergences based on the SVM have been found to achieve the best result. Recently, a new method, based on convolutional neural network, has been proposed in [7], which transforms a TEM image to a probabilistic map for virus particle detection.

The concepts of local binary pattern (LBP) [16] and RDP (LBP+RDP) have been used in [9] to identify different viruses from TEM images using a random forest classifier. Nanni et al. [13] have presented two methods, namely, NewH and Fusion; the former focuses on extracting descriptors from the co-occurrence matrix with the goal of enhancing the performance of Haralick’s descriptors [5], while the latter is an ensemble of local phase quantization variants with ternary encoding. From the results reported in the literature, it can be noticed that different virus classes are well described by different textures, which hinders the overall virus detection ability. One of the main problems in virus detection is uncertainty, which may originate from incompleteness in class definition and overlapping class boundaries. An effective paradigm to deal with incompleteness and uncertainty is the theory of rough sets [18]. It provides a mathematical framework to model uncertainties related to the given data.

In this regard, the paper presents a new method for automatic virus classification from TEM images, by judiciously integrating the merits of local texture descriptors and the theory of rough sets. Given a set of virus TEM images, the proposed method first identifies a subset of important features from the original feature set of the local texture descriptors, computed at a specific scale, and then forms the relevant feature set corresponding to each class-pair, which can appropriately differentiate the pair of virus classes. The rough hypercuboid approach is employed to evaluate the relevance of a texture descriptor in categorizing virus samples into multiple classes. In the proposed approach, the important feature set is selected in such a way that it characterizes not only the virus samples present in the data set, but also the classes to which the samples belong. The set of relevant texture descriptors, selected from the important feature sets, corresponding to each class-pair is considered to form the feature set for multiple virus classes. The SVM with linear kernel is used to evaluate the performance of the proposed method. The effectiveness of the proposed method, along with a comparison with related approaches, is demonstrated on a real-life Virus image database.

2 Proposed Algorithm

The inherent texture properties of virus TEM images are quite different from each other. The inter-class structural similarities, intra-class variations in appearance, uncertainty in virus class definitions, and presence of noise further amplify the difficulty of texture identification. In this context, the proposed method introduces a new approach for classifying viruses from TEM images. It first identifies important features, which effectively capture significant characteristics of each pair of virus classes, from the relevant texture descriptors evaluated at a particular scale, and finally forms the feature set for multiple virus classes.

Suppose \({\mathbb U}=\{y_1,\cdots ,y_k,\cdots ,y_n\}\) denotes the set of n training images of virus particles, where each image \(y_k \in \mathfrak {R}^m\). Let \({\mathbb C}=\{F_1,\cdots ,F_j,\cdots ,F_m\}\) represent the set of m features. So, a virus TEM image \(y_k\) can be represented by a set \(H_k=\{H_{k1},\cdots ,H_{kj},\cdots ,H_{km}\}\), containing m feature values corresponding to \(y_k\), where \(H_{kj}=y_k(F_j)\) represents the value of feature \(F_j\) obtained for image \(y_k\). In the proposed approach, four local texture descriptors are considered: LBP, the local binary pattern [16]; LBP\(^\mathrm{ri}\), the rotation-invariant LBP [15]; LBP\(^\mathrm{riu2}\), the rotation-invariant uniform LBP [15]; and CoALBP, the co-occurrence of adjacent LBP [14]. So, \(H_k\) represents the normalized histogram, obtained from any one of the four local descriptors, for the image \(y_k\). Let \(H_k\) be arranged in descending order and denoted by \(\tilde{H}_k=\{\tilde{H}_{k1},\cdots , \tilde{H}_{kj},\cdots ,\tilde{H}_{km}\}\) such that \(\tilde{H}_{k1} \ge \tilde{H}_{k2} \ge \cdots \ge \tilde{H}_{km}\), with the corresponding feature indices of \(\tilde{H}_{k}\) preserved in the set \({J}_k=\{{J}_{k1}, \cdots ,{J}_{kj},\cdots ,{J}_{km}\}\). Assume that each of the TEM images of \(\mathbb {U}\) belongs to one of the c virus classes \(\mathbb {U}/\mathbb {D}=\{B_1,\cdots ,B_i,\cdots ,B_c\}\), where \(\mathbb {D}\) denotes the set containing class label information.

Primarily, the significant properties of an image \(y_k\) are represented by the values of \(H_k\), the normalized histogram of \(y_k\). However, the proposed method assumes that not all the feature values of \(H_k\) contribute uniformly in representing the characteristics of the image \(y_k\). Indeed, only a subset of the features of \(\mathbb {C}\) can precisely illustrate the important properties of \(y_k\); this subset is defined as the important feature set of \(y_k\) and denoted by \(I_k\), where \(I_k \subseteq \mathbb {C}\). Moreover, each image has its own characteristics, which can be reflected by a specific set of important features.

The cumulative sum of first q features of the sorted normalized histogram \(\tilde{H}_k\) is computed to select the important features of \(y_k\) and denoted by \({E}(y_k,q)\), which is defined as the energy function in the current study. Certainly, \({E}(y_k,m)=1\) and \({E}(y_k,q)\in [0,1], \forall y_k \in \mathbb {U}\). It signifies the fraction of total energy, present in \(\tilde{H}_k\), which is retained by the first q features of \(y_k\). Thus, relevant information regarding the properties of the image \(y_k\) can be suitably represented by the energy of \(y_k\), evaluated from \(\tilde{H}_k\). In order to retain a given fraction of energy \({E}_{0}\) of the sample \(y_{k}\), the required number of important features \(d_{k}\) is obtained as the minimum value of q for which \({E}(y_k,q)\) attains \({E}_{0}\) value. The average number of important features \(\overline{d}\) corresponding to the entire set of samples \(\mathbb {U}\) is computed from the individual \(d_{k}\)s. As defined in [10], the set of important features \(I_k\) of the sample \(y_k\) can be obtained from the first \(\overline{d}\) features of sorted histogram \(\tilde{H}_k\), as follows:

$$\begin{aligned} I_k=\{F_j~|~{J}_{kq}=j \quad \mathrm{and} \quad q \le \overline{d}\}. \end{aligned}$$
(1)

Thus, the important feature set \(I_k \subseteq \mathbb {C}\) of the image \(y_k\) consists of only those features which can efficiently describe the inherent properties of \(y_k\).
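The selection of important features described above can be illustrated with a minimal sketch. It assumes that each image is given as a normalized histogram stored in a plain Python list, that features are indexed from 0, and that the average \(\overline{d}\) is rounded up to an integer; the rounding rule and all function names are assumptions made for illustration and are not specified in the text.

```python
import math


def energy(sorted_hist, q):
    """E(y_k, q): sum of the first q bins of a histogram sorted in descending order."""
    return sum(sorted_hist[:q])


def num_important(hist, e0):
    """d_k: smallest q for which the q largest bins retain at least e0 of the energy."""
    total = 0.0
    for q, value in enumerate(sorted(hist, reverse=True), start=1):
        total += value
        if total >= e0:
            return q
    # Floating-point round-off may keep the sum slightly below e0; use all m bins.
    return len(hist)


def important_sets(hists, e0):
    """Return (d_bar, [I_k for each image]) following Eq. (1)."""
    d = [num_important(h, e0) for h in hists]
    d_bar = math.ceil(sum(d) / len(d))  # assumption: the average is rounded up
    sets = []
    for h in hists:
        # J_k: feature indices ordered by decreasing histogram value
        order = sorted(range(len(h)), key=lambda j: h[j], reverse=True)
        sets.append(set(order[:d_bar]))
    return d_bar, sets
```

For instance, calling `important_sets(hists, 0.95)` on a list of histograms returns the common number of retained features together with one index set per image.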

Now, it is most likely that the samples from a specific virus class will be represented by similar sets of important features, whereas samples from different classes will be represented by different important feature sets. Therefore, the probability of occurrence \(P(F_{j}|B_{i})\) of a feature \(F_j\) in the important sets of samples belonging to a particular class \(B_i\) is obtained, and a threshold parameter \(\epsilon \) is introduced to differentiate noisy features from important features. The feature set \(\mathcal {C}(B_i)\) representing the class \(B_i\) is formed with only those features \(F_j\) that have \(P(F_{j}|B_{i})\) values greater than or equal to \(\epsilon \). So, the set \(\mathcal {C}(B_i)\) will contain a feature \({F}_j\) only if it is found to be important for a sufficient fraction of the samples of \(B_i\) and thus reflects the relevant characteristics of the virus class \(B_i\). Let the significant characteristics of a pair of classes, say \(\{B_{i},B_{r} \}\), be represented by the set \(\mathcal {C}(\{B_{i},B_{r} \})\). This set should contain those features which can efficiently reflect the properties of both the classes \(B_{i}\) and \(B_{r}\). Hence, \(\mathcal {C}(\{B_{i},B_{r} \})\) is formed by taking the intersection of the two sets \(\mathcal {C}(B_i)\) and \(\mathcal {C}(B_r)\).
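The construction of the class-wise and class-pair feature sets may be sketched as follows. The sketch assumes that \(P(F_{j}|B_{i})\) is estimated as the fraction of samples of \(B_i\) whose important feature set contains \(F_j\), and that important sets are given as Python sets of 0-based feature indices; the function names and the argument `m` (the number of features) are hypothetical.

```python
from collections import Counter


def class_feature_set(important_sets_of_class, eps, m):
    """C(B_i): features with estimated P(F_j | B_i) >= eps.

    important_sets_of_class holds one set of feature indices per sample of B_i.
    """
    counts = Counter(j for s in important_sets_of_class for j in s)
    n = len(important_sets_of_class)
    # Counter returns 0 for features that never occur, so eps = 0 keeps all m features.
    return {j for j in range(m) if counts[j] / n >= eps}


def pair_feature_set(sets_i, sets_r, eps, m):
    """C({B_i, B_r}): intersection of C(B_i) and C(B_r)."""
    return class_feature_set(sets_i, eps, m) & class_feature_set(sets_r, eps, m)
```

A feature therefore survives for a class-pair only when it passes the threshold \(\epsilon \) in both classes.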

Let us consider a set of t modalities \(\mathcal {M}=\{\mathcal {M}_1,\cdots , \mathcal {M}_p,\cdots ,\mathcal {M}_t\}\), where each modality corresponds to a specific texture descriptor evaluated at a particular scale. Given the set of modalities, the proposed method computes the relevance \(\varGamma _p (\{B_i,B_r \}) \) of the feature set \(\mathcal {C}_p(\{B_{i},B_{r}\})\) under modality \(\mathcal {M}_p\) to assess the efficacy of the set in describing important properties of the class-pair \(\{B_{i},B_{r}\}\). In order to quantify relevance, the concept of hypercuboid equivalence partition matrix of rough hypercuboid approach [11] is employed. The relevance \(\varGamma _p (\{B_i,B_r \}) \) of the feature set \(\mathcal {C}_p(\{B_{i},B_{r}\})\) with respect to the pair of classes \(\{B_{i},B_{r}\}\) is obtained as follows [11]:

$$\begin{aligned} \varGamma _p (\{B_i,B_r \}) = 1 - \frac{1}{n_{ir}} \sum _{k = 1}^{n_{ir}}{v_k(\mathcal {C}_p(\{B_{i},B_{r}\}))} \end{aligned}$$
(2)

where, \(n_{ir}\) is the number of samples which belongs to class-pair \(\{B_{i},B_{r}\}\) and

$$\begin{aligned} v_k(\mathcal {C}_p(\{B_{i},B_{r}\})) = \left\{ \begin{array}{ll} 1 &{} \ \text {if } \mathrm{h}_{ik}(\mathcal {C}_p)=1 \ \hbox {and} \ \mathrm{h}_{rk}(\mathcal {C}_p) = 1 \\ 0 &{} \ \hbox {otherwise,} \end{array} \right. \end{aligned}$$
(3)
$$\begin{aligned} \hbox {where}~ \mathrm{h}_{i}(\mathcal {C}_p) = [\mathrm{h}_{ik}(\mathcal {C}_p)]_{1 \times n_{ir}} = \bigcap _{{F}_j \in \mathcal {C}_p} \mathrm{h}_{i}({F}_j). \end{aligned}$$
(4)

Here, \(\mathrm{h}_{ik}({F}_j) \in \{0,1\}\) represents the membership of sample \(y_k\) in i-th equivalence partition induced by the feature \({F}_j\) and is determined as follows:

$$\begin{aligned} \mathrm{h}_{ik}({F}_j)= \left\{ \begin{array}{ll} 1 &{} \ \text {if } \mathrm{L}(B_i) \le y_k({F}_j) \le \mathrm{U}(B_i)\\ 0 &{} \ \hbox {otherwise.} \end{array} \right. \end{aligned}$$
(5)

The interval \([\mathrm{L}(B_i),\mathrm{U}(B_i)]\) is the value range of the feature \(F_j\) for the class \(B_{i}\). So, the feature value \(y_k({F}_j)\) of each object \(y_k\) belonging to the class \(B_i\), for a feature \(F_j \in \mathcal {C}_p(\{B_{i},B_{r}\})\), falls within the interval \([\mathrm{L}(B_i),\mathrm{U}(B_i)]\), which implies that an equivalence partition is nonempty. The intersection between every two such intervals corresponding to the features of the set \(\mathcal {C}_p(\{B_{i},B_{r}\})\) may form an implicit hypercuboid, as indicated by the shaded rectangle in Fig. 1. It encloses the misclassified objects, that is, those that belong to more than one equivalence partition with respect to the attribute set \(\mathcal {C}_p(\{B_{i},B_{r}\})\). The relevance \(\varGamma _p (\{B_i,B_r \})\) of the set \(\mathcal {C}_p(\{B_{i},B_{r}\})\), with respect to the pair of classes \(\{B_{i},B_{r}\}\), depends on the cardinality of the implicit hypercuboid.
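Equations (2) to (5) can be illustrated with a short sketch. It assumes that \(\mathrm{L}(B_i)\) and \(\mathrm{U}(B_i)\) are taken as the minimum and maximum values of a feature over the training samples of class \(B_i\), and that samples are lists of feature values with 0-based feature indices; the function and argument names are hypothetical.

```python
def relevance(samples, labels, cls_i, cls_r, features):
    """Gamma_p({B_i, B_r}) of Eq. (2) for the feature indices in `features`."""
    if not features:
        raise ValueError("the feature set must not be empty")
    # Keep only the n_ir samples that belong to the class-pair.
    pair = [(x, y) for x, y in zip(samples, labels) if y in (cls_i, cls_r)]

    # Assumption: [L(B), U(B)] is the min-max range of each feature within class B.
    intervals = {}
    for cls in (cls_i, cls_r):
        for j in features:
            values = [x[j] for x, y in pair if y == cls]
            intervals[(cls, j)] = (min(values), max(values))

    def in_hypercuboid(x, cls):
        # Eq. (4) and (5): a sample is a member only if every feature lies in range.
        return all(
            intervals[(cls, j)][0] <= x[j] <= intervals[(cls, j)][1]
            for j in features
        )

    # Eq. (3): v_k = 1 when the sample lies in both class hypercuboids.
    overlap = sum(
        1 for x, _ in pair if in_hypercuboid(x, cls_i) and in_hypercuboid(x, cls_r)
    )
    return 1.0 - overlap / len(pair)
```

In this sketch, two classes whose value ranges do not overlap on the selected features yield a relevance of 1, while every sample falling into the shared region lowers the value.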

Fig. 1. Example of rough hypercuboids in two dimensions: two class hypercuboids corresponding to upper approximations of Dengue and Influenza virus classes, along with implicit hypercuboid (boundary region) denoted by the shaded region.

So, it can be observed from (2) that the relevance increases with decrease in the cardinality of implicit hypercuboids. If \(\varGamma _p (\{B_i,B_r \})=1\), then no implicit hypercuboid is formed and both the classes \(B_{i}\) and \(B_{r}\) can be defined precisely using the knowledge of \(\mathcal {C}_p(\{B_{i},B_{r}\})\). On the other hand, if \(\varGamma _p (\{B_i,B_r \})=0\), then both the classes cannot be defined using the information of \(\mathcal {C}_p(\{B_{i},B_{r}\})\). However, if \(\varGamma _p (\{B_i,B_r \}) \in (0,1)\), then \(B_{i}\) and \(B_{r}\) can be approximated using the feature set \(\mathcal {C}_p(\{B_{i},B_{r}\})\). The most relevant feature set \(\tilde{\mathcal {C}}_{ir}\) for the class-pair \(\{B_i,B_r \}\) is formed by considering the relevance value of the feature set \({\mathcal {C}}_p (\{B_{i},B_{r}\})\), corresponding to each of the t modalities, as follows:

$$\begin{aligned} \tilde{\mathcal {C}}_{ir} = \mathop {\mathrm {arg\,max}}\limits _{\mathcal {C}_p(\{B_{i},B_{r}\})} \{\varGamma _p (\{B_i,B_r \})\}. \end{aligned}$$
(6)

Finally, the feature set \(\mathcal {D}\), corresponding to all the virus classes, is formed as

$$\begin{aligned} \mathcal {D}=\bigcup \tilde{\mathcal {C}}_{ir}. \end{aligned}$$
(7)

The SVM is used to predict the texture patterns present in TEM images of test samples of virus particles, based on the final feature set \(\mathcal {D}\) obtained using (7).
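The selection in (6) and the union in (7) can be sketched as below. The sketch assumes that the class-pair feature sets and their relevance values have already been computed for every modality and are stored in dictionaries keyed by `(modality, class_i, class_r)`; since features of different modalities are distinct, each selected feature is tagged with its modality. Ties in relevance are broken in favour of the modality listed first, which is an assumption, and all names are hypothetical.

```python
from itertools import combinations


def select_features(classes, modalities, pair_features, pair_relevance):
    """Return (best modality per class-pair, final feature set D).

    pair_features[(p, i, r)]  : set of feature indices C_p({B_i, B_r})
    pair_relevance[(p, i, r)] : relevance Gamma_p({B_i, B_r})
    """
    best = {}
    final = set()
    for i, r in combinations(classes, 2):
        # Eq. (6): keep the feature set of the modality with maximum relevance.
        p_best = max(modalities, key=lambda p: pair_relevance[(p, i, r)])
        best[(i, r)] = p_best
        # Eq. (7): union over all class-pairs, tagged by modality.
        final |= {(p_best, j) for j in pair_features[(p_best, i, r)]}
    return best, final
```

The resulting set of (modality, feature) pairs indicates which histogram bins of which descriptor are passed on to the classifier.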

3 Performance Analysis

In order to establish the efficacy of the proposed method in classifying the virus particles present in TEM images, extensive experiments are conducted on the Virus image database and the corresponding results are presented in this section.

3.1 Algorithms Compared

The proficiency of the proposed method is validated through comparison with several texture descriptors, which are LBP [16], LBP\(^\mathrm{ri}\) [15], LBP\(^\mathrm{riu2}\) [15] and CoALBP [14], obtained at different scales. Here, \(\mathcal {S}_1\): scale 1, \(\mathcal {S}_2\): scale 2, \(\mathcal {S}_3\): scale 3, \(\mathcal {S}_4\): scale 4, \(\mathcal {S}_{123}\): concatenation of \(\mathcal {S}_1\), \(\mathcal {S}_2\) and \(\mathcal {S}_3\), and \(\mathcal {S}_{124}\): concatenation of \(\mathcal {S}_1\), \(\mathcal {S}_2\) and \(\mathcal {S}_4\) are considered. In the current study, the descriptors along with the corresponding scales, are chosen arbitrarily and therefore, any other sets of descriptors are equally compatible to be used in the proposed descriptor selection method. In case of CoALBP, 4-neighborhood is considered, while 8-neighborhood is considered for the rest.

The performance of several existing approaches for virus classification is analyzed with reference to the proposed method, which include Haralick textural features [5], Fourier RDP (FRDP) [9], dominant LBP (DLBP) [10], discriminative features for texture description (DFTD) [4], LBP+RDP [9], NewH [13], Fusion [13] and Jeffreys/Stein [6]. Here, 10-fold cross-validation is performed to evaluate the classification accuracy of related approaches as well as proposed method. The comparative performance of different algorithms is studied through box-and-whisker plots, tables of means, medians, standard deviations, and \(\mathrm{p}\)-values computed through both paired-t and Wilcoxon signed-rank tests, with 95% confidence level. In box-and-whisker plots, the central line on each box represents the median, the upper and lower quartiles are depicted by upper and lower boundaries of the box, respectively. The whiskers are drawn from mean to three standard deviations, so that extreme data points are also included. The outliers are plotted individually, denoted as ‘+’.

3.2 Description of Data Set

The effectiveness of the existing approaches as well as proposed method is studied through evaluation on the real-life Virus data set [8]. The samples of the database are imaged using negative stain TEM. The set includes 15 different virus classes, each of which is represented by 100 TEM images. Different viruses exhibit different structural properties. However, the diameter or cross-section remains almost constant within a particular virus class. Usually, the diameter varies from 25 nm to 270 nm depending on the morphology of the viruses. The virus particles, which are presented in the data set, include Dengue, Cowpox, Ebola, Influenza, Adenovirus, Astrovirus, Rotavirus, Norovirus, Crimean-Congo Haemorrhagic Fever, Lassa, Marburg, Orf, Papilloma, Rift Valley, and WestNile.

Fig. 2. Effect of energy E and threshold \(\epsilon \) on classification accuracy.

Fig. 3. Performance of different local texture descriptors and proposed method.

Table 1. Classification accuracy of various local descriptors and proposed method

3.3 Optimum Values of Different Parameters

In the current study, the feature set \(\mathcal {C}(B_i)\), representing the important properties of the class \(B_i\), is obtained for a particular value of Energy E and threshold \(\epsilon \). So, the values of both the parameters have an influence on the performance of the proposed method in classifying the virus particles into multiple classes.

The value of E is varied from 0.50 to 1.00 at an interval of 0.05, while the value of \(\epsilon \) is varied from 0.00 to 0.70 at an interval of 0.10, and the corresponding classification accuracy of the proposed method is noted in order to obtain the optimum values of the parameters. The SVM classifier with linear kernel is applied on virus TEM images, and the effect of the values of E and \(\epsilon \) on 10-fold classification accuracy is reported in Fig. 2. The results depicted in Fig. 2 demonstrate that the accuracy of the proposed method increases with increase in energy E and decrease in threshold \(\epsilon \), and the highest classification accuracy is obtained at \({E}=0.95\) and \(\epsilon =0.30\), which validates the concept of energy and threshold defined in the current study. While low energy provides a restricted representation of the important feature sets of samples, high energy signifies more information captured from the images, leading to a descriptive representation of the sets. The features that occur in the important feature sets only incidentally, with a low probability of occurrence, are removed by incorporating the threshold parameter \(\epsilon \) in the algorithm. So, for the proposed approach, the values of parameters E and \(\epsilon \) are fixed at 0.95 and 0.30, respectively. Hence, a precise representation of the set \(\mathcal {C}(B_i)\), reflecting the characteristics of the class \(B_i\), is obtained with the features that preserve 95% of the total energy in at least 30% of the samples of each \(B_{i}\).
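The parameter sweep described above can be sketched as a simple grid search. The grid boundaries and step sizes follow the text; the callable `evaluate`, which is expected to return the 10-fold classification accuracy for a given pair \(({E},\epsilon )\), is a hypothetical placeholder for the full pipeline.

```python
def parameter_grid():
    """All (E, epsilon) pairs: E in 0.50..1.00 step 0.05, epsilon in 0.00..0.70 step 0.10."""
    energies = [round(0.50 + 0.05 * i, 2) for i in range(11)]
    thresholds = [round(0.10 * i, 2) for i in range(8)]
    return [(e, t) for e in energies for t in thresholds]


def grid_search(evaluate):
    """Return the (E, epsilon) pair with the highest value of evaluate(E, epsilon)."""
    return max(parameter_grid(), key=lambda pair: evaluate(*pair))
```

The grid contains 11 values of E and 8 values of \(\epsilon \), that is, 88 parameter combinations in total.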

3.4 Comparison with Existing Approaches

In general, a specific descriptor, obtained at a particular scale, may be used to describe the textural properties of all the TEM images present in the Virus data set. However, the proposed method identifies class-pair specific modality to capture the inherent characteristics of each of the virus classes.

In order to validate the significance of class-pair relevant modalities over uniform modalities, extensive experiments are carried out on the Virus data set, considering fifteen modalities corresponding to four local descriptors LBP, LBP\(^\mathrm{ri}\), LBP\(^\mathrm{riu2}\) and CoALBP, evaluated at both individual and concatenated scales. Figure 3 and Table 1 report the classification accuracy achieved by the proposed method as well as various local texture descriptors on the samples of the Virus database. The results presented in Fig. 3 and Table 1 reveal that the highest mean and median values are achieved by the proposed method, irrespective of the descriptors and scales considered. Also, statistical significance analysis reveals that the proposed method attains significantly lower p-values in all the 44 cases. In this regard, it is to be mentioned that the proposed method attains 82.07% accuracy on the Virus database with 2327 features, selected from the initial feature sets of fifteen modalities corresponding to the four local descriptors, evaluated at four different scales. If the feature sets of all the modalities are concatenated, then 80% accuracy can be achieved with 4280 features. Thus, the proposed method can efficiently identify class-pair specific modalities to describe each pair of classes with fewer features.

Finally, the performance of several existing approaches for virus classification is analyzed with reference to the proposed method, which include Haralick textural features [5], FRDP [9], DLBP [10], DFTD [4], LBP+RDP [9], NewH [13], Fusion [13] and Jeffreys/Stein [6]. In these methods, important information is captured from the TEM images to categorize the virus samples into one of the fifteen known virus classes. The performance of the proposed approach with reference to different existing methods is analyzed in Fig. 4. It is evident from Fig. 4 that the methods corresponding to Haralick features, DLBP, DFTD and FRDP exhibit poor performance in recognizing viruses from TEM images. On the other hand, methods like LBP+RDP, NewH, Fusion, and Jeffreys/Stein show improved performance. However, the proposed method attains the highest classification accuracy on the Virus data set with respect to all the existing methods.

Fig. 4. Analysis of performance of the proposed approach and existing methods.

4 Conclusion

In the current study, a new method is developed for automatic recognition of virus particles from negative stain TEM images. The main contribution of the paper lies in considering a relevant modality corresponding to each pair of classes present in the Virus database, rather than considering uniform modalities for all the classes. The relevance of each modality is computed based on the hypercuboid equivalence partition matrix of the rough hypercuboid approach. A subset of important features is selected for each of the relevant modalities by reducing the impact of both noisy pixels present in a virus image and noisy virus images present in a particular virus class. The proficiency of the proposed approach with respect to different existing methods is established on a real-life Virus image data set.