
1 Introduction

Effective and efficient image analysis plays an important role in computer-aided diagnosis (CAD) in digital pathology. Traditional image analysis algorithms [10, 17] and deep learning based methods [4, 11] have achieved satisfactory performance in many applications, but they usually require numerous labeled pathology images to learn a robust and stable model. Manually labeling images by doctors or pathologists is labor-intensive, time-consuming, and even error-prone. In addition, differentiating many disease grades relies on cellular information such as shape, area, and nuclear and cytoplasm appearance [18, 20], so individual cell analysis can be significantly beneficial for analyzing pathology images. Although advances in cell detection and segmentation enable cell information extraction, the task remains very challenging because (i) a single digitized specimen usually contains hundreds or thousands of cells, and (ii) segmented cells usually exhibit low inter-class but high intra-class variation and contain considerable image noise. To address these issues, in this paper we focus on using a hashing tool to encode each cell into binary codes for pathology image classification.

Hashing encodes high-dimensional data into compact binary codes while preserving the similarity among neighbors. Owing to its computational and storage efficiency, it is widely used to retrieve nearest neighbors under a given similarity measure in large-scale databases [7, 15]. Over the past few years, many hashing algorithms [9, 12,13,14] have been proposed in the literature. Generally, these algorithms fall into two major groups: (i) unsupervised hashing, which explores intrinsic low-dimensional structures of the data to preserve similarity without any supervision; and (ii) supervised hashing, which uses semantic information to assist retrieval and search. Due to the semantic gap, supervised hashing algorithms are usually employed to classify pathology images. Several supervised hashing algorithms [6, 7, 19, 20] have shown classification performance superior to traditional methods; however, they usually suffer from one or both of the following major limitations: (i) directly encoding the whole pathology image, rather than individual cells, into binary codes may fail to effectively capture cellular features; (ii) a large amount of labeled data is required to achieve promising accuracy.
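As a toy illustration of why binary codes make neighbor retrieval cheap (our example, not part of the paper's method): with codes in hand, search reduces to XOR-and-popcount Hamming distances.

```python
import numpy as np

def hamming_search(query_code, database_codes, k=5):
    """query_code: (r,) array in {0, 1}; database_codes: (n, r) array.
    Returns indices of the k codes closest in Hamming distance."""
    d = (query_code[None, :] != database_codes).sum(axis=1)  # XOR + popcount
    return np.argsort(d)[:k]
```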

Fig. 1. The flowchart of the proposed framework for pathology image classification. (In the hashing histogram, each vector is a single histogram generated from the binary codes of all cells in one whole image; each cell corresponds to one position determined by its binary codes, and the number of cells sharing the same binary codes gives the value of the corresponding bin.)

Motivated by the aforementioned observations, in this paper we propose a novel cell-based framework that uses only a few labeled images for disease classification. Specifically, we propose a semi-supervised hashing model that encodes each cell into a set of binary codes; the model utilizes semantic information while exploring the intrinsic low-dimensional structure of cells. Next, we map the binary codes of all cells in one whole image into a single histogram vector, upon which we train a popular classifier, the support vector machine (SVM), for image (disease) classification. Fig. 1 shows the main idea of the proposed framework. Experiments on thousands of lung cancer images demonstrate the effectiveness and efficiency of the proposed framework.

2 Methodology

2.1 Semi-supervised Kernel Discrete Hashing

Definitions and Notations: Given data \(\mathbf {X}\in \mathbb {R}^{n\times d}\) consisting of \(n_{1}\) labeled points \(\mathbf {X}_{l}\in \mathbb {R}^{n_{1}\times d}\) and \(n_{2}\) unlabeled points \(\mathbf {X}_{u}\in \mathbb {R}^{n_{2}\times d}\), where \(n_{1}+n_{2}=n\). Let \(\phi : \mathbb {R}^{d}\rightarrow \mathcal {H}\) be a kernel mapping from the original space to the kernel space, where \(\mathcal {H}\) is a Reproducing Kernel Hilbert Space (RKHS) with kernel function \(\kappa (\mathbf {x},\mathbf {y})=\phi (\mathbf {x})^T\phi (\mathbf {y})\). Selecting \(m\) \((m\ll n_{1})\) data points from the labeled data as anchors (or using K-means to generate \(m\) anchor points), and given a labeled training data point \(\mathbf {x}\), as in [7, 13] its \(k\)-th \((1\le k\le r)\) hashing function is defined as:

$$\begin{aligned} h_{k}(\mathbf {x})=sgn(\sum _{j=1}^m \kappa _{l}(\mathbf {x}_{j},\mathbf {x})a_{kj}-b_{k})=sgn(\mathbf {a}_{k}\bar{\kappa }_{l}(\mathbf {x})), \end{aligned}$$
(1)

where \(\mathbf {x}\in \mathbf {X}\), \(\mathbf {a}_{k}\) is the \(k\)-th row vector of a projection matrix \(\mathbf {A}\in \mathbb {R}^{r\times m}\), and \(b_{k}=\frac{1}{n_{1}}\sum _{i=1}^{n_{1}}\sum _{j=1}^{m}\kappa _{l}(\mathbf {x}_{j},\mathbf {x}_{i})a_{kj}\), which implies \(\sum _{i=1}^{n_{1}}\bar{\kappa }_{l}(\mathbf {x}_{i})=0\). Here \(\bar{\kappa }_{l}(\mathbf {x}_{i})\) is a column of \(\mathbf {\bar{K}}_{l}\in \mathbb {R}^{m\times n_{1}}\), which is constructed from \(\mathbf {X}_{l}\) and the \(m\) anchors. As in [7], we require \(h_{k}(\mathbf {x}_{i})=h_{k}(\mathbf {x}_{j})\) if \((\mathbf {x}_{i}, \mathbf {x}_{j})\in \mathcal {M}\) and \(h_{k}(\mathbf {x}_{i})\ne h_{k}(\mathbf {x}_{j})\) if \((\mathbf {x}_{i}, \mathbf {x}_{j})\in \mathcal {C}\), where the set \(\mathcal {M}\) contains the pairs sharing the same class label and the set \(\mathcal {C}\) contains the pairs with different class labels. Let the \(r\)-bit hash code of \(\mathbf {x}\) be \(code_{r}(\mathbf {x})=\left[ h_{1},h_{2},\cdots ,h_{r} \right] \). Then if \((\mathbf {x}_{i}, \mathbf {x}_{j})\in \mathcal {M}\), \(code_{r}(\mathbf {x}_{i})\circ code_{r}(\mathbf {x}_{j})=r\); if \((\mathbf {x}_{i}, \mathbf {x}_{j})\in \mathcal {C}\), \(code_{r}(\mathbf {x}_{i})\circ code_{r}(\mathbf {x}_{j})=-r\), where \(\circ \) denotes the inner product. The pairwise label matrix \(\mathbf {S}\in \mathbb {R}^{n_{1}\times n_{1}}\) is defined as:

$$\begin{aligned} s_{ij}=\left\{ \begin{matrix} 1 &{} (x_{i},x_{j})\in \mathcal {M} \\ -1 &{} (x_{i},x_{j})\in \mathcal {C} \\ 0 &{} otherwise. \end{matrix}\right. \end{aligned}$$
(2)
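To make these definitions concrete, the following NumPy sketch evaluates the hash functions of Eq. (1) and builds the pairwise matrix of Eq. (2). The RBF kernel is our assumption (the paper only requires the same kernel type as KSH), and all variable names are illustrative; `kappa_mean` is the column mean of the labeled kernel matrix, i.e., `Kl.mean(axis=1)`, which realizes the bias \(b_{k}\).

```python
import numpy as np

def kernel_matrix(anchors, X, sigma=1.0):
    # kappa(x_j, x): RBF kernel between m anchors and n points -> m x n
    d2 = ((anchors[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma ** 2))

def hash_codes(A, K, kappa_mean):
    """Eq. (1): h_k(x) = sgn(a_k kbar(x)). Subtracting kappa_mean centers
    the kernel features so that sum_i kbar(x_i) = 0 on the labeled set."""
    Kbar = K - kappa_mean[:, None]
    return np.where(A @ Kbar >= 0, 1, -1)    # r x n codes in {-1, +1}

def pairwise_label_matrix(y):
    # Eq. (2): s_ij = 1 for same-label pairs (set M), -1 for different
    # labels (set C); with a fully labeled y no zero entries arise
    y = np.asarray(y)
    return np.where(y[:, None] == y[None, :], 1, -1)
```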

Formulation: Based on the pairwise label matrix in Eq. (2), we attempt to learn a projection matrix \(\mathbf {A}\) such that \(code_{r}(\mathbf {x}_{i})\circ code_{r}(\mathbf {x}_{j})\) approaches \(r\) for pairs in \(\mathcal {M}\) and \(-r\) for pairs in \(\mathcal {C}\), where \(\mathbf {x}_{i}\) and \(\mathbf {x}_{j}\) are labeled data. In addition, we want the \(r\) hash bits to be mutually uncorrelated, so that the redundancy among different bits is minimized, and we maximize the information of each bit by enforcing \(\sum _{i=1}^{n}h_{k}(\mathbf {x}_{i})=0\) \((1\le k\le r)\). Moreover, we also want to exploit the intrinsic structure of all training data, i.e., both the labeled and unlabeled kernelized data. We therefore propose the following optimization model:

$$\begin{aligned} \begin{array}{cc} \underset{\mathbf {A}}{min}\ \left\| \mathbf {H}^T \mathbf {H}-\mathbf {S} \right\| _{F}^2 -\eta Tr\left\{ \mathbf {A}\mathbf {\bar{K}}\mathbf {\bar{K}}^T\mathbf {A}^T \right\} ,\\ s.t. \ \mathbf {H}\mathbf {H}^T=n_{1}\mathbf {I}_{r}, \mathbf {H}\mathbf {1}_{n_{1}}=0, \end{array} \end{aligned}$$
(3)

where \(\mathbf {H}= sgn(\mathbf {A}\mathbf {\bar{K}}_{l})\) and \(\mathbf {1}_{n_{1}}\in \mathbb {R}^{n_{1}}\) is a column vector with all elements being one. The constraints are imposed only on the labeled data because they are more important than the unlabeled data. Since the problem in Eq. (3) is highly non-convex, it is difficult to solve directly. To learn the projection matrix \(\mathbf {A}\), Eq. (3) is usually relaxed with a symmetric relaxation strategy [13, 15], which yields:

$$\begin{aligned} \begin{array}{cc} \underset{\mathbf {A}}{max}\ Tr\left\{ \mathbf {A}\mathbf {\bar{K}}_{l}\mathbf {S}\mathbf {\bar{K}}_{l}^T\mathbf {A}^T+\eta \mathbf {A}\mathbf {\bar{K}}\mathbf {\bar{K}}^T\mathbf {A}^T \right\} ,\\ s.t. \mathbf {A}\mathbf {\bar{K}}_{l}\mathbf {\bar{K}}_{l}^T\mathbf {A}^T=n_{1}\mathbf {I}_{r}, \end{array} \end{aligned}$$
(4)

Let \(\mathbf {W}=\mathbf {\bar{K}}_{l}\mathbf {S}\mathbf {\bar{K}}_{l}^T+\eta \mathbf {\bar{K}}\mathbf {\bar{K}}^T\), and then Eq. (4) is equivalent to:

$$\begin{aligned} \begin{array}{cc} \underset{\mathbf {A}}{max}\ Tr\left\{ \mathbf {A}\mathbf {W}\mathbf {A}^T \right\} ,\\ s.t.\ \mathbf {A}\mathbf {\bar{K}}_{l}\mathbf {\bar{K}}_{l}^T\mathbf {A}^T=n_{1}\mathbf {I}_{r}, \end{array} \end{aligned}$$
(5)

Let \(\mathbf {A}\mathbf {\bar{K}}_{l}=\mathbf {Y}\); then it is easy to obtain \(\mathbf {A}=\mathbf {Y}\mathbf {\bar{K}}_{l}^{T}(\mathbf {\bar{K}}_{l}\mathbf {\bar{K}}_{l}^T)^{-1}\). In practice we compute \(\mathbf {A}=\mathbf {Y}\mathbf {\bar{K}}_{l}^{T}(\mathbf {\bar{K}}_{l}\mathbf {\bar{K}}_{l}^T+\epsilon \mathbf {I}_{m})^{-1}\) to attain stable solutions. Substituting this into Eq. (5), we have:

$$\begin{aligned} \begin{array}{cc} \underset{\mathbf {Y}}{max}\ Tr\left\{ \mathbf {Y}\mathbf {\bar{K}}_{l}^{T}(\mathbf {\bar{K}}_{l}\mathbf {\bar{K}}_{l}^{T})^{-1}\mathbf {W}(\mathbf {\bar{K}}_{l}\mathbf {\bar{K}}_{l}^{T})^{-1} \mathbf {\bar{K}}_{l}\mathbf {Y}^T\right\} ,\\ s.t.\ \mathbf {Y}\mathbf {Y}^T=n_{1}\mathbf {I}_{r}, \mathbf {Y}\mathbf {1}_{n_{1}}=0. \end{array} \end{aligned}$$
(6)
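As a minimal sketch (assuming \(\mathbf {Y}\) and \(\mathbf {\bar{K}}_{l}\) are available as arrays), the \(\epsilon \)-stabilized recovery of \(\mathbf {A}\) from \(\mathbf {Y}\) reads:

```python
import numpy as np

def recover_projection(Y, Kl_bar, eps=1e-4):
    """A = Y Kbar_l^T (Kbar_l Kbar_l^T + eps I_m)^{-1}.
    Y is r x n1, Kl_bar is m x n1; solve() avoids an explicit inverse."""
    m = Kl_bar.shape[0]
    G = Kl_bar @ Kl_bar.T + eps * np.eye(m)   # m x m, well-conditioned
    return np.linalg.solve(G, (Y @ Kl_bar.T).T).T  # solves A G = Y Kbar_l^T
```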

If we solved Eq. (6) using the ‘relaxation + rounding’ schemes employed in [7, 15], the discrete matrix \(\mathbf {H}\) and its relaxed continuous counterpart \(\mathbf {Y}\) would accumulate large quantization errors, thereby decreasing the quality of the binary codes. To reduce these errors, similar to [13], we preserve the discrete constraint \(\mathbf {H}=sgn(\mathbf {Y})\) in the objective and reformulate Eq. (6) as:

$$\begin{aligned} \begin{array}{cc} \underset{\mathbf {Y},\mathbf {H}}{max}\ Tr\left\{ \mathbf {H}\mathbf {\bar{K}}_{l}^{T}(\mathbf {\bar{K}}_{l}\mathbf {\bar{K}}_{l}^{T})^{-1}\mathbf {W}(\mathbf {\bar{K}}_{l}\mathbf {\bar{K}}_{l}^{T})^{-1} \mathbf {\bar{K}}_{l}\mathbf {Y}^T\right\} ,\\ s.t.\ \mathbf {Y}\mathbf {Y}^T=n_{1}\mathbf {I}_{r}, \mathbf {Y}\mathbf {1}_{n_{1}}=0, \mathbf {H}=sgn(\mathbf {Y}), \end{array} \end{aligned}$$
(7)

Equation (7) is our proposed semi-supervised kernel discrete hashing model.

Let \(\mathbf {M}=\mathbf {\bar{K}}_{l}^{T}(\mathbf {\bar{K}}_{l}\mathbf {\bar{K}}_{l}^{T})^{-1}\mathbf {W}(\mathbf {\bar{K}}_{l}\mathbf {\bar{K}}_{l}^{T})^{-1} \mathbf {\bar{K}}_{l}\), and then Eq. (7) can be rewritten as:

$$\begin{aligned} \begin{array}{cc} \underset{\mathbf {Y},\mathbf {H}}{max}\ Tr\left\{ \mathbf {H}\mathbf {M}\mathbf {Y}^T\right\} ,\\ s.t.\ \mathbf {Y}\mathbf {Y}^T=n_{1}\mathbf {I}_{r}, \mathbf {Y}\mathbf {1}_{n_{1}}=0, \mathbf {H}=sgn(\mathbf {Y}). \end{array} \end{aligned}$$
(8)

Since Eq. (8) is similar to the objective function of KSDH_H in [13], it can be solved easily. For clarity, we present the optimization procedure of Eq. (8) in Algorithm 1, which we name semi-supervised kernel discrete hashing (SSKDH). It is worth noting that when \(\eta =0\), SSKDH degenerates to KSDH_H.

Algorithm 1. Semi-supervised kernel discrete hashing (SSKDH).
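The paper defers the details of Algorithm 1 to the KSDH_H solver of [13]. As a hedged stand-in, the sketch below builds \(\mathbf {M}\) and alternates a Procrustes-type update of \(\mathbf {Y}\) with the rounding \(\mathbf {H}=sgn(\mathbf {Y})\); the spectral initialization and SVD-based \(\mathbf {Y}\)-step are standard for this constraint set and are our assumptions, not a transcription of the authors' Algorithm 1.

```python
import numpy as np

def build_M(Kl, K, S, eta=1e-4, eps=1e-4):
    """M of Eq. (8): Kbar_l^T G^{-1} W G^{-1} Kbar_l, where
    W = Kbar_l S Kbar_l^T + eta * Kbar Kbar^T and G = Kbar_l Kbar_l^T
    is eps-regularized for stability as in Sect. 2.1."""
    m = Kl.shape[0]
    W = Kl @ S @ Kl.T + eta * (K @ K.T)
    G = Kl @ Kl.T + eps * np.eye(m)
    P = np.linalg.solve(G, Kl)              # G^{-1} Kbar_l  (m x n1)
    return P.T @ W @ P                      # n1 x n1, symmetric

def sskdh_solve(M, r, n_iter=50):
    """Alternating maximization of Tr(H M Y^T) subject to
    Y Y^T = n1 I_r, Y 1 = 0, H = sgn(Y)."""
    n1 = M.shape[0]
    J = np.eye(n1) - np.ones((n1, n1)) / n1  # centering enforces Y 1 = 0
    Mc = J @ M @ J
    w, V = np.linalg.eigh(Mc)                # ascending eigenvalues
    top = np.argsort(w)[::-1][:r]
    Y = np.sqrt(n1) * V[:, top].T            # spectral init: Y Y^T = n1 I_r
    H = np.where(Y >= 0, 1.0, -1.0)
    for _ in range(n_iter):
        # Y-step: rows of H @ Mc are orthogonal to 1, so the polar factor
        # sqrt(n1) U V^T satisfies both Y Y^T = n1 I_r and Y 1 = 0
        U, _, Vt = np.linalg.svd(H @ Mc, full_matrices=False)
        Y = np.sqrt(n1) * U @ Vt
        H_new = np.where(Y >= 0, 1.0, -1.0)
        if np.array_equal(H_new, H):         # codes stabilized
            break
        H = H_new
    return Y, H
```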

2.2 Hash Codes Based Classification

After learning the projection matrix \(\mathbf {A}\) and encoding each cell into \(r\)-bit binary codes, we could intuitively classify each cell and then use majority voting to obtain the label of each image. Empirically, however, this strategy is usually ineffective for encoded cells, because the small number of training cells leads to poor cell-level classification accuracy (in the binary classification problem, the accuracy on one class is usually above 50% while the accuracy on the other is below 50%). To obtain robust and stable classification accuracy, we instead map all cell information of one image into a \(2^{r}\)-dimensional histogram vector \(\mathbf {v}^{I} \in \mathbb {R}^{2^r}\), which aggregates the cell information to represent the whole image. The \(r\)-bit binary code of a cell is transformed into an integer corresponding to one index of the \(2^{r}\)-dimensional vector, and the coefficient at that index is the number of cells sharing the same integer. For example, a cell \(\mathbf {x}\) with \(r\)-bit binary code \(\left[ h_{1}, h_{2}, h_{3},\cdots , h_{r}\right] \) corresponds to the index:

$$\begin{aligned} idx=1+\sum _{k=1}^{r}h_{k}2^{k-1}. \end{aligned}$$
(9)

Suppose that \(p\) cells in one image correspond to the index \(idx\); then \(\mathbf {v}^{I}(idx)=p\).
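A sketch of this histogram construction follows. Since \(sgn\) outputs \(\pm 1\), we map bits to \(\{0,1\}\) before applying Eq. (9) so that the indices fall in \([1, 2^{r}]\); this mapping is our reading, as the text does not state it explicitly.

```python
import numpy as np

def image_histogram(cell_codes):
    """cell_codes: (num_cells, r) array in {-1, +1}. Returns the
    2^r-dimensional histogram v^I of Sect. 2.2 (unnormalized)."""
    bits = (cell_codes > 0).astype(int)      # map -1 -> 0, +1 -> 1
    r = bits.shape[1]
    idx = 1 + bits @ (2 ** np.arange(r))     # Eq. (9): indices in [1, 2^r]
    return np.bincount(idx - 1, minlength=2 ** r).astype(float)
```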

After transforming all labeled images into \(2^{r}\)-dimensional histogram vectors, we normalize each vector to unit length and then employ an SVM to learn the model parameters. For clarity, we present the detailed procedure in Algorithm 2, which we name hashing codes for classification (HCC). Note that HCC is different from [11, 20], because HCC essentially adopts an SVM to weight each cell in one whole image.

Algorithm 2. Hashing codes for classification (HCC).
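Putting HCC together, a hedged end-to-end training sketch follows; we stand in for LIBLINEAR's \(\ell _{2}\)-regularized logistic regression with scikit-learn's liblinear-backed LogisticRegression, and the C grid for the 5-fold cross-validation is our assumption.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

def train_hcc(histograms, image_labels):
    # unit-normalize each per-image histogram, then fit the classifier
    X = np.vstack([v / (np.linalg.norm(v) + 1e-12) for v in histograms])
    grid = {"C": [0.01, 0.1, 1, 10, 100]}    # search range is illustrative
    clf = GridSearchCV(LogisticRegression(penalty="l2", solver="liblinear"),
                       grid, cv=5)
    clf.fit(X, image_labels)
    return clf.best_estimator_
```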

3 Experiments

To evaluate the proposed framework, we conduct extensive experiments on a lung cancer image dataset with two types of disease: adenocarcinoma and squamous cell carcinoma. We collect the lung cancer images from The Cancer Genome Atlas (TCGA) [5]; the dataset consists of 610 adenocarcinoma and 630 squamous images, containing about 589K adenocarcinoma and 1248K squamous cells, respectively. We detect and segment these cells using the method in [16] and then crop all cells as image patches with the corresponding labels. Finally, we extract 1024-dimensional features from each cell using the HOG and GIST descriptors, respectively.

3.1 Experimental Setting

We compare the proposed hashing algorithm SSKDH against four state-of-the-art supervised hashing algorithms: semi-supervised hashing (SSH) [15], kernel supervised hashing (KSH) [7], supervised discrete hashing (SDH) [12] and kernel supervised discrete hashing (KSDH) [13]. After encoding each cell into a set of binary codes, we utilize HCC to classify images. For a broader comparison, we also report the results of the support vector machine (SVM) with majority voting (MV). We do not report the results of hashing algorithms with MV, because this strategy performs very poorly owing to the small number of training samples. Moreover, we present baseline results obtained by directly applying the SVM and nearest neighbors (NN) to holistic high-dimensional features extracted from the whole image. Specifically, we first detect scale-invariant keypoints in the whole pathology image and then employ SIFT [8] and HOG [2] to extract features around these keypoints; both descriptors are encoded as 2000-dimensional histograms using the bag-of-words (BoW) method [1]. For the parameters \(\eta \) and \(\epsilon \) in SSKDH, we empirically set \(\eta =10^{-4}\) and \(\epsilon =10^{-4}\). For the SVM, we adopt the \(\ell _{2}\)-regularized logistic regression in LIBLINEAR [3] and use 5-fold cross-validation to obtain the best parameters. To construct the kernelized data, we apply K-means to the labeled training cells to generate 500 anchors. All hashing algorithms adopt the same kernel type as KSH.
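For instance, the anchor generation step could look like the following sketch (parameter names are ours):

```python
from sklearn.cluster import KMeans

def make_anchors(labeled_cells, m=500, seed=0):
    # K-means on the labeled training cells, as in Sect. 3.1 (500 anchors)
    km = KMeans(n_clusters=m, n_init=10, random_state=seed).fit(labeled_cells)
    return km.cluster_centers_
```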

We randomly partition the lung cancer image dataset into two parts: a training set with 210 adenocarcinoma and 230 squamous images, and a testing set containing 800 images with around 1198K cells in total. We then randomly select 5 images of each class from the training set as labeled images and regard the remaining training images as unlabeled. Next, we randomly select 200 cells from each training image, giving 2K labeled and 86K unlabeled cells in total. We repeat the above process 20 times and report the average accuracy.

3.2 Results and Analysis

Table 1 reports the classification accuracy of the various algorithms on the test images, along with their training and testing time. Note that the training time of the SVM that uses features extracted from the whole image covers only the learning of the model parameters and excludes the generation of the high-dimensional BoW vectors; similarly, the training time of the algorithms using cell features excludes the cost of cell detection and segmentation. Testing time is the time to classify one test image. As shown, the SVM using cell features with majority voting achieves performance superior to the SVM using features extracted directly from the whole image, which illustrates the strength of cellular features. In addition, all hashing algorithms except SSH obtain higher accuracy with HCC than SVM+MV. Among all hashing algorithms, SSKDH attains the best accuracy, with gains of 1.33% and 2.82% over the best competitor on HOG and GIST features, respectively. The main reason is that SSKDH utilizes both the semantic (high-level) information of the labeled cells and the low-level features of the unlabeled ones.

Table 1. Image classification accuracy (%) of different methods with 5 labeled images per class on 800 test images. (Hashing algorithms encode each cell into 4-bit binary codes. "Image" means directly extracting holistic high-dimensional features from the whole image; "Cell" means extracting features from cells.)

To further demonstrate the strength of the proposed framework, and of SSKDH in particular, Fig. 2 shows the performance of the hashing algorithms as the number of bits varies from 2 to 8. The results again demonstrate the effectiveness and efficiency of SSKDH and HCC. We cap the number of bits at 8 in Fig. 2 because each training image contributes only 200 cells, and more bits can sometimes decrease the performance of the hashing algorithms with HCC. Additionally, all hashing algorithms attain their best or second-best performance at 2 bits, because the images contain only two diseases, which leads to \(rank(\mathbf {S})=2\).

Fig. 2. Classification accuracy vs. the number of bits: (a) cells represented by HOG features; (b) cells represented by GIST features.

4 Conclusion

In this paper, we design a novel cell-based framework that uses only a few labeled images for pathology image classification. First, we propose a novel semi-supervised hashing model, which utilizes the cell information in both labeled and unlabeled images, to encode each cell into binary codes. Then we propose a new image encoding method, namely hashing codes based classification, to map the binary codes of all cells in one whole image into a relatively high-dimensional vector. Finally, we use these vectors to learn an SVM for image classification. Extensive experiments on the lung cancer image dataset demonstrate the effectiveness and efficiency of the proposed framework.