
1 Introduction

Recently, the cell detection problem has attracted increasing interest in the research community. A large number of cell detection methods for small images (with around \(10^{4}\) to \(10^6\) pixels) have been proposed [1–4]. Due to the recent success of deep convolutional neural networks in imaging, several deep neural network based methods have been proposed for cell-related applications in the past few years [2–4]. While these methods have achieved great success on small images, very few of them are ready to be applied to practical whole-slide cell detection, since real whole-slide images usually have \(10^{8}\) to \(10^{10}\) pixels. Directly applying deep learning cell detection methods [2–4] takes several weeks to detect cells in a single whole-slide image, which is prohibitive in practice.

To alleviate this issue, we propose a generalized distributed deep convolutional neural network framework for pixel-wise cell detection. Our framework accelerates any deep convolutional neural network pixel-wise cell detector. We first improve the forward speed of the deep convolutional neural network with the sparse kernel technique; similar techniques are described in [5, 6]. To reduce disk I/O time, we propose a novel asynchronous prefetching technique. The separable iteration behavior further calls for a scalable and communication-efficient distributed and parallel computing framework to accelerate the detection process on whole-slide images. We therefore propose an unbalanced distributed sampling strategy over two spatial dimensions, extending the balanced cutting in [7]. The combination of these techniques yields a speedup of up to 10,000x in practice.

To the best of our knowledge, the research presented in this paper represents the first attempt to develop an extremely efficient deep neural network based pixel-wise cell detection framework for whole-slide images. In particular, it is general enough to work with any deep convolutional neural network on whole-slide imaging. Our technical contributions are summarized as follows: (1) A general sparse kernel neural network model is applied for pixel-wise cell detection, accelerating the forward procedure of deep convolutional neural networks. (2) An asynchronous prefetching technique is proposed to eliminate nearly \(95\,\%\) of the disk I/O time. (3) We propose a scalable and communication-efficient framework to extend our neural network to multi-GPU and cluster environments, dramatically accelerating the entire detection process. Extensive experiments demonstrate the efficiency and effectiveness of our method.

2 Methodology

2.1 Sparse Kernel Convolutional Neural Network

The sparse kernel network takes the whole tile image, instead of a pixel-centered patch, as input and predicts the whole label map with a single pass of the accelerated forward propagation. It uses the same weights as the original network trained in the training stage, and thus generates exactly the same results as the original pixel-wise detector. To achieve this, we incorporate the k-sparse kernel technique [6] for convolution and blended max-pooling layers into our approach. The k-sparse kernels are created by inserting all-zero rows and columns into the original kernels so that every two originally neighboring entries become k pixels apart. In [6], however, it remains unclear how to handle fully connected layers; we complete this in our research. A fully connected layer is treated as a special convolution layer whose kernel size equals the input dimension and whose number of kernels equals the output dimension of the fully connected layer. This special convolution layer generates exactly the same output as the fully connected layer given the same input. The conversion algorithm is summarized in Algorithm 1.

[Algorithm 1: conversion from the original network to the sparse kernel network]
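As an illustration, the k-sparse kernel construction described above can be sketched in a few lines of NumPy. The helper name `k_sparse_kernel` is ours, for illustration only, and is not part of the paper's implementation:

```python
import numpy as np

def k_sparse_kernel(kernel, k):
    """Insert all-zero rows and columns into a 2-D kernel so that
    originally neighboring entries end up k pixels apart."""
    h, w = kernel.shape
    sparse = np.zeros(((h - 1) * k + 1, (w - 1) * k + 1), dtype=kernel.dtype)
    sparse[::k, ::k] = kernel  # original weights land on a stride-k grid
    return sparse

base = np.array([[1, 2],
                 [3, 4]])
print(k_sparse_kernel(base, 2))
# [[1 0 2]
#  [0 0 0]
#  [3 0 4]]
```

Convolving with the sparsified kernel on the full tile reproduces the responses the original kernel would give on a k-subsampled grid, which is what allows one forward pass to cover all pixel positions.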

2.2 Asynchronous Prefetching

Compared with other procedures in the whole cell detection process, e.g. memory transfer between GPU and CPU memory, disk I/O is a bottleneck in the cell detection problem. In this subsection, we describe our asynchronous prefetching technique, which reduces frequent I/O operations while avoiding out-of-memory problems. We first load a relatively large image, referred to as a cached image (e.g., \(4096\times 4096\)), into memory. While detecting cells on the first cached image tile by tile, we immediately start loading the second cached image in another thread. Since reading is usually faster than detection, by the time detection on the first cached image finishes, the second cached image is already loaded; we can start detection on it and immediately load the next cached image. Hence, the reading time of the second cached image, and of all cached images thereafter, is hidden from the overall runtime. Experiments show that this technique hides approximately \(95\,\%\) of the disk I/O time. The speedup is even larger on a cluster, where NFS (Network File System) operations are more time-consuming and most of them are hidden.
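The double-buffering idea above can be sketched with Python threads. Here `load_fn` and `detect_fn` stand in for the actual cached-image reader and the network forward pass; both names are assumptions for illustration, not the paper's API:

```python
import threading
from queue import Queue

def detect_with_prefetch(paths, load_fn, detect_fn, depth=2):
    """Overlap disk reads with detection: a loader thread keeps a small
    queue of cached images filled while the main thread detects cells."""
    queue = Queue(maxsize=depth)

    def loader():
        for path in paths:
            queue.put(load_fn(path))  # blocks if detection falls behind
        queue.put(None)               # sentinel: no more cached images

    threading.Thread(target=loader, daemon=True).start()
    results = []
    while (cached := queue.get()) is not None:
        results.append(detect_fn(cached))
    return results
```

The bounded queue (`depth=2`) is what prevents the insufficient-memory problem: at most two cached images are resident at once, and only the read of the very first image is exposed in the overall runtime.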

2.3 Multi-GPU Parallel and Distributed Computing

When considering distributed optimization, two resources are at play: (1) the amount of processing on each machine, and (2) the communication between machines. Single-machine performance has been optimized in Sects. 2.1 and 2.2. We now describe the unbalanced distributed sampling strategy with two spatial dimensions used in our framework, a gentle extension of [7]. Let \(T=\{(1,1), (1,2), \ldots , (H, W)\}\) be the index set of an image of size \(H\times W\); we aim to sample tiles of size not larger than \(h\times w\).

Unbalanced Partitioning. Let \(S:=\lceil HW/C\rceil \). We first partition the index set \(T\) into a set of blocks \(P^{(1)}, P^{(2)}, \ldots , P^{(C)}\) according to the following criteria:

  1. \(T = \bigcup _{c=1}^C P^{(c)}\),

  2. \(P^{(c')} \bigcap P^{(c'')} = \varnothing \), for \(c' \ne c''\),

  3. \(|P^{(c)}| \le S\),

  4. \(P^{(c)}\) is connected.

Sampling. After partitioning, we sample small tiles from the C different machines and devices. For each \(c\in \{1,\ldots , C\}\), we draw \(\hat{Z}^{(c)}\), a connected subset of \(P^{(c)}\) satisfying \(|\hat{Z}^{(c)}| \le hw\) and \(\hat{Z}^{(c')} \bigcap \hat{Z}^{(c'')} = \varnothing \) for \(c' \ne c''\).

The set-valued mapping \(\hat{Z} = \bigcup _{c=1}^{C} \hat{Z}^{(c)}\) is termed \((C, h, w)\)-unbalanced sampling and is used to fully sample tile images from the entire image. Note that this is not a subsampling process, since all tile images are drawn from the whole slide in one data pass. Because only index sets are transmitted among the machines, the communication cost of network transfer is very low. This distributed sampling strategy also ensures the scalability of the proposed framework, as shown in Sect. 3.4.
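The index-only tiling can be sketched as follows. For illustration we assume blocks are contiguous row strips of the image grid, which keeps every block connected; the paper's actual partition may shape blocks differently. Only coordinates, never pixels, are produced, which is why the communication cost stays low:

```python
import math

def unbalanced_tiles(H, W, C, h, w):
    """Partition the H x W index grid into C connected row strips and
    tile each strip into rectangles no larger than h x w. Returns
    (machine, top, left, bottom, right) tuples with half-open bounds."""
    rows_per_strip = math.ceil(H / C)
    tiles = []
    for c in range(C):
        top = c * rows_per_strip
        bottom = min(top + rows_per_strip, H)
        for r in range(top, bottom, h):
            for s in range(0, W, w):
                tiles.append((c, r, s, min(r + h, bottom), min(s + w, W)))
    return tiles
```

Every pixel index is covered exactly once in one data pass, and each machine `c` processes only tiles carved from its own block, so no pixel data crosses the network.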

3 Experiments

3.1 Experiment Setup

Throughout the experiment section, we use a variant [4, 8] of LeNet [9] as the pixel-wise classifier to show the effectiveness and efficiency of our framework. We implemented our framework based on Caffe [10] and MPI. The original network structure is shown in Table 1 (left). The classifier is designed to classify a \(20\times 20\) patch centered at a specific pixel and predict the probability that the pixel lies in a cell region. Applying Algorithm 1, we obtain the accelerated network shown on the right of Table 1, which detects cells on a tile image of size \(512\times 512\). Since the classifier deals with \(20\times 20\) image patches, we mirror-pad the original \(512\times 512\) tile image to a \(531\times 531\) image.
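The padded size follows directly from the patch size: to classify every pixel of the tile, including border pixels, the tile must grow by \(\text{patch} - 1\) pixels in each spatial dimension. A one-line sketch (the function name is ours, for illustration):

```python
def padded_size(tile_size, patch_size):
    """A patch-based classifier centered on every tile pixel needs
    patch_size - 1 extra mirrored pixels per spatial dimension."""
    return tile_size + patch_size - 1

print(padded_size(512, 20))  # 531, matching the 531 x 531 padded tile
```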

Table 1. Original LeNet Classifier (left) and accelerated forward (right) network architecture. M: the training batch size, N: the testing batch size. Layer type: I - Input, C - Convolution, MP - Max Pooling, ReLU - Rectified Linear Unit, FC - Fully Connected

3.2 Effectiveness Validation

Our framework can be applied to any convolutional neural network for pixel-wise cell detection, e.g., [2–4]. Thus, the effectiveness of our framework depends directly on the performance of the original deep neural network designed for small-scale cell detection. In this subsection, we validate the consistency of results between our framework and the original work [4]. We conduct experiments on 215 tile images of size \(512\times 512\) sampled from NLST whole-slide images, with 83,245 cell object annotations. These tile images are partitioned into three subsets: a training set (143 images), a testing set (62 images) and an evaluation set (10 images). The neural network model was trained on the training set with the original network described in Table 1 (left). We then applied Algorithm 1 to convert the original network into our framework. This experiment was conducted on a workstation with an Intel(R) Xeon(R) E5-2620 v2 @ 2.10 GHz CPU, 32 gigabytes of RAM, and a single Nvidia K40 GPU.

For quantitative analysis, we used a precision-recall-\(F_1\) score evaluation metric to measure the performance of the two methods. Since the proposed method detects the rough cell area, we calculated the raw image moment centroid as the approximate nucleus location. Each detected cell centroid is associated with the nearest ground-truth annotation. A detected cell centroid is considered a True Positive (TP) if the Euclidean distance between it and the ground-truth annotation is less than 8 pixels; otherwise, it is considered a False Positive (FP). Missed ground-truth dots are counted as False Negatives (FN). We compute the \(F_1\) score \(F_1 = 2PR/(P+R)\), where precision \(P=TP/(TP+FP)\) and recall \(R=TP/(TP+FN)\). The precision, recall and \(F_1\) score of the original work and our framework are reported in Table 2.
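The matching and scoring protocol above can be sketched as follows. The greedy nearest-match shown is one reasonable reading of the protocol, not necessarily the authors' exact implementation:

```python
import math

def evaluate(detected, ground_truth, max_dist=8.0):
    """Greedily match each detected centroid to its nearest unmatched
    ground-truth dot; matches within max_dist pixels are true positives."""
    unmatched = list(ground_truth)
    tp = 0
    for cx, cy in detected:
        if not unmatched:
            break
        g = min(unmatched, key=lambda p: math.hypot(p[0] - cx, p[1] - cy))
        if math.hypot(g[0] - cx, g[1] - cy) < max_dist:
            tp += 1
            unmatched.remove(g)  # each annotation is matched at most once
    fp = len(detected) - tp
    fn = len(ground_truth) - tp
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```

For example, one detection near an annotation plus one spurious detection gives \(P=0.5\), \(R=1.0\), \(F_1=2/3\).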

Table 2. Quantitative comparison between original work and our framework

Table 2 also shows the overall runtime (in seconds) and pixel rate (pixels per second). While our framework produces exactly the same results as the original work, the overall speed increased by approximately 400x on small-scale images on a single GPU device. This is expected, since our method removes most of the redundant convolution computation shared among neighboring pixel patches.

Fig. 1. I/O time comparison among memory, file and proposed asynchronous prefetching modes (in seconds)

3.3 Prefetching Speedup

In this subsection, we validate the effectiveness of the proposed asynchronous prefetching technique. Figure 1 shows the disk I/O time comparison among memory, file and prefetching modes on a whole-slide image (NLSI0000105, with spatial dimension \(13483\times 17943\)). The I/O time is calculated as the difference between the overall runtime and the pure detection time. As mentioned in Sect. 2.2, memory mode is slightly faster than file mode because it requires fewer hardware interrupt invocations. Note that the prefetching technique does not truly reduce the I/O time; it hides most of it inside the detection time, since caching and detection occur simultaneously. For a \(10^{8}\)-pixel whole-slide image, our technique thus hides \(95\,\%\) of the I/O time compared with file mode, because the only exposed I/O time under prefetching is the time to read the first cached image.

3.4 Parallel and Distributed Computing

In this subsection, we report experimental results on several whole-slide images. We randomly selected five whole-slide images, in Aperio SVS format, from the NLST and TCGA [11] data sets, varying in size from \(10^{8}\) to \(10^{10}\) pixels. To show the efficiency of our method, we conducted experiments on all five whole-slide images on a single workstation with an Intel(R) Core(TM) i7-5930K CPU @ 3.50 GHz, 64 gigabytes of RAM, a 1 TB Samsung(R) 950 Pro solid-state drive and four Nvidia Titan X GPUs. Table 3 shows the overall runtime of cell detection on these whole-slide images. On a single workstation, our method detects cells in a whole-slide image of size around \(10^4\times 10^4\) (NLSI0000105) in 20 s. Since the detection result for this whole-slide image includes approximately 200,000 cells, our method detects nearly 10,000 cells per second on average on a single workstation, while the original work [4] detects only approximately 6 cells per second, a 1,500x speedup.

Table 3. Time comparison on single workstation (in seconds)

The behavior of our method in a distributed computing environment is demonstrated on the TACC Stampede GPU cluster. Each node is equipped with two 8-core Intel Xeon E5-2680 2.7 GHz CPUs, 32 gigabytes of RAM and a single Nvidia K20 GPU. We show distributed results only for the last four images from Table 3, since the first image is too small to be sliced into 32 pieces. Table 4 shows that our method detects cells in a whole-slide image (TCGA-38-4627) with nearly \(10^{10}\) pixels within 155.87 s. Directly applying the original work takes approximately 400 h (1,440,000 s) even without counting the disk I/O time. Our method thus achieves nearly a 10,000x speedup compared with naively applying [4]. The near-linear speedup also demonstrates scalability and communication efficiency, since our sampling strategy removes most communication overhead.

Table 4. Time comparison on multi-node cluster (in seconds)

4 Conclusions

In this paper, a generalized distributed deep neural network framework is introduced to detect cells in whole-slide histopathological images. The framework can be applied with any deep convolutional neural network pixel-wise cell detector, and is heavily optimized for detecting cells in whole-slide images in distributed environments. We utilize a sparse kernel neural network forwarding technique to remove nearly all redundant convolution computations. An asynchronous prefetching technique is introduced to hide most of the disk I/O time spent loading the large histopathological images into memory. Furthermore, an unbalanced distributed sampling strategy is presented to ensure the scalability and communication efficiency of our framework. These techniques constitute the three pillars of our framework. Extensive experiments demonstrate that our method detects approximately 10,000 cells per second on a single workstation, which is encouraging for high-throughput cell data. Beyond enabling high-speed cell detection, our results can be expected to benefit further pathological analyses, e.g. feature extraction [12].