Abstract
In this paper, we present a generalized distributed deep neural network architecture to detect cells in whole-slide high-resolution histopathological images, which usually contain \(10^{8}\) to \(10^{10}\) pixels. Our framework can adapt and accelerate any deep convolutional neural network pixel-wise cell detector to perform whole-slide cell detection within a reasonable time limit. We accelerate the convolutional neural network forwarding through a sparse kernel technique, eliminating almost all of the redundant computation among connected patches. Since disk I/O becomes a bottleneck as the image size grows, we propose an asynchronous prefetching technique that eliminates a large portion of the disk I/O time. An unbalanced distributed sampling strategy is proposed to enhance the scalability and communication efficiency in distributed computing. Blending the advantages of the sparse kernel, asynchronous prefetching and distributed sampling techniques, our framework accelerates the conventional convolutional deep learning method by nearly 10,000 times with the same accuracy. Specifically, our method detects cells in a \(10^{8}\)-pixel (\(10^4\times 10^4\)) image in 20 s (approximately 10,000 cells per second) on a single workstation, which is an encouraging result in whole-slide imaging practice.
J. Huang—This work was partially supported by U.S. NSF IIS-1423056, CMMI-1434401, CNS-1405985.
Keywords
- Convolutional Neural Network
- Cell Detection
- Original Network
- Deep Neural Network
- Histopathological Image
1 Introduction
Recently, the cell detection problem has attracted increasing interest in the research community. A large number of cell detection methods for small images (with around \(10^{4}\) to \(10^6\) pixels) have been proposed [1–4]. Owing to the recent success of deep convolutional neural networks in image analysis, several deep neural network based methods have been proposed for cell-related applications in the past few years [2–4]. While these methods have achieved great success on small images, very few of them are ready for practical whole-slide cell detection, because real whole-slide images usually contain \(10^{8}\) to \(10^{10}\) pixels. Directly applying the deep learning cell detection methods [2–4] takes several weeks to detect cells in a single whole-slide image, which is prohibitive in practice.
To alleviate this issue, we propose a generalized distributed deep convolutional neural network framework for pixel-wise cell detection. Our framework accelerates any deep convolutional neural network pixel-wise cell detector. In the proposed framework, we first improve the forwarding speed of the deep convolutional neural network with the sparse kernel technique; similar techniques appear in [5, 6]. To reduce the disk I/O time, we propose a novel asynchronous prefetching technique. The separable per-tile computation also calls for a scalable and communication-efficient distributed and parallel computing framework to further accelerate detection on whole-slide images. We therefore propose an unbalanced distributed sampling strategy over two spatial dimensions, extending the balanced cutting in [7]. The combination of these techniques yields a speedup of up to 10,000x in practice.
To the best of our knowledge, the research presented in this paper represents the first attempt to develop an extremely efficient deep neural network based pixel-wise cell detection framework for whole-slide images. In particular, it is general enough to cooperate with any deep convolutional neural network on whole-slide imaging. Our technical contributions are summarized as follows: (1) a general sparse kernel neural network model is applied for pixel-wise cell detection, accelerating the forwarding procedure of deep convolutional neural networks; (2) an asynchronous prefetching technique is proposed to reduce nearly \(95\,\%\) of the disk I/O time; (3) we propose a scalable and communication-efficient framework that extends our neural network to multi-GPU and cluster environments, dramatically accelerating the entire detection process. Extensive experiments have been conducted to demonstrate the efficiency and effectiveness of our method.
2 Methodology
2.1 Sparse Kernel Convolutional Neural Network
The sparse kernel network takes a whole tile image, instead of a pixel-centered patch, as input and predicts the whole label map with just one pass of the accelerated forward propagation. It uses the same weights as the original network trained in the training stage and generates exactly the same results as the original pixel-wise detector. To achieve this, we incorporate the k-sparse kernel technique [6] for the convolution and max-pooling layers into our approach. The k-sparse kernels are created by inserting all-zero rows and columns into the original kernels so that every two originally neighboring entries become k pixels apart. In [6], however, it remains unclear how to handle fully connected layers, which we address in this work: a fully connected layer is treated as a special convolution layer whose kernel size equals the input dimension and whose number of kernels equals the output dimension of the fully connected layer. This special convolution layer generates exactly the same output as the fully connected layer given the same input. The conversion algorithm is summarized in Algorithm 1.
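As a minimal NumPy sketch, the two conversions above (zero insertion for a single-channel 2-D kernel, and the fully-connected-to-convolution weight reshape) can be written as follows; the helper names are ours for illustration, not the notation of Algorithm 1:

```python
import numpy as np

def k_sparse_kernel(kernel: np.ndarray, k: int) -> np.ndarray:
    """Insert all-zero rows/columns so that originally neighboring
    kernel entries end up k pixels apart (k = 1 leaves it unchanged)."""
    kh, kw = kernel.shape
    sparse = np.zeros(((kh - 1) * k + 1, (kw - 1) * k + 1), dtype=kernel.dtype)
    sparse[::k, ::k] = kernel    # original entries land on a stride-k grid
    return sparse

def fc_to_conv_weights(W: np.ndarray, in_shape):
    """View a fully connected weight matrix W (out_dim x in_dim) as a
    bank of out_dim convolution kernels whose spatial size equals the
    FC layer's input feature map (channels, height, width)."""
    c, d1, d2 = in_shape
    out_dim = W.shape[0]
    return W.reshape(out_dim, c, d1, d2)
```

Applied to the same input, a convolution with `fc_to_conv_weights(W, in_shape)` at a single spatial position reproduces the FC layer's output, which is exactly the property the conversion relies on.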
2.2 Asynchronous Prefetching
Compared with the other procedures in the whole cell detection process, e.g., the memory transfer between GPU and CPU memory, disk I/O is a bottleneck in the cell detection problem. In this subsection, we describe our asynchronous prefetching technique to relieve this bottleneck while avoiding out-of-memory problems caused by loading too much at once. We first load a relatively large image (e.g., \(4096\times 4096\)), referred to as a cached image, into memory. While we detect cells on the first cached image tile by tile, we immediately start loading the second cached image in another thread. Since reading is usually faster than detection, the second cached image is already in memory when detection on the first one finishes, so we can start detecting on it and loading the next cached image immediately. Hence, the reading time of the second cached image, as well as of all cached images thereafter, is hidden from the overall runtime. Experiments show that this technique removes approximately \(95\,\%\) of the disk I/O time. It achieves an even larger speedup on a cluster, where NFS (Network File System) operations are even more time-consuming and most of them are eliminated.
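The prefetching idea can be sketched with a bounded queue and a background loader thread; this is an illustrative Python sketch (the function names and the `depth` parameter are ours), not our actual Caffe/MPI implementation:

```python
import queue
import threading

def detect_with_prefetch(cached_paths, load, detect, depth=2):
    """Overlap disk reads with detection: a background thread keeps up
    to `depth` cached images loaded ahead of the detector thread."""
    q = queue.Queue(maxsize=depth)
    DONE = object()                       # sentinel marking end of input

    def producer():
        for path in cached_paths:
            q.put(load(path))             # blocks once `depth` images are cached
        q.put(DONE)

    threading.Thread(target=producer, daemon=True).start()

    results = []
    while True:
        img = q.get()                     # only the first get() waits on real I/O
        if img is DONE:
            break
        results.append(detect(img))
    return results
```

Because `load` for cached image i+1 runs while `detect` processes image i, only the first read is exposed in the overall runtime, matching the behavior described above.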
2.3 Multi-GPU Parallel and Distributed Computing
When considering distributed optimization, two resources are at play: (1) the amount of processing on each machine, and (2) the communication between machines. Single-machine performance has been optimized in Sects. 2.1 and 2.2. We now describe the unbalanced distributed sampling strategy over two spatial dimensions used in our framework, a natural extension of [7]. Let \(T=\{(1,1), (1,2), \ldots , (H, W)\}\) be the index set of an image of size \(H\times W\); we aim to sample tiles of size no larger than \(h\times w\).
Unbalanced Partitioning. Let \(S:=\lceil HW/C\rceil \). We first partition the index set T into a set of blocks \(P^{(1)}, P^{(2)}, \ldots , P^{(C)}\) according to the following criteria:

1. \(T = \bigcup _{c=1}^C P^{(c)}\);
2. \(P^{(c')} \bigcap P^{(c'')} = \varnothing \) for \(c' \ne c''\);
3. \(|P^{(c)}| \le S\);
4. \(P^{(c)}\) is connected.
Sampling. After partitioning, we sample small tiles from the C different machines and devices. For each \(c\in \{1,\ldots , C\}\), \(\hat{Z}^{(c)}\) is a connected subset of \(P^{(c)}\) satisfying \(|\hat{Z}^{(c)}| \le hw\) and \(\hat{Z}^{(c')} \bigcap \hat{Z}^{(c'')} = \varnothing \) for \(c' \ne c''\).
The set-valued mapping \(\hat{Z} = \bigcup _{c=1}^{C} \hat{Z}^{(c)}\) is termed (C, hw)-unbalanced sampling and is used to fully sample tile images from the entire image. Note that this is not a subsampling process, since all tile images are drawn from the whole slide in one data pass. Because only index sets are transmitted among the machines, the network communication cost is very low. This distributed sampling strategy also ensures the scalability of the proposed framework, as shown in Sect. 3.4.
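One simple way to realize such a partition and tiling is sketched below. The rectangular grid factorization is our illustrative choice (any partition satisfying criteria 1-4 would do, and block sizes meet the bound S only up to rounding at the grid boundaries); the function names are hypothetical:

```python
import math

def unbalanced_partition(H, W, C):
    """Split the H x W pixel-index grid into exactly C connected,
    disjoint rectangular blocks, balanced up to boundary rounding."""
    gr = int(math.sqrt(C))
    while C % gr:                  # pick a grid factorization gr * gc = C
        gr -= 1
    gc = C // gr
    rows = [round(i * H / gr) for i in range(gr + 1)]
    cols = [round(j * W / gc) for j in range(gc + 1)]
    return [(rows[i], rows[i + 1], cols[j], cols[j + 1])
            for i in range(gr) for j in range(gc)]

def tiles_of_block(block, h, w):
    """Enumerate the tiles (each at most h x w) that exactly cover one
    block; machine c enumerates only its own block, so only index
    sets ever need to be communicated."""
    r0, r1, c0, c1 = block
    return [(r, min(r + h, r1), c, min(c + w, c1))
            for r in range(r0, r1, h) for c in range(c0, c1, w)]
```

Each machine receives one block's index range and generates its own tile indices locally, which is why the communication cost stays negligible.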
3 Experiments
3.1 Experiment Setup
Throughout this section, we use a variant [4, 8]Footnote 1 of LeNet [9] as the pixel-wise classifier to show the effectiveness and efficiency of our framework. We have implemented our framework based on Caffe [10] and MPI. The original network structure is shown in Table 1 (left). The classifier is designed to classify a \(20\times 20\) patch centered at a specific pixel and predict the probability that the pixel lies in a cell region. Applying Algorithm 1, we obtain the accelerated network shown on the right of Table 1, which detects cells on a tile image of size \(512\times 512\). Since the classifier operates on \(20\times 20\) image patches, we mirror-pad the original \(512\times 512\) tile image to a \(531\times 531\) image.
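The padding arithmetic follows from the patch size: to center a p-sized patch on every pixel, the tile needs p - 1 extra pixels per axis, so 512 + 19 = 531. A small NumPy sketch (the helper name is ours):

```python
import numpy as np

def mirror_pad_tile(tile: np.ndarray, patch: int) -> np.ndarray:
    """Mirror-pad a 2-D tile so a patch-sized pixel-wise classifier can
    be centered on every original pixel (total padding = patch - 1)."""
    before = (patch - 1) // 2
    after = patch - 1 - before
    return np.pad(tile, ((before, after), (before, after)), mode="reflect")
```

For the \(20\times 20\) classifier this turns a \(512\times 512\) tile into the \(531\times 531\) input used above.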
3.2 Effectiveness Validation
Our framework can be applied to any convolutional neural network for pixel-wise cell detection, e.g., [2–4]. Thus, the effectiveness of our framework depends largely on the performance of the original deep neural network designed for small-scale cell detection. In this subsection, we validate the consistency of results between our framework and the original work [4]. We conduct experiments on 215 tile images of size \(512\times 512\) sampled from the NLSTFootnote 2 whole-slide images, with 83,245 cell object annotations. These tile images are partitioned into three subsets: a training set (143 images), a testing set (62 images) and an evaluation set (10 images). The neural network model was trained on the training set with the original network described in Table 1 (left). We then applied Algorithm 1 to convert the original network into our framework. This experiment was conducted on a workstation with an Intel(R) Xeon(R) E5-2620 v2 @ 2.10 GHz CPU, 32 GB of RAM, and a single Nvidia K40 GPU.
For quantitative analysis, we used a precision-recall-\(F_1\)-score evaluation metric to measure the performance of the two methods. Since the proposed method detects a rough cell area, we calculated the raw image moment centroid as its approximate nucleus location. Each detected cell centroid is associated with the nearest ground-truth annotation. A detected cell centroid is counted as a True Positive (TP) if the Euclidean distance between the detected centroid and the ground-truth annotation is less than 8 pixels; otherwise, it is counted as a False Positive (FP). Missed ground-truth dots are counted as False Negatives (FN). We report the \(F_1\) score \(F_1 = 2PR/(P+R)\), where precision \(P=TP/(TP+FP)\) and recall \(R=TP/(TP+FN)\). The precision, recall and \(F_1\) score of the original work and our framework are given in Table 2.
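This evaluation protocol can be sketched as follows. We assume a greedy one-to-one matching between detections and annotations (the text leaves the matching discipline implicit), and the function name is illustrative:

```python
import numpy as np

def detection_f1(detected, ground_truth, max_dist=8.0):
    """Greedy nearest-neighbor matching: a detection is a TP if its
    nearest still-unmatched ground-truth dot is within max_dist pixels;
    leftovers are FPs (detections) and FNs (annotations)."""
    gt = list(ground_truth)
    tp = fp = 0
    for d in detected:
        if gt:
            dists = [np.hypot(d[0] - g[0], d[1] - g[1]) for g in gt]
            j = int(np.argmin(dists))
            if dists[j] < max_dist:
                tp += 1
                gt.pop(j)          # each annotation matches at most once
                continue
        fp += 1
    fn = len(gt)
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1
```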
Table 2 also shows the overall runtime (in seconds) and pixel rate (pixels per second). While our framework produced exactly the same result as the original work, the overall speed increased by approximately 400 times on small-scale images on a single GPU device. This is expected, since our method eliminates most of the redundant convolution computation among neighboring pixel patches.
3.3 Prefetching Speedup
In this subsection, we validate the effectiveness of the proposed asynchronous prefetching technique. Figure 1 shows the disk I/O time comparison among the memory, file and prefetching modes on a whole-slide image (NLSI0000105, with spatial dimension \(13483\times 17943\)). The I/O time is calculated as the difference between the overall runtime and the pure detection time. As mentioned in Sect. 2.2, memory mode is slightly faster than file mode because it requires fewer hardware interrupt invocations. Note that the prefetching technique does not truly reduce the I/O time; rather, it hides most of it behind the detection time, since caching and detection occur simultaneously. For a \(10^{8}\)-pixel whole-slide image, our technique thus diminishes (or hides) \(95\,\%\) of the I/O time compared with file mode, because the only exposed I/O time under prefetching is the reading of the first cached image.
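The \(95\,\%\) figure is consistent with a simple timing model: roughly 20 cached images of size \(4096\times 4096\) cover this slide, and under prefetching only the first read is exposed, so about 19/20 = 95% of the read time is hidden. A sketch of the model (our own back-of-envelope arithmetic, assuming each read is no slower than the detection it overlaps):

```python
def exposed_io_time(n_cached, read_time, detect_time):
    """Timing model for prefetching: without it every read is exposed;
    with it only the first read is, provided read_time <= detect_time
    so later reads fully overlap detection."""
    assert read_time <= detect_time
    without_prefetch = n_cached * (read_time + detect_time)
    with_prefetch = read_time + n_cached * detect_time
    return without_prefetch, with_prefetch
```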
3.4 Parallel and Distributed Computing
In this subsection, we present experimental results on several whole-slide images. We randomly selected five whole-slide images, in Aperio SVS format, from the NLST and TCGA [11] data sets, varying in size from \(10^{8}\) to \(10^{10}\) pixels. To show the efficiency of our method, we conducted experiments on all five whole-slide images on a single workstation with an Intel(R) Core(TM) i7-5930K CPU @ 3.50 GHz, 64 GB RAM, a 1 TB Samsung(R) 950 Pro solid-state drive and four Nvidia Titan X GPUs. Table 3 shows the overall runtime of cell detection on these whole-slide images. On a single workstation, our method detects cells in a whole-slide image of size around \(10^4\times 10^4\) (NLSI0000105) in 20 s. Since the detection result for this whole-slide image includes approximately 200,000 cells, our method detects nearly 10,000 cells per second on average on a single workstation, while the original work [4] detects only approximately 6 cells per second, a 1,500-times speedup.
The performance of our method in a distributed computing environment is demonstrated on the TACC Stampede GPU clusterFootnote 3. Each node is equipped with two 8-core Intel Xeon E5-2680 2.7 GHz CPUs, 32 GB RAM and a single Nvidia K20 GPU. We report distributed results only for the last four images in Table 3, since the first image is too small to be sliced into 32 pieces. Table 4 shows that our method detects cells in a whole-slide image (TCGA-38-4627) with nearly \(10^{10}\) pixels within 155.87 s. Directly applying the original work would take approximately 400 h (1,440,000 s) even without counting the disk I/O time. Our method thus achieves a nearly 10,000-times speedup over naively applying [4]. The linear speedup also demonstrates the scalability and communication efficiency of the framework, since our sampling strategy removes most of the communication overhead.
4 Conclusions
In this paper, a generalized distributed deep neural network framework is introduced to detect cells in whole-slide histopathological images. The framework can be applied with any deep convolutional neural network pixel-wise cell detector and is highly optimized for distributed environments. We utilize a sparse kernel neural network forwarding technique to eliminate nearly all redundant convolution computation. An asynchronous prefetching technique is introduced to hide most of the disk I/O time spent loading the large histopathological images into memory. Furthermore, an unbalanced distributed sampling strategy is presented to enhance the scalability and communication efficiency of our framework. These techniques constitute the three pillars of our framework. Extensive experiments demonstrate that our method can detect approximately 10,000 cells per second on a single workstation, an encouraging result for high-throughput cell data. By enabling high-speed cell detection, our work can also benefit further pathological analysis, e.g., feature extraction [12].
Notes
- 1.
The code is publicly available at https://github.com/uta-smile/caffe-fastfpbp. We also provide a web demo for our method at https://celldetection.zhengxu.work/.
- 2.
- 3.
References
Arteta, C., Lempitsky, V., Noble, J.A., Zisserman, A.: Learning to detect cells using non-overlapping extremal regions. In: Ayache, N., Delingette, H., Golland, P., Mori, K. (eds.) MICCAI 2012. LNCS, vol. 7510, pp. 348–356. Springer, Heidelberg (2012). doi:10.1007/978-3-642-33415-3_43
Cireşan, D.C., Giusti, A., Gambardella, L.M., Schmidhuber, J.: Mitosis detection in breast cancer histology images with deep neural networks. In: Mori, K., Sakuma, I., Sato, Y., Barillot, C., Navab, N. (eds.) MICCAI 2013. LNCS, vol. 8150, pp. 411–418. Springer, Heidelberg (2013). doi:10.1007/978-3-642-40763-5_51
Xie, Y., Xing, F., Kong, X., Su, H., Yang, L.: Beyond classification: structured regression for robust cell detection using convolutional neural network. In: Navab, N., Hornegger, J., Wells, W.M., Frangi, A.F. (eds.) MICCAI 2015. LNCS, vol. 9351, pp. 358–365. Springer, Heidelberg (2015). doi:10.1007/978-3-319-24574-4_43
Pan, H., Xu, Z., Huang, J.: An effective approach for robust lung cancer cell detection. In: Wu, G., Coupé, P., Zhan, Y., Munsell, B., Rueckert, D. (eds.) Patch-MI 2015. LNCS, vol. 9467, pp. 87–94. Springer, Heidelberg (2015)
Giusti, A., Cireşan, D.C., Masci, J., Gambardella, L.M., Schmidhuber, J.: Fast image scanning with deep max-pooling convolutional neural networks. arXiv preprint arXiv:1302.1700 (2013)
Li, H., Zhao, R., Wang, X.: Highly efficient forward and backward propagation of convolutional neural networks for pixelwise classification. arXiv preprint arXiv:1412.4526 (2014)
Mareček, J., Richtárik, P., Takáč, M.: Distributed block coordinate descent for minimizing partially separable functions. In: Numerical Analysis and Optimization, pp. 261–288. Springer, Switzerland (2015)
Xu, Z., Huang, J.: Efficient lung cancer cell detection with deep convolution neural network. In: Wu, G., Coupé, P., Zhan, Y., Munsell, B., Rueckert, D. (eds.) Patch-MI 2015. LNCS, vol. 9467, pp. 79–86. Springer, Heidelberg (2015)
LeCun, Y., Bottou, L., Bengio, Y., Haffner, P.: Gradient-based learning applied to document recognition. Proc. IEEE 86(11), 2278–2324 (1998)
Jia, Y., Shelhamer, E., Donahue, J., Karayev, S., Long, J., Girshick, R., Guadarrama, S., Darrell, T.: Caffe: Convolutional architecture for fast feature embedding. arXiv preprint arXiv:1408.5093 (2014)
The Cancer Genome Atlas Research Network: Comprehensive molecular profiling of lung adenocarcinoma. Nature 511(7511), 543–550 (2014)
Yao, J., Ganti, D., Luo, X., Xiao, G., Xie, Y., Yan, S., Huang, J.: Computer-assisted diagnosis of lung cancer using quantitative topology features. In: Zhou, L., Wang, L., Wang, Q., Shi, Y. (eds.) MLMI 2015. LNCS, vol. 9352, pp. 288–295. Springer, Heidelberg (2015). doi:10.1007/978-3-319-24888-2_35
Acknowledgments
The authors would like to thank NVIDIA for GPU donation and the National Cancer Institute for access to NCI’s data collected by the National Lung Screening Trial. The statements contained herein are solely of the authors and do not represent or imply concurrence or endorsement by NCI.
© 2016 Springer International Publishing AG
Xu, Z., Huang, J. (2016). Detecting 10,000 Cells in One Second. In: Ourselin, S., Joskowicz, L., Sabuncu, M., Unal, G., Wells, W. (eds) Medical Image Computing and Computer-Assisted Intervention – MICCAI 2016. MICCAI 2016. Lecture Notes in Computer Science(), vol 9901. Springer, Cham. https://doi.org/10.1007/978-3-319-46723-8_78
Print ISBN: 978-3-319-46722-1
Online ISBN: 978-3-319-46723-8