
1 Introduction

Recently, the cell detection problem has attracted increasing interest in the research community. A large number of cell detection methods for small images (with around \(10^{4}\) to \(10^6\) pixels) have been proposed [1–4]. Due to the recent success of deep convolutional neural networks in imaging, several deep neural network based methods have been proposed for cell-related applications in the past few years [2–4]. While these methods have achieved great success on small images, very few of them are ready to be applied to practical whole-slide cell detection, since real whole-slide images usually have \(10^{8}\) to \(10^{10}\) pixels. Directly applying deep learning cell detection methods [2–4] takes several weeks to detect cells in a single whole-slide image, which is prohibitive in practice.

To alleviate this issue, we propose a generalized distributed deep convolutional neural network framework for pixel-wise cell detection. Our framework accelerates any deep convolutional neural network pixel-wise cell detector. We first improve the forward speed of the deep convolutional neural network with the sparse kernel technique; similar techniques are described in [5, 6]. To reduce disk I/O time, we propose a novel asynchronous prefetching technique. The separable iteration behavior further calls for a scalable and communication-efficient distributed and parallel computing framework to accelerate the detection process on whole-slide images. We therefore propose an unbalanced distributed sampling strategy over two spatial dimensions, extending the balanced cutting in [7]. The combination of these techniques yields a speedup of up to 10,000x in practice.

To the best of our knowledge, the research presented in this paper represents the first attempt to develop an extremely efficient deep neural network based pixel-wise cell detection framework for whole-slide images. In particular, it is general enough to work with any deep convolutional neural network on whole-slide imaging. Our technical contributions are summarized as follows: (1) A general sparse kernel neural network model is applied for pixel-wise cell detection, accelerating the forward procedure of deep convolutional neural networks. (2) An asynchronous prefetching technique is proposed to eliminate nearly \(95\,\%\) of the disk I/O time. (3) We propose a scalable and communication-efficient framework to extend our neural network to multi-GPU and cluster environments, dramatically accelerating the entire detection process. Extensive experiments demonstrate the efficiency and effectiveness of our method.

2 Methodology

2.1 Sparse Kernel Convolutional Neural Network

The sparse kernel network takes the whole tile image, instead of a pixel-centered patch, as input and predicts the whole label map with a single pass of the accelerated forward propagation. It uses the same weights as the original network trained in the training stage, and thus generates exactly the same results as the original pixel-wise detector. To achieve this, we incorporate the k-sparse kernel technique [6] for convolution and blended max-pooling layers into our approach. The k-sparse kernels are created by inserting all-zero rows and columns into the original kernels so that every two originally neighboring entries become k pixels apart. In [6], however, it remains unclear how to handle fully connected layers; we complete this in our research. A fully connected layer is treated as a special convolution layer whose kernel size equals the input dimension and whose number of kernels equals the output dimension of the fully connected layer. This special convolution layer generates exactly the same output as the fully connected layer given the same input. The conversion algorithm is summarized in Algorithm 1.

[Algorithm 1: conversion from the original network to the sparse kernel network]
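As an illustration, the k-sparse kernel construction described above can be sketched in a few lines of NumPy. The helper name `k_sparse_kernel` is ours, for illustration only, and is not part of the paper's implementation:

```python
import numpy as np

def k_sparse_kernel(kernel, k):
    """Insert all-zero rows and columns into a 2-D kernel so that
    originally neighboring entries end up k pixels apart."""
    h, w = kernel.shape
    sparse = np.zeros(((h - 1) * k + 1, (w - 1) * k + 1), dtype=kernel.dtype)
    sparse[::k, ::k] = kernel  # original weights land on a stride-k grid
    return sparse

base = np.array([[1, 2],
                 [3, 4]])
print(k_sparse_kernel(base, 2))
# [[1 0 2]
#  [0 0 0]
#  [3 0 4]]
```

Convolving with the sparsified kernel on the full tile reproduces the responses the original kernel would give on a k-subsampled grid, which is what allows one forward pass to cover all pixel positions.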

2.2 Asynchronous Prefetching

Compared with other procedures in the whole cell detection process, e.g. memory transfer between GPU and CPU memory, disk I/O is a bottleneck in the cell detection problem. In this subsection, we describe our asynchronous prefetching technique, which reduces frequent I/O operations while avoiding out-of-memory problems. We first load a relatively large image, referred to as a cached image (e.g., \(4096\times 4096\)), into memory. While detecting cells on the first cached image tile by tile, we immediately start loading the second cached image in another thread. Since reading is usually faster than detection, by the time detection on the first cached image finishes, the second cached image is already loaded; we can start detection on it and immediately load the next cached image. Hence, the reading time of the second cached image, and of all cached images thereafter, is hidden from the overall runtime. Experiments show that this technique hides approximately \(95\,\%\) of the disk I/O time. The speedup is even larger on a cluster, where NFS (Network File System) operations are more time-consuming and most of them are hidden.
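The double-buffering idea above can be sketched with Python threads. Here `load_fn` and `detect_fn` stand in for the actual cached-image reader and the network forward pass; both names are assumptions for illustration, not the paper's API:

```python
import threading
from queue import Queue

def detect_with_prefetch(paths, load_fn, detect_fn, depth=2):
    """Overlap disk reads with detection: a loader thread keeps a small
    queue of cached images filled while the main thread detects cells."""
    queue = Queue(maxsize=depth)

    def loader():
        for path in paths:
            queue.put(load_fn(path))  # blocks if detection falls behind
        queue.put(None)               # sentinel: no more cached images

    threading.Thread(target=loader, daemon=True).start()
    results = []
    while (cached := queue.get()) is not None:
        results.append(detect_fn(cached))
    return results
```

The bounded queue (`depth=2`) is what prevents the insufficient-memory problem: at most two cached images are resident at once, and only the read of the very first image is exposed in the overall runtime.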

2.3 Multi-GPU Parallel and Distributed Computing

When considering distributed optimization, two resources are at play: (1) the amount of processing on each machine, and (2) the communication between machines. Single-machine performance has been optimized in Sects. 2.1 and 2.2. We now describe the unbalanced distributed sampling strategy with two spatial dimensions used in our framework, a gentle extension of [7]. Let \(T=\{(1,1), (1,2), \ldots , (H, W)\}\) be the index set of an image of size \(H\times W\); we aim to sample tiles of size not larger than \(h\times w\).

Unbalanced Partitioning. Let \(S:=\lceil HW/C\rceil \). We first partition the index set \(T\) into a set of blocks \(P^{(1)}, P^{(2)}, \ldots , P^{(C)}\) according to the following criteria:

  1. \(T = \bigcup _{c=1}^C P^{(c)}\),

  2. \(P^{(c')} \bigcap P^{(c'')} = \varnothing \), for \(c' \ne c''\),

  3. \(|P^{(c)}| \le S\),

  4. \(P^{(c)}\) is connected.

Sampling. After partitioning, we sample small tiles from the C different machines and devices. For each \(c\in \{1,\ldots , C\}\), we draw \(\hat{Z}^{(c)}\), a connected subset of \(P^{(c)}\) satisfying \(|\hat{Z}^{(c)}| \le hw\) and \(\hat{Z}^{(c')} \bigcap \hat{Z}^{(c'')} = \varnothing \) for \(c' \ne c''\).

The set-valued mapping \(\hat{Z} = \bigcup _{c=1}^{C} \hat{Z}^{(c)}\) is termed \((C, h, w)\)-unbalanced sampling and is used to fully sample tile images from the entire image. Note that this is not a subsampling process, since all tile images are drawn from the whole slide in one data pass. Because only index sets are transmitted among the machines, the communication cost of network transfer is very low. This distributed sampling strategy also ensures the scalability of the proposed framework, as shown in Sect. 3.4.
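The index-only tiling can be sketched as follows. For illustration we assume blocks are contiguous row strips of the image grid, which keeps every block connected; the paper's actual partition may shape blocks differently. Only coordinates, never pixels, are produced, which is why the communication cost stays low:

```python
import math

def unbalanced_tiles(H, W, C, h, w):
    """Partition the H x W index grid into C connected row strips and
    tile each strip into rectangles no larger than h x w. Returns
    (machine, top, left, bottom, right) tuples with half-open bounds."""
    rows_per_strip = math.ceil(H / C)
    tiles = []
    for c in range(C):
        top = c * rows_per_strip
        bottom = min(top + rows_per_strip, H)
        for r in range(top, bottom, h):
            for s in range(0, W, w):
                tiles.append((c, r, s, min(r + h, bottom), min(s + w, W)))
    return tiles
```

Every pixel index is covered exactly once in one data pass, and each machine `c` processes only tiles carved from its own block, so no pixel data crosses the network.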

3 Experiments

3.1 Experiment Setup

Throughout the experiment section, we use a variant [4, 8] of LeNet [9] as the pixel-wise classifier to show the effectiveness and efficiency of our framework. We implemented our framework based on Caffe [10] and MPI. The original network structure is shown in Table 1 (left). The classifier is designed to classify a \(20\times 20\) patch centered at a specific pixel and predict the probability that the pixel lies in a cell region. Applying Algorithm 1, we obtain the accelerated network shown on the right of Table 1, which detects cells on a tile image of size \(512\times 512\). Since the classifier deals with \(20\times 20\) image patches, we mirror-pad the original \(512\times 512\) tile image to a \(531\times 531\) image.
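The padded size follows directly from the patch size: to classify every pixel of the tile, including border pixels, the tile must grow by \(\text{patch} - 1\) pixels in each spatial dimension. A one-line sketch (the function name is ours, for illustration):

```python
def padded_size(tile_size, patch_size):
    """A patch-based classifier centered on every tile pixel needs
    patch_size - 1 extra mirrored pixels per spatial dimension."""
    return tile_size + patch_size - 1

print(padded_size(512, 20))  # 531, matching the 531 x 531 padded tile
```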

Table 1. Original LeNet Classifier (left) and accelerated forward (right) network architecture. M: the training batch size, N: the testing batch size. Layer type: I - Input, C - Convolution, MP - Max Pooling, ReLU - Rectified Linear Unit, FC - Fully Connected

3.2 Effectiveness Validation

Our framework can be applied to any convolutional neural network for pixel-wise cell detection, e.g., [2–4]. Thus, the effectiveness of our framework depends directly on the performance of the original deep neural network designed for small-scale cell detection. In this subsection, we validate the consistency of results between our framework and the original work [4]. We conduct experiments on 215 tile images of size \(512\times 512\) sampled from NLST whole-slide images, with 83,245 cell object annotations. These tile images are partitioned into three subsets: a training set (143 images), a testing set (62 images) and an evaluation set (10 images). The neural network model was trained on the training set with the original network described in Table 1 (left). We then applied Algorithm 1 to convert the original network into our framework. This experiment was conducted on a workstation with an Intel(R) Xeon(R) E5-2620 v2 @ 2.10 GHz CPU, 32 gigabytes of RAM, and a single Nvidia K40 GPU.

For quantitative analysis, we used a precision-recall-\(F_1\) score evaluation metric to measure the performance of the two methods. Since the proposed method detects the rough cell area, we calculated the raw image moment centroid as the approximate nucleus location. Each detected cell centroid is associated with the nearest ground-truth annotation. A detected cell centroid is considered a True Positive (TP) if the Euclidean distance between it and the ground-truth annotation is less than 8 pixels; otherwise, it is considered a False Positive (FP). Missed ground-truth dots are counted as False Negatives (FN). We compute the \(F_1\) score \(F_1 = 2PR/(P+R)\), where precision \(P=TP/(TP+FP)\) and recall \(R=TP/(TP+FN)\). The precision, recall and \(F_1\) score of the original work and our framework are reported in Table 2.
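The matching and scoring protocol above can be sketched as follows. The greedy nearest-match shown is one reasonable reading of the protocol, not necessarily the authors' exact implementation:

```python
import math

def evaluate(detected, ground_truth, max_dist=8.0):
    """Greedily match each detected centroid to its nearest unmatched
    ground-truth dot; matches within max_dist pixels are true positives."""
    unmatched = list(ground_truth)
    tp = 0
    for cx, cy in detected:
        if not unmatched:
            break
        g = min(unmatched, key=lambda p: math.hypot(p[0] - cx, p[1] - cy))
        if math.hypot(g[0] - cx, g[1] - cy) < max_dist:
            tp += 1
            unmatched.remove(g)  # each annotation is matched at most once
    fp = len(detected) - tp
    fn = len(ground_truth) - tp
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```

For example, one detection near an annotation plus one spurious detection gives \(P=0.5\), \(R=1.0\), \(F_1=2/3\).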

Table 2. Quantitative comparison between original work and our framework

Table 2 also shows the overall runtime (in seconds) and pixel rate (pixels per second). While our framework produces exactly the same results as the original work, the overall speed increased by approximately 400x on small-scale images on a single GPU device. This is expected, since our method removes most of the redundant convolution computation shared among neighboring pixel patches.

Fig. 1. I/O time comparison among memory, file and proposed asynchronous prefetching modes (in seconds)

3.3 Prefetching Speedup

In this subsection, we validate the effectiveness of the proposed asynchronous prefetching technique. Figure 1 shows the disk I/O time comparison among memory, file and prefetching modes on a whole-slide image (NLSI0000105, with spatial dimension \(13483\times 17943\)). The I/O time is calculated as the difference between the overall runtime and the pure detection time. As mentioned in Sect. 2.2, memory mode is slightly faster than file mode because it requires fewer hardware interrupt invocations. Note that the prefetching technique does not truly reduce the I/O time; it hides most of it inside the detection time, since caching and detection occur simultaneously. For a \(10^{8}\)-pixel whole-slide image, our technique thus hides \(95\,\%\) of the I/O time compared with file mode, because the only exposed I/O time under prefetching is the time to read the first cached image.

3.4 Parallel and Distributed Computing

In this subsection, we report experimental results on several whole-slide images. We randomly selected five whole-slide images, in Aperio SVS format, from the NLST and TCGA [11] data sets, varying in size from \(10^{8}\) to \(10^{10}\) pixels. To show the efficiency of our method, we conducted experiments on all five whole-slide images on a single workstation with an Intel(R) Core(TM) i7-5930K CPU @ 3.50 GHz, 64 gigabytes of RAM, a 1 TB Samsung(R) 950 Pro solid-state drive and four Nvidia Titan X GPUs. Table 3 shows the overall runtime of cell detection on these whole-slide images. On a single workstation, our method detects cells in a whole-slide image of size around \(10^4\times 10^4\) (NLSI0000105) in 20 s. Since the detection result for this whole-slide image includes approximately 200,000 cells, our method detects nearly 10,000 cells per second on average on a single workstation, while the original work [4] detects only approximately 6 cells per second, a 1,500x speedup.

Table 3. Time comparison on single workstation (in seconds)

The behavior of our method in a distributed computing environment is demonstrated on the TACC Stampede GPU cluster. Each node is equipped with two 8-core Intel Xeon E5-2680 2.7 GHz CPUs, 32 gigabytes of RAM and a single Nvidia K20 GPU. We show distributed results only for the last four images from Table 3, since the first image is too small to be sliced into 32 pieces. Table 4 shows that our method detects cells in a whole-slide image (TCGA-38-4627) with nearly \(10^{10}\) pixels within 155.87 s. Directly applying the original work takes approximately 400 h (1,440,000 s) even without counting the disk I/O time. Our method thus achieves nearly a 10,000x speedup compared with naively applying [4]. The near-linear speedup also demonstrates scalability and communication efficiency, since our sampling strategy removes most communication overhead.

Table 4. Time comparison on multi-node cluster (in seconds)

4 Conclusions

In this paper, a generalized distributed deep neural network framework is introduced to detect cells in whole-slide histopathological images. The framework can be applied with any deep convolutional neural network pixel-wise cell detector, and is heavily optimized for detecting cells in whole-slide images in distributed environments. We utilize a sparse kernel neural network forwarding technique to remove nearly all redundant convolution computations. An asynchronous prefetching technique is introduced to hide most of the disk I/O time spent loading the large histopathological images into memory. Furthermore, an unbalanced distributed sampling strategy is presented to ensure the scalability and communication efficiency of our framework. These techniques constitute the three pillars of our framework. Extensive experiments demonstrate that our method detects approximately 10,000 cells per second on a single workstation, which is encouraging for high-throughput cell data. Beyond enabling high-speed cell detection, our results can be expected to benefit further pathological analyses, e.g. feature extraction [12].