1 Introduction

In digital pathology, computer-aided diagnosis (CAD) has become an essential part of the clinical workflow with the advent of high-resolution whole-slide imaging technology. The fusion of machine learning (ML) based image analysis algorithms and digitized histological slides assists pathologists by reducing their workload, providing efficient decision support [10, 23], and improving the interpretability of results [21]. Given the vast amount of gigapixel-sized whole-slide imaging data and the need to accelerate time-to-insight, there is an increasing demand for automated and scalable pipelines for large-scale, fast image analysis.

Color normalization of stained tissue samples is one of the main preprocessing steps in whole-slide image (WSI) processing [24]. Despite standardized staining protocols, variations in the staining results are very frequent due to differences in staining parameters (e.g., antigen concentration, incubation time and temperature), different conditions between slide scanners, etc. [21]. Such color/intensity variations can adversely affect the performance and accuracy of CAD systems. Therefore, stain normalization techniques have been proposed to generate images with a standardized appearance of the different stains [1, 2, 7,8,9, 11, 15, 17].

In this work, we use the Macenko method [11] to implement a high-performance stain normalization system. The algorithm does not involve intermediate steps that require training of model parameters and is thus computationally less expensive. Our stain normalization system is based on an optimized multi-core implementation of the singular-value decomposition (SVD) based method in [11]. In addition, to support the processing of high-magnification images whose size may not fit in CPU memory (e.g., a 40X magnification WSI of 160 k \(\times \) 80 k pixels corresponds to 37.5 GB of data in RGB), we devise an iterative multi-batch implementation. Furthermore, we design two system flavors: a single-node and a distributed multi-node version. The latter offers a scalable solution that enables large-scale, high-speed processing of high-resolution WSIs using a cluster of nodes with multi-core CPUs. Finally, our implementation supports multiple image formats (e.g., .svs, .tiff, .ndpi), which enables the evaluation of stain normalization on datasets generated by different scanners.

Our contributions are four-fold:

  (a) A high-performance implementation of the Macenko algorithm [11] that enables processing of gigapixel WSIs (magnification 40X and beyond);

  (b) A distributed architecture that uses the computing power of a cluster to further accelerate the stain normalization workload;

  (c) A pipeline that supports multiple execution modes, i.e., single- or multi-node execution, depending on the image size and the system resources, e.g., available RAM and cluster nodes (machines);

  (d) An evaluation of the proposed system on WSI datasets generated by different scanning systems, thus having different image formats and characteristics.

In the next sections, we present the architecture and implementation aspects of our novel stain normalization system. We discuss the optimization steps and the role of the various parameters in the runtime and accuracy of the algorithm.

2 Stain Normalization of Whole-Slide Images

The stain normalization method presented in [11] belongs to the class of unsupervised normalization methods. The algorithm first estimates the hematoxylin and eosin (H&E) stain vectors of the WSI of interest by using an SVD approach on the non-background pixels. Second, it applies a correction to account for intensity variations due to noise. The algorithm is based on the principle that the color of each pixel (RGB channels) is a linear combination of the two stain vectors, which are unknown and need to be estimated.
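
For concreteness, the underlying model can be stated as follows (notation chosen here for illustration; the exact formulation is given in [11]). Let \(I\in \mathbb {R}^3\) be the observed RGB intensities of a pixel, \(I_0\) the illuminating intensity, \(V\in \mathbb {R}^{3\times 2}\) the matrix whose columns are the H&E stain vectors, and \(c\in \mathbb {R}^2\) the pixel's stain concentrations. The Beer-Lambert law gives \(I = I_0\exp (-Vc)\), or equivalently \(\mathrm {OD} = -\log (I/I_0) = Vc\), so the linear combination holds in optical density (OD) space and the estimation problem amounts to recovering \(V\), and subsequently \(c\), from the observed OD values.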

As a reference implementation of the Macenko algorithm [11] we use a publicly available MATLAB implementation [19]. We outline the algorithmic steps in Algorithm 1. Additional details are available in [22].

[Algorithm 1: pseudocode of the Macenko stain normalization steps]

3 High-Performance Stain Normalization Architecture

3.1 Optimized Multi-core Architecture

We develop two optimized single-node implementations: (a) single-batch and (b) multi-batch. The single-batch implementation is intended to minimize the processing time on a single node, while the multi-batch implementation enables processing of 40X WSIs when their size is larger than the RAM of a single node (Table in Fig. 4(c)). Both implementations follow the steps shown in Algorithm 1. To enable multiple optimizations, we reorganize the steps as shown in Fig. 1.

Fig. 1. Mapping of the Macenko Algorithm 1 steps to our optimized implementation

To enable single-batch processing of 40X images on state-of-the-art machines with 64 GB of RAM, only the RGB pixel values are permanently stored in memory. Steps 1–2 of Algorithm 1 are executed multiple times, in the processing blocks A, B and D in Fig. 1, because these blocks operate on non-background pixels in OD space. Step 1 is also executed in block E, because this block requires all pixels in OD space. Step 1 is performed using a 256-entry lookup table that greatly speeds up the logarithm calculation. Moreover, Step 2 removes the background pixels, which significantly reduces both the processing and the memory load. In the multi-batch implementation, batches of RGB pixel values are only temporarily stored in RAM and are read from the file system in blocks A, B, D, and E.
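
The fused Steps 1–2 can be sketched as follows (a minimal sketch, assuming the common convention \(\mathrm {OD}=-\log ((I+1)/256)\) and a background threshold \(\beta =0.15\); the exact constants follow the reference implementation [19], and the OpenMP parallelization with per-thread buffers is omitted for brevity):

```cpp
#include <cmath>
#include <cstdint>
#include <vector>

// Illustrative fusion of Steps 1-2: RGB -> OD via a 256-entry lookup table,
// followed by background removal. Constants are assumptions (see text).
struct ODPixel { double r, g, b; };

std::vector<ODPixel> rgb_to_od_foreground(const uint8_t* rgb, size_t n_pixels) {
    // Step 1: precompute the OD value for every possible 8-bit intensity.
    double lut[256];
    for (int v = 0; v < 256; ++v)
        lut[v] = -std::log((static_cast<double>(v) + 1.0) / 256.0);

    const double beta = 0.15;  // assumed OD threshold for background removal
    std::vector<ODPixel> fg;
    fg.reserve(n_pixels);

    // Step 2: keep only pixels whose OD exceeds beta in every channel.
    // The actual implementation parallelizes this loop with OpenMP and merges
    // per-thread buffers; it is shown serially here for clarity.
    for (size_t p = 0; p < n_pixels; ++p) {
        ODPixel od = { lut[rgb[3 * p]], lut[rgb[3 * p + 1]], lut[rgb[3 * p + 2]] };
        if (od.r > beta && od.g > beta && od.b > beta)
            fg.push_back(od);
    }
    return fg;
}
```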

To speed up the CPU processing we perform the following optimizations:

  (a) In block A, during the SVD calculation, the covariance matrix is computed using the property that its element \((i,j)\), \(\varSigma _{ij}=\frac{1}{N}\sum _{p}{x_{p,i}x_{p,j}}-\frac{1}{N^2}\left(\sum _{p}{x_{p,i}}\right)\left(\sum _{p}{x_{p,j}}\right)\), requires only single-pass sums of the OD components and of their pairwise products.

  (b) In blocks B and D, which are benchmarked as the most time-consuming steps, partial sorting is performed to find the percentiles from Steps 6 and 10 (see the sketch after this list). This partial sorting runs 2–3x faster than a full sort. In the multi-batch implementation, individual batches are partially sorted and then combined using an optimized merging function to calculate the global robust extremes.

  (c) For the exponential function in processing block B, we use the fast exponentiation library [4], since it performs 5–10x faster than the corresponding function in the standard C library.

  (d) Since the processing blocks A, B, D, and E perform many independent operations on individual pixels, their execution is parallelized across all available CPU threads using the OpenMP library [13].
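
The following sketch illustrates the partial-sorting idea of optimization (b): a robust percentile requires only the element at a single rank, so an expected-linear-time selection (here std::nth_element) can replace a full sort. The function name and the rank-rounding convention are illustrative, and the merging of partially sorted batches used in the multi-batch implementation is omitted.

```cpp
#include <algorithm>
#include <vector>

// Robust percentile via partial sorting: only the element at the requested
// rank is placed in its sorted position (expected linear time).
double robust_percentile(std::vector<double> values, double percentile /* 0..100 */) {
    const size_t rank = static_cast<size_t>(
        percentile / 100.0 * (values.size() - 1));
    std::nth_element(values.begin(), values.begin() + rank, values.end());
    return values[rank];
}

// Example: 99th percentile of the projected angles (Step 6) or of the stain
// concentrations (Step 10):
//   double p99 = robust_percentile(angles, 99.0);
```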

Given that the processing blocks B and D are the most time-consuming, due to the difficulty of parallelizing the sorting operation, we propose a further optimization that uses a Monte Carlo sampling technique [5]. In this method, a sample of non-background pixels is randomly chosen from the set of all non-background pixels in order to estimate the required robust extremes from Steps 6 and 10. Although different methods exist for estimating population percentiles [18], estimating the variance of the percentile estimates is unreliable, which makes an analytical derivation of the required sample size difficult [3]. Therefore, we derive the optimal sample size based on the empirical results in Sect. 5.
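
A minimal sketch of the sampling step follows (uniform Bernoulli sampling; the function name and the fixed seed are illustrative, and the sampling rate, e.g., 0.0001 for 0.01% up to 0.01 for 1%, is the parameter studied empirically in Sect. 5):

```cpp
#include <random>
#include <vector>

// Monte Carlo sampling of non-background pixel values: the robust extremes of
// Steps 6 and 10 are then estimated on the sample instead of the full set.
std::vector<double> sample_values(const std::vector<double>& values,
                                  double sampling_rate, unsigned seed = 42) {
    std::mt19937 rng(seed);
    std::bernoulli_distribution keep(sampling_rate);
    std::vector<double> sample;
    sample.reserve(static_cast<size_t>(values.size() * sampling_rate) + 1);
    for (double v : values)
        if (keep(rng)) sample.push_back(v);
    return sample;
}

// Example, combined with the percentile sketch above:
//   double p99 = robust_percentile(sample_values(concentrations, 0.01), 99.0);
```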

3.2 Distributed Architecture

We present a novel distributed implementation of the Macenko algorithm. This implementation is useful when the WSI size is significantly larger than the RAM of a single node. The distributed solution uses all the optimizations presented in Sect. 3.1. We provide two multi-node implementations: (1) single-batch (the local image partition is fully read once) and (2) multi-batch (the local image partition is split into batches that temporarily reside in RAM and are read multiple times when needed). For inter-node communication we use the MPI library [12].

Figure 2 shows an overview of the multi-node setup. The WSI typically resides in a file system shared across all nodes. We split the WSI into partitions WSI\(_{i}\), \(i\in \{0,1,2,3\}\), and assign each partition to a separate node. Each node reads only its assigned partition. All nodes run the optimized Macenko algorithm on their partitions in parallel. Some processing steps need synchronization across nodes, e.g., Steps 6 and 10 in Algorithm 1; thus, one node needs to aggregate the relevant pixel data and compute, e.g., the global robust percentiles. We call this node the master. Figure 2 shows a cluster of 4 nodes interconnected via a network, each node being assigned a part of the image. All nodes are slaves, and node 0 is also the master. We use this cluster setup for the experiments in Sect. 5.

Fig. 2. Cluster setup and image partitioning scheme.

To describe the image partitioning, we define the partitioning element granularity as a tile (stripe) for the single-batch implementation and a batch (e.g., a set of tiles/stripes) for the multi-batch implementation. A tile (stripe) is a rectangular, contiguous part of the image, e.g., a 256 \(\times \) 256-pixel region. Let N be the number of nodes (indexed from 0). The image is composed of elements (tiles or batches) that form a grid with C columns and L rows, so the total number of elements is \(E = C\times L\). Each element can be uniquely described by two coordinates c and l, where \(c\in \{0, \ldots ,C-1\}\) and \(l\in \{0, \ldots ,L-1\}\), and its index is \(e(c,l) = l\cdot C + c\). The partitioning across nodes assigns element \(e(c,l)\) to node \(e(c,l)\bmod N\). Figure 2 shows a partitioning example with \(N=4\), \(C=5\) and \(L=3\). This partitioning scheme is preferred over assigning contiguous sets of elements to each node because it reduces the node imbalance caused by the input content. For example, with 2 nodes and an image whose top half is background, the contiguous scheme would assign mostly background to node 0, causing a large processing imbalance between the 2 nodes and impacting the scalability of the distributed system. For the multi-node single-batch implementation, given the tile height \(t_{H}\), the tile width \(t_{W}\), the WSI height H and the WSI width W, all expressed in pixels, \(C=\lceil {W/t_W}\rceil \) and \(L=\lceil {H/t_H}\rceil \). For the multi-node multi-batch implementation, the user defines a batch size \(B_{size}\) as the maximum number of pixels that can reside in RAM at any given time. Assuming square batches, a batch has width and height equal to \(\lfloor {\sqrt{B_{size}}}\rfloor \), which implies \(C=\lceil {W/{\lfloor {\sqrt{B_{size}}}\rfloor }}\rceil \) and \(L=\lceil {H/{\lfloor {\sqrt{B_{size}}}\rfloor }}\rceil \).
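
The partitioning bookkeeping can be summarized by the following sketch (names are illustrative; error handling and degenerate cases are omitted):

```cpp
#include <cmath>
#include <cstddef>

// Round-robin assignment of grid elements (tiles or batches) to nodes, as
// described above: element e(c,l) = l*C + c goes to node e(c,l) mod N.
struct Grid { std::size_t C, L; };  // columns and rows of the element grid

// Single-batch: elements are tiles of size tW x tH.
Grid grid_single_batch(std::size_t W, std::size_t H, std::size_t tW, std::size_t tH) {
    return { (W + tW - 1) / tW, (H + tH - 1) / tH };  // C = ceil(W/tW), L = ceil(H/tH)
}

// Multi-batch: square batches with side floor(sqrt(B_size)).
Grid grid_multi_batch(std::size_t W, std::size_t H, std::size_t B_size) {
    const std::size_t side =
        static_cast<std::size_t>(std::floor(std::sqrt(static_cast<double>(B_size))));
    return { (W + side - 1) / side, (H + side - 1) / side };
}

// Node that owns element (c, l).
inline int owner_node(std::size_t c, std::size_t l, std::size_t C, int N) {
    return static_cast<int>((l * C + c) % static_cast<std::size_t>(N));
}
```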

Fig. 3. Communication stages across nodes at different stages of the stain normalization algorithm.

Figure 3 shows the communication flow across nodes at the different algorithm stages. First, each node reads its partition, runs Steps 1–2 of Algorithm 1 and computes its local OD sums and number of non-background pixels. Next, an MPI reduction phase makes the global number of non-background pixels and the global OD sums available on all nodes. Then, each node runs Steps 3–5 and sends its local vector of angles to the master, where the global 1\(^{st}\) and 99\(^{th}\) percentiles of the projected angles are computed. The master then sends these percentiles to all slaves so that they can run Steps 7–9. The master also computes the global 99\(^{th}\) percentiles of the stain concentrations based on the local concentration vectors sent by the slaves. These global percentiles are sent to all slaves, which use them for normalization. Finally, each node writes its image partition to a file.
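
A condensed sketch of this reduction/gather pattern follows, using standard MPI calls (variable names are illustrative, error handling is omitted, and this is not our exact code):

```cpp
#include <mpi.h>
#include <vector>

// Global statistics exchange corresponding to Fig. 3: reduce the OD sums and
// non-background pixel counts, gather the local angle vectors at the master,
// and broadcast the globally computed percentiles back to all nodes.
void exchange_global_statistics(double local_od_sum[3], long long local_fg_count,
                                std::vector<double>& local_angles, int rank) {
    const int MASTER = 0;

    // Global OD sums and non-background pixel count, available on all nodes.
    double global_od_sum[3];
    long long global_fg_count;
    MPI_Allreduce(local_od_sum, global_od_sum, 3, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
    MPI_Allreduce(&local_fg_count, &global_fg_count, 1, MPI_LONG_LONG, MPI_SUM,
                  MPI_COMM_WORLD);

    // Gather the local angle vectors at the master (sizes first, then the data).
    int nprocs;
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
    int local_n = static_cast<int>(local_angles.size());
    std::vector<int> counts(nprocs), displs(nprocs);
    MPI_Gather(&local_n, 1, MPI_INT, counts.data(), 1, MPI_INT, MASTER, MPI_COMM_WORLD);

    std::vector<double> all_angles;
    if (rank == MASTER) {
        int total = 0;
        for (int i = 0; i < nprocs; ++i) { displs[i] = total; total += counts[i]; }
        all_angles.resize(total);
    }
    MPI_Gatherv(local_angles.data(), local_n, MPI_DOUBLE,
                all_angles.data(), counts.data(), displs.data(), MPI_DOUBLE,
                MASTER, MPI_COMM_WORLD);

    // The master computes the global 1st/99th percentiles on all_angles (Step 6)
    // and broadcasts them; the same pattern repeats for the concentrations (Step 10).
    double percentiles[2] = {0.0, 0.0};
    // if (rank == MASTER) { percentiles[0] = ...; percentiles[1] = ...; }
    MPI_Bcast(percentiles, 2, MPI_DOUBLE, MASTER, MPI_COMM_WORLD);
}
```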

3.3 Mode Selection Pipeline

Figure 4 shows the pipeline architecture for implementing mode selection depending on the image size and the system resources. Given the input image size, which depends on the image format and the targeted magnification (e.g., 10X, 40X, etc.), and the system resources in terms of the number of available nodes and the main memory capacity per node, the optimal execution mode is selected. For example, if the image fits in the memory of a single node, then mode 1 in the table shown in Fig. 4(c) is selected. Otherwise, mode 2 is selected and the stain normalization engine processes the image in batches. If a cluster of nodes is available, mode 3 or 4 is selected, depending on whether the data partitioned among the different nodes fits in RAM or not, respectively.
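
A minimal sketch of the selection logic corresponding to the table in Fig. 4(c) (the sizing heuristic based on the decompressed RGB size and an even split across nodes is an assumption, not our exact rule):

```cpp
#include <cstdint>

// Execution modes from the table in Fig. 4(c).
enum class Mode { SingleNodeSingleBatch = 1, SingleNodeMultiBatch = 2,
                  MultiNodeSingleBatch = 3,  MultiNodeMultiBatch = 4 };

// Pick a mode from the decompressed image size, per-node RAM, and node count.
Mode select_mode(std::uint64_t image_bytes_rgb, std::uint64_t ram_per_node,
                 int num_nodes) {
    if (num_nodes <= 1)
        return (image_bytes_rgb <= ram_per_node) ? Mode::SingleNodeSingleBatch
                                                 : Mode::SingleNodeMultiBatch;
    const std::uint64_t per_node_share = image_bytes_rgb / num_nodes;
    return (per_node_share <= ram_per_node) ? Mode::MultiNodeSingleBatch
                                            : Mode::MultiNodeMultiBatch;
}
```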

Fig. 4. High-level diagram of the mode selection pipeline depending on image size and hardware resources. Data (in table) refers to either the full WSI or the image partition assigned to the node.

4 Whole-Slide Image Datasets

To analyze the performance and scalability of our pipeline, we used H&E-stained WSIs from 4 datasets. The first one is part of TUPAC MICCAI 2016 [20] and provides breast WSIs for the prediction of tumor and proliferation scores. These WSIs are in Aperio format, single-file pyramidal tiled TIFF (.svs) with JPEG compression. The CAMELYON16 dataset [6] is part of the ISBI challenge on cancer metastasis detection in lymph nodes. These slides are in Philips format, single-file pyramidal tiled TIFF or BigTIFF (.tif) with non-standard metadata and JPEG compression. The remaining 2 datasets are proprietary, but we used them to test the flexibility of our pipeline with respect to different WSI formats. One dataset provides slides in Ventana format, single-file pyramidal tiled BigTIFF with non-standard metadata. The other dataset contains slides in Hamamatsu format, a single-file TIFF-like format (.ndpi) with proprietary metadata. All the slides include 2.5X, 10X and 40X magnifications, except for the fourth dataset, which includes only 2.5X and 10X magnifications. A de facto community standard for reading various WSI formats is the OpenSlide C library, which provides a simple interface to WSIs [14]. However, we wrote our own functions, relying on the existing API of libTIFF and the BigTIFF standard, for reading and writing the supported WSI formats; these allow significantly faster read times, as shown in Table 1, for various image formats and magnifications. In addition, our read functions have lower memory requirements than the OpenSlide library functions.
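
For illustration, tile-based reading with libTIFF follows the usual pattern sketched below; this is generic libTIFF usage (not our exact reader), it assumes a tiled pyramid stored as TIFF directories, and the vendor-specific metadata handling for the Philips, Ventana and Hamamatsu formats is omitted:

```cpp
#include <tiffio.h>
#include <cstdint>
#include <vector>

// Read one tile of a pyramid level as packed RGBA using libTIFF.
bool read_tile_rgba(const char* path, uint16_t level, uint32_t x, uint32_t y,
                    std::vector<uint32_t>& out, uint32_t& tw, uint32_t& th) {
    TIFF* tif = TIFFOpen(path, "r");
    if (!tif) return false;
    TIFFSetDirectory(tif, level);              // pyramid level = TIFF directory
    TIFFGetField(tif, TIFFTAG_TILEWIDTH, &tw);
    TIFFGetField(tif, TIFFTAG_TILELENGTH, &th);
    out.resize(static_cast<size_t>(tw) * th);  // one packed 32-bit pixel per entry
    const int ok = TIFFReadRGBATile(tif, x, y, out.data());
    TIFFClose(tif);
    return ok == 1;
}
```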

Table 1. Image read speed-up of our implementation compared to OpenSlide

5 Experimental Results

40X WSI Processing. Figure 5 shows three images generated by different scanners, which we refer to as Scanners A, B and C, respectively. The top row shows the original images and the bottom row shows the normalized versions after running the normalization pipeline of Sect. 3. Each panel in Fig. 5 corresponds to a 1024 \(\times \) 1024-pixel region of the image at 40X magnification. While the input images (top row) show significant color variation, the normalized ones show a more uniform color contrast as a result of the normalization process.

Fig. 5. Input images from different datasets and scanners (a)–(c) and normalized output images (d)–(f) after applying the stain normalization pipeline (Color figure online)

Fig. 6. Experimental results showing the performance of the proposed stain normalization pipeline on WSI from different datasets and magnification factors (Color figure online)

Single-Node Results. Figures 6(a)–(d) present measurements of the processing time of the single-node system. Figures 6(a)–(b) show the total processing time, including the time to read the images, as a function of the image size for the single- and multi-batch implementations on a double-logarithmic scale. The different colors correspond to images at 2.5X, 10X, and 40X magnification, while the different markers correspond to the different datasets. The multi-batch implementation exhibits a moderate time increase that is mainly attributed to the overhead of reading the image multiple times. Four batches were used in all single-node multi-batch measurements. Figures 6(c)–(d) show the corresponding processing time (excluding read time) as a function of the number of non-background pixels. For each magnification, the measurements scale almost linearly with the number of non-background pixels. The processing time shows an offset for higher magnifications due to the initial memory allocation and final image normalization steps, which are performed on the whole image.

Single-Node vs. Reference Results. Table 2 reports the execution time, averaged over all images of all datasets at 10X magnification, of the single-node implementation described in Sect. 3.1, compared with the original MATLAB code [19] and a reference C++ implementation based on it that uses the OpenSlide library to read the images. All measurements were collected on a single node with a 10-core Intel® i7-6950X CPU at 3 GHz and 64 GB of RAM. Our implementation achieves a speed-up factor larger than 8 for 10X images.

Table 2. MATLAB vs. single-node implementations (averaged for 10X images)

Fig. 7. Performance of distributed implementation based on the number of nodes for different image types and 40X magnification. (Color figure online)

Multi-Node Results. Figures 7(a)–(b) show the scalability results of the multi-node single-batch and multi-batch implementations, respectively. Measurements were collected on a cluster of nodes, each with an 8-core Intel® Xeon® E5-2630v3 CPU and 64 GB of RAM. We show the aggregated processing and read time as a function of the number of nodes for 40X magnification images. Each point and color represents an image and a dataset, respectively. For the single-batch implementation, the speed-ups over the single-node measurements are 1.70x (2 nodes), 2.29x (3 nodes) and 2.80x (4 nodes). For the multi-batch implementation, we measure the following average speed-ups over the single-node multi-batch measurements: 1.87x (2 nodes), 2.60x (3 nodes) and 3.30x (4 nodes). The reasons for the sub-linear scaling are the communication overhead of Blocks B and D in Fig. 3 and the aggregation of intermediate results (merging of the vectors) in Blocks 6* and 10* in Fig. 3.

Fig. 8. Euclidean distance of ODM and the relative error of the robust maximum of the individual stain concentrations for various sampling rates.

Table 3. Processing time for different sampling rates, single-node single-batch implementation (40X images)

Single-Node Pixel Sampling Results. Figures 8(a)–(b) show the Euclidean distance (E\(_\text {d}\)) of the ODM and the relative error of the robust maximum of the individual stain concentrations (maxC\(_\text {h}\) and maxC\(_\text {e}\)), respectively, between the sampling and no-sampling results. Even for a sampling rate of non-background pixels as low as 0.01%, the measured E\(_\text {d}\) values of the individual stains are almost negligible compared to typical values such as the ones reported in Table 1 of [1]. Similarly, the relative error of the robust maximum estimation drops below 1% for a sampling rate of 1%. The higher sampling rate required for the robust maximum estimation is needed to offset the error aggregated from the ODM estimation. Nevertheless, the increase in total processing time with respect to the sampling rate is negligible even for rates as large as 1%, as shown in Table 3, which reports the average processing time (without read time) for different sampling rates and for the single-node single-batch implementation without sampling. Thanks to the sampling method, the overall processing time is reduced by a factor of 1.3–9x, depending on the image type, for 40X magnification. This effectively reduces the processing time to the normalization time, making this implementation as fast as methods that apply a fixed reference normalization template regardless of the input image [16].

Best Single-Batch Results. Table 4 shows the average speed-ups, across all 40X magnification datasets, of the single-node single-batch (with 1% sampling) and the four-node single-batch (with and without sampling) implementations compared to the single-node single-batch implementation without sampling. By combining the sampling technique with the distributed pipeline, we attain average speed-ups of 3.44x and 5.19x vs. the single-node implementation with and without sampling, respectively.

Table 4. Single-batch implementations (averaged for 40X images)

6 Conclusions

We built a fast and scalable pipeline to enable large-scale stain normalization of high-resolution histopathological whole-slide images. Our pipeline uses a highly optimized low-level engine that performs the required image processing functions and is based on a distributed computing architecture that is scalable in both image size and number of computing nodes. The presented pipeline tackles the memory and runtime bottlenecks of high-magnification images and enables the preprocessing of large datasets, which is a critical prerequisite for any ML framework applied to biomedical images. Our next steps involve: (a) performance evaluation of ML frameworks applied to stain normalized images, (b) automation of the pipeline mode selection, and (c) automated batch distribution based on the number of non-background pixels for load balancing across nodes.