
1 Introduction

Automated classification of blood cells has made significant progress in recent years: novel methods and applications provide clinical experts with tools that yield quantifiable results in a short time [1, 2]. Part of this is the analysis of white blood cells (leukocytes) found in peripheral blood. Since these are relatively sparse and show great contrast to the background and to red blood cells (erythrocytes), their segmentation is straightforward. Moreover, as only mature cells are present in healthy peripheral blood, their classification is less challenging than in bone marrow, where more cell types in various degrees of maturation are present. This makes distinguishing between these cell types a challenging task even for humans.

As the distribution of cell-classes reflects the hematopoietic process of bone marrow and is severely affected by diseases such as leukemia, a tool for reliable, automated classification and counting would be a great support for medical experts. Such an application could provide impartial and reproducible results as a basis for diagnostic decisions. Even today, the tedious task of counting cells is still performed manually by medical experts.

Previous research on the classification of hematopoietic cells [1] suggested a pipeline for the classification of white blood cells in blood smear images. The authors made use of morphological features, which they extract from a segmentation of cell cytoplasm and nucleus. Due to the higher density of cells, this approach is not feasible with bone marrow images, as morphological operations for segmentation require a better separation between individual cells. Other specialized features based on automatically segmented regions described by shape, color and texture yielded a good separation of hematopoietic cells [3].

In a more recent work, convolutional neural networks have been proposed for the classification of hematopoietic cells [2]. The authors used images obtained from blood smears, in which the white cell lineage can be conveniently detected and segmented. Based on texture features, three classes of cells were separated with a Support Vector Machine, followed by a classification of three other cell types using features from a neural network and a Random Forest Classifier. This method achieves good results in blood smear images.

As opposed to research on the analysis of white blood cells in blood smear images, research on the classification of hematopoietic cells in bone marrow is rare and often only possible under certain constraints or simplifications: for example, manually segmented images of cropped cells from different maturity stages of the granulopoiesis are given as prior knowledge [4]. Based on the segmentation of nucleus and cytoplasm, it is possible to craft specialized features that make classification possible.

To the best of our knowledge, classification has not been researched in detail for a wider range of hematopoietic cells in bone marrow images without prior knowledge such as manual segmentations. In this paper, we compare and evaluate classical machine learning pipelines with feature extraction and classification for their suitability to perform this challenging task. Classification is performed on images each containing a centered hematopoietic cell – without providing any further details such as a segmentation of cytoplasm and nucleus or the removal of neighboring cells. Next to classical machine learning, we also consider modern classification methods based on convolutional neural networks.

2 Dataset

Bone marrow samples are acquired and processed by the Department of Hematology, Oncology, Hemostaseology and Stem Cell Transplantation of the University Hospital, RWTH Aachen University. By applying Pappenheim staining, cell nuclei and different types of cytoplasm can be more easily distinguished. The samples were scanned at the Institute of Pathology of the University Hospital, immersed in oil with \(63{\times }\) magnification. Medical experts of the Department of Hematology annotated representative excerpts and assigned a label to every cell and cell-like artifact.

For the following evaluations, we used patches of size \(200 \times 200\) pixels containing bone marrow hematopoietic cells. While most cells fit completely into a patch of this size, some cell types – e.g., megakaryocytes – are slightly cropped. Neighboring cells are often present as well.

Fig. 1. Three examples of each cell class as well as their sample size in the dataset.

Our dataset comprises \(2478\) patches with leukocytes, separated into the classes denoted in Fig. 1. Additionally, we have \(3331\) cell-like artifacts and non-leukocytes.

3 Methods

This paper provides a thorough analysis of various algorithms for the classification of hematopoietic cells in microscope images of stained bone marrow; these algorithms are briefly presented in the following paragraphs.

3.1 Classical Machine Learning Algorithms

Classical approaches comprise a feature extraction and a classification stage. The feature extractor provides an abstract representation of the image that the classifier uses to predict its class.

Feature Extraction. In a preliminary evaluation on a smaller subset, we compared several broadly applicable and openly available feature extractors, namely Dense SIFT, HOG, LBP and Daisy. Since we perform classification purely on image data without prior knowledge of segmentations of nucleus and cytoplasm, shape and contour features were not applicable. To have a manageable number of tests, we focused on Histogram of Oriented Gradients (HOG) [5] and Local Binary Pattern (LBP) [6] as representative, well-performing features for the final evaluation presented in this work. The Histogram of Oriented Gradients feature is computed for every color channel with 50 pixels per cell and 4 cells per block. We also applied the same descriptor to an image transformed into polar coordinates (SHOG). Local Binary Pattern is computed on every color channel as well, using the uniform variant, which is rotation and grayscale invariant.
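The following is a minimal sketch of how such per-channel HOG and uniform LBP descriptors could be computed with scikit-image for a \(200 \times 200\) RGB patch. The helper name and the LBP neighborhood parameters are illustrative assumptions, and the polar-coordinate transform used for SHOG is omitted.

```python
import numpy as np
from skimage.feature import hog, local_binary_pattern

def extract_hog_lbp(patch, pixels_per_cell=50, cells_per_block=4,
                    lbp_points=8, lbp_radius=1):
    """Per-channel HOG and uniform LBP descriptors for a 200x200 RGB patch."""
    hog_parts, lbp_parts = [], []
    for c in range(patch.shape[2]):
        channel = patch[:, :, c]
        # Histogram of Oriented Gradients on this color channel
        hog_parts.append(hog(channel,
                             pixels_per_cell=(pixels_per_cell, pixels_per_cell),
                             cells_per_block=(cells_per_block, cells_per_block)))
        # Rotation- and grayscale-invariant uniform LBP, summarized as a histogram
        codes = local_binary_pattern(channel, lbp_points, lbp_radius, method="uniform")
        hist, _ = np.histogram(codes, bins=np.arange(lbp_points + 3), density=True)
        lbp_parts.append(hist)
    return np.concatenate(hog_parts), np.concatenate(lbp_parts)
```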

In an attempt to combine classical and deep machine learning approaches, it is also possible to use the output of a convolutional neural network as a feature vector [7]. To this end, we extracted features from the output of the convolutional part of a ResNet18 and a ResNet152 (cf. Sect. 3.2) pretrained on ImageNet. Every feature descriptor was normalized to zero-mean and unit-variance prior to training.
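A sketch of how such convolutional features could be extracted with Torchvision is given below; the backbone is cut before the final fully connected layer and its pooled output is flattened into a feature vector. Variable names and the exact cut point are assumptions, and standardization to zero mean and unit variance would be applied afterwards.

```python
import torch
import torch.nn as nn
from torchvision import models

# Pretrained ResNet18; keep everything up to and including global average pooling
resnet = models.resnet18(pretrained=True)
backbone = nn.Sequential(*list(resnet.children())[:-1])  # drop the final fc layer
backbone.eval()

@torch.no_grad()
def cnn_features(batch):
    """batch: float tensor of shape (N, 3, H, W), ImageNet-normalized."""
    # (N, 512) for ResNet18, (N, 2048) for ResNet152
    return backbone(batch).flatten(1)
```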

Classification. Based on feature descriptors, a classifier is able to separate two or more classes. Each of the features presented in Sect. 3.1 is fitted using a Linear Support Vector Machine (SVM) [8], a Radial Basis Function Support Vector Machine (RBF) [9], a Random Forest Classifier (RFC) [10], and the AdaBoost Classifier (ABC) [11]. In addition to annotated training data, those algorithms require the user to define several hyper-parameters such as a kernel and a penalty parameter for SVMs or the number of decision trees in a Random Forest.
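A sketch of fitting these four classifiers with scikit-learn is shown below. The feature matrix `X_train` and label vector `y_train` are placeholders, and the hyper-parameter values shown are illustrative defaults rather than the optimized settings.

```python
from sklearn.svm import LinearSVC, SVC
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

classifiers = {
    "SVM": LinearSVC(C=1.0, class_weight="balanced"),
    "RBF": SVC(kernel="rbf", C=1.0, gamma="scale", class_weight="balanced"),
    "RFC": RandomForestClassifier(n_estimators=200, class_weight="balanced"),
    "ABC": AdaBoostClassifier(n_estimators=100, learning_rate=1.0),
}

# X_train: feature matrix (n_samples, n_features), y_train: cell-class labels
fitted = {name: make_pipeline(StandardScaler(), clf).fit(X_train, y_train)
          for name, clf in classifiers.items()}
```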

3.2 Deep Neural Networks

Neural networks with a large number of parameters have become a powerful tool for classification in all kinds of medical image processing and analysis tasks [7]. A deep neural network in its simplest form has several convolutional layers (the output of which can be used as a feature vector as mentioned above) followed by a few fully connected layers (serving as the classification part of the network). A popular model for image classification tasks is the ResNet, a deep residual network [12]. It is common practice to initialize the weights of these networks based on a pre-training on a larger image database such as ImageNet. This not only drastically reduces the amount of training data necessary for successful network training, but also leads to quicker convergence.

For our evaluation, we chose two variants of ResNet: ResNet18 and ResNet152. As with the training of classical classifiers, a set of hyper-parameters needs to be fine-tuned to achieve high precision. These include, for example, learning rate and data batch size.
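A minimal sketch of how the two pretrained variants could be adapted to the cell classes is given below, assuming the standard Torchvision models with the final fully connected layer replaced; the helper name and argument layout are illustrative.

```python
import torch.nn as nn
from torchvision import models

def build_model(arch, num_classes):
    """Load an ImageNet-pretrained ResNet and replace its classification head."""
    if arch == "resnet18":
        model = models.resnet18(pretrained=True)
    else:
        model = models.resnet152(pretrained=True)
    # New fully connected layer sized to the number of hematopoietic cell classes
    model.fc = nn.Linear(model.fc.in_features, num_classes)
    return model
```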

We used an Adam optimizer for training these networks, as it showed superior performance compared to Stochastic Gradient Descent (SGD) in preliminary tests. Furthermore, we turn one crossvalidation fold of the training set into a validation set to use the validation error as an early stopping criterion.
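The training loop could then look roughly as follows, with Adam as the optimizer and the validation loss on the held-out fold driving early stopping. Learning rate, patience, and the loader names are assumptions, not the exact settings used.

```python
import copy
import torch

def train(model, train_loader, val_loader, max_epochs=100, patience=10, lr=1e-4):
    criterion = torch.nn.CrossEntropyLoss()
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    best_loss, best_state, stale = float("inf"), None, 0
    for epoch in range(max_epochs):
        model.train()
        for x, y in train_loader:
            optimizer.zero_grad()
            loss = criterion(model(x), y)
            loss.backward()
            optimizer.step()
        # Early stopping: monitor the loss on the held-out validation fold
        model.eval()
        with torch.no_grad():
            val_loss = sum(criterion(model(x), y).item() for x, y in val_loader)
        if val_loss < best_loss:
            best_loss, best_state, stale = val_loss, copy.deepcopy(model.state_dict()), 0
        else:
            stale += 1
            if stale >= patience:
                break
    model.load_state_dict(best_state)
    return model
```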

4 Experimental Setup

Each classifier is evaluated in several experiments: binary classification between white blood cells and other cells (experiment named leukocytes), classification of the grade of maturity of neutrophilic cells (ignoring other cell types, experiment named neutrophilic cells), and classification between all mentioned cell types (ignoring non-leukocytes, experiment named all).

For each test, we performed a three-fold crossvalidation to obtain the overall results. In every fold, we performed hyper-parameter optimization with six-fold crossvalidation as an inner loop. We also performed very basic data augmentation by using random rotation and a slight translation of up to 10 pixels.
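The augmentation can be expressed with Torchvision transforms as in the sketch below, assuming arbitrary rotation angles and a translation of at most 10 pixels in a \(200 \times 200\) patch (i.e., a fraction of 0.05); the exact sampling ranges are assumptions.

```python
from torchvision import transforms

# Random rotation plus a translation of at most 10 px in a 200 x 200 patch (10/200 = 0.05)
augment = transforms.Compose([
    transforms.RandomRotation(degrees=180),
    transforms.RandomAffine(degrees=0, translate=(0.05, 0.05)),
    transforms.ToTensor(),
])
```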

Hyper-parameter Optimization. Every classifier was optimized in terms of its hyper-parameters by using random search over an appropriate range of parameter values. While we were able to test \(150\) hyper-parameter sets with the classical models, time constraints only allowed the evaluation of \(10\) sets with the deep learning approaches.

We optimized the linear SVM with respect to the penalty term and the class weights, and the radial SVM additionally with respect to the gamma term. For Random Forest we considered the number of estimators, class weights, and maximum depth, and for AdaBoost the number of estimators and learning rate. All of these hyper-parameter values were sampled randomly rather than optimized algorithmically. We selected these particular sets of parameters for optimization due to their significant impact on training process and classifier quality.
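One possible way to set up this random search with scikit-learn's RandomizedSearchCV is sketched below, using the six-fold inner cross-validation and \(150\) sampled parameter sets; the concrete distributions and ranges are illustrative assumptions, and `m_scorer` refers to the custom m-score sketched in the next paragraph.

```python
from scipy.stats import loguniform, randint
from sklearn.model_selection import RandomizedSearchCV
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier

search_spaces = {
    "RBF": (SVC(kernel="rbf"), {"C": loguniform(1e-2, 1e3),
                                "gamma": loguniform(1e-4, 1e0),
                                "class_weight": [None, "balanced"]}),
    "RFC": (RandomForestClassifier(), {"n_estimators": randint(50, 500),
                                       "max_depth": randint(3, 30),
                                       "class_weight": [None, "balanced"]}),
}

searches = {name: RandomizedSearchCV(est, dist, n_iter=150, cv=6,
                                     scoring=m_scorer)  # custom m-score, see below
            for name, (est, dist) in search_spaces.items()}
```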

m-score. A significant class imbalance renders the commonly used accuracy score unsuitable, as smaller classes can be completely misclassified while the overall accuracy remains high. To mitigate this, we used the sum of the logarithmic per-class F1-scores, called m-score in the remainder of this paper, as a measure for the selection of the best performing hyper-parameter set: \(m = \sum _{n=1}^{N} \log {(\text {f1}_n)}\) with \(N\) classes and \(\text {f1}_n\) the F1-score of class \(n\). To prevent negative infinity if at least one class has an F1-score of zero, we bounded the individual F1-scores from below at \(0.01\).
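A minimal sketch of the m-score as a scikit-learn scorer follows directly from this definition; the function names are illustrative.

```python
import numpy as np
from sklearn.metrics import f1_score, make_scorer

def m_score(y_true, y_pred):
    """Sum of log per-class F1-scores, with each F1 bounded below at 0.01."""
    per_class_f1 = f1_score(y_true, y_pred, average=None)
    return float(np.sum(np.log(np.maximum(per_class_f1, 0.01))))

m_scorer = make_scorer(m_score)  # higher (less negative) is better
```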

Implementation. The project is implemented in Python, using Scikit-Learn for the classical pipeline and Pytorch for the deep neural networks. Both pretrained models were obtained from the Torchvision package. Each model was trained on a single GeForce GTX 1080 Ti GPU.

5 Results

The results are shown in Fig. 2: at first glance it is obvious that all features, classifiers and networks achieve the best results in the binary classification task of distinguishing between leukocytes and cell-like artifacts. Particularly the commonly used image features HOG and LBP reach acceptable scores. Using the outputs of the ResNets' convolutional parts as features did not perform equally well. Training a deep neural network directly, however, yielded excellent results.

Fig. 2. (a)–(c): results of all three experiments. F1-scores are marked with a square (bottom horizontal axis), the m-score is marked with an ‘x’ (top horizontal axis). Grouped by feature descriptor, from bottom to top: the classifiers SVM (blue), RBF (red), ABC (green), RFC (orange). At the top: ResNet18 (lower) and ResNet152 (upper). (d): confusion matrix for ResNet18 in the neutrophilic cells experiment. (Color figure online)

Likewise, the neural networks performed significantly better than the classical methods in both the neutrophilic cells and the all experiment. Not only did they achieve higher scores, their individual class scores were also closer together. Both experiments are demanding for general-purpose descriptors and standard classification algorithms: in almost every single test, at least one class – usually one of the underrepresented classes – is completely misclassified. The neural networks, however, still provide valuable results.

The confusion matrix corresponding to the classification of neutrophilic cells with a ResNet18 shows that most errors are off by just one class when the classes are ordered along the ongoing maturation process.

6 Discussion

The lower performance of ResNet features combined with classical classifiers might be explained by the fact that it was pre-trained on ImageNet – a database of images very different from microscope images.

Considering that the feature descriptor has a larger impact than the choice of classification algorithm, one advantage of deep learning becomes visible: not only is the classifier adapted to the specific problem, but also the part of the neural network that corresponds to feature extraction. This is in line with previous work, where researchers hand-crafted more descriptive features – a task that is inherently covered by neural networks. This is also supported by the confusion matrix, which indicates that the residual network learned significant, descriptive features for this special use-case – even though no prior information about the order of cell classes was available during training.

It can also be inferred from the results that neural networks are better suited for the classification of multiple classes: they still achieved superior results in these cases compared to classical machine learning methods. Even the shallower ResNet18 yielded sufficient results without requiring further layers.

We are aware that better performance could be achieved with manual fine-tuning and hand-crafted features, as several other researchers have successfully shown, at least for peripheral blood images. However, the same can be said for deep learning approaches: custom loss functions, smarter data augmentation, dedicated network architectures, additional training time, and optimization of hyper-parameters can surely further increase performance.

Furthermore, this study did not evaluate the amount of human error in the ground truth annotations. Further research on inter- and intra-rater reliability is needed.

7 Conclusion and Outlook

In this paper, we provide a comparison of several common approaches for the classification of hematopoietic cells. We considered a wide range of white cells in different stages of maturity in the bone marrow, which is a very challenging scenario. We showed that while most classical machine learning algorithms perform poorly, deep neural networks yield promising results.

Particularly the fact that they perform better in multi-class classification is important for the automated analysis of bone marrow smear images, which show an increased number of cell types. It stands to reason that more specialized architectures will be able to yield results accurate enough for medical analysis based on images of bone marrow instead of peripheral blood smears. This paper provides a baseline evaluation for future improvements to such neural networks.