1 Introduction

This study aims to leverage deep learning (DL) and DCNNs [1,2,3,4,5,6,8,19,20] for prescreening of Lyme disease [9,10,11,12,13,14,15]. Lyme disease is the most common vector-borne disease in the United States, with over 300,000 new cases annually. Borrelia burgdorferi is the causative bacterial agent of Lyme disease, and it is transmitted into the skin of the affected individual through the bite of an infected tick. Infection progresses through three stages, advancing from skin-limited disease to disseminated disease affecting the nervous, cardiac, and rheumatologic systems. In the majority of cases, the initial skin infection is manifested by a round or oval red skin lesion called erythema migrans (EM), which is a direct result of bacterial infection of the skin and marks the first stage of Lyme disease. Treatment with oral antibiotics is highly effective in early, uncomplicated cases. Therefore, recognition of EM is crucial to early diagnosis and treatment and, ultimately, to prevention of potentially devastating long-term complications.

Erythema migrans typically occurs 1 to 3 weeks after the initial tick bite and expands centrifugally by as much as a centimeter per day. Classically, the lesion also displays central clearing as it expands, leading to the hallmark bull’s-eye rash of Lyme disease. However, many individuals do not display this finding, and the majority of individuals are unable to recall a tick bite, making early diagnosis challenging. EM usually persists for weeks, during which its visual recognition is the primary basis for the clinical diagnosis of early Lyme disease. Following this early period, untreated EM usually disappears or progresses to disseminated disease through spread of the infection via the bloodstream. Diagnosis of early Lyme disease is usually made based on clinical signs and symptoms and history of potential exposure to ticks, owing to the lack of reliable serologic blood testing early in the disease course [9, 10]. Blood tests are insensitive during the early phase of infection and are not recommended because of the high false-negative rate at the time of initial EM presentation; only 25 to 40% of patients will have positive results during the acute phase of infection. Direct detection of bacteria in blood or biopsy samples can be performed, but such testing is generally unavailable in non-research settings and is not practical due to the time required for results [11].

The clinical diagnosis of early Lyme disease and EM remains a challenge because EM may take on a variety of appearances besides the characteristic ring-within-a-ring, or bull’s-eye, rash. The majority (80%) of EM lesions in the US lack the central clearing [13] of the stereotypical bull’s-eye lesion and appear uniformly red or bluish red (Fig. 1). Thus, they are often mistaken for a spider bite or bruise. A small percentage (4–8%) of skin lesions have a small central blister, which may lead to the incorrect diagnosis of shingles (herpes zoster) [14]. Approximately 20% of patients have multiple skin lesions arising from the spread of infection through the bloodstream, which often have an atypical appearance. Atypical skin lesions are often misdiagnosed, which results in delayed diagnosis and treatment and increases the risk of long-term complications.

Fig. 1. Examples of EM with atypical (top) and classic bull’s-eye (bottom) presentations (sources: left: https://commons.wikimedia.org/wiki/Category:Erythema_migrans; right: JHU).

Previous studies have shown that the general population does not correctly identify EM skin lesions that lack the classic bull’s-eye appearance, misidentifying this condition approximately 80% of the time. As 80% of skin lesions do not have the bull’s-eye appearance [15], approximately 64% of all EM lesions (80% of 80%) may be misdiagnosed by patients. Machine-based prescreening of skin lesions associated with Lyme disease has the potential to identify a high percentage of both typical and atypical lesions, thereby decreasing the incidence of misdiagnosis of early Lyme disease.

Prior to 2012 and the demonstration of a significant improvement in object recognition performance on ImageNet through the use of DCNNs (AlexNet [4]), object classification in computer vision was largely based on applying traditional classifiers to hand-engineered image features [18]. DCNNs have since replaced these approaches for both computer vision and medical imaging tasks (e.g. [1, 2]), and recently they have been used successfully for a number of medical imaging diagnostics, including identifying skin cancer [12]. To the best of our knowledge, however, Lyme disease detection from skin lesions has thus far been addressed only with classical ML approaches [16].

This study aims to expand on the prior state of the art with the following novel and salient contributions: (a) we develop a novel, carefully clinician-annotated dataset called Lyme1600, which includes over 1600 images with several types of fine-grained annotations for skin lesions, mostly focused on EM but also including other confuser lesions and clear/unaffected cases; this dataset is over an order of magnitude larger than prior non-public datasets (such as the 143-image dataset of [16]); and (b) we develop a baseline DCNN approach that achieves a significant performance improvement over the prior state of the art and demonstrates substantial agreement with human clinician annotations. We make the DCNN model for this classifier publicly available; it can potentially be used by others for fine-tuning and transfer learning to address classification of other types of skin conditions, including skin cancer lesions.

2 Methods

Problem Statement:

We pose the problem as a 2-class classification problem: images are classified as showing EM (Lyme disease) vs. showing no skin lesion or another skin condition, including confounding skin lesions. The main confusers considered in this second class include cases of herpes zoster (HZ), also known as shingles. HZ was used as the principal confuser on the rationale that the main application envisioned here is a pre-screening tool, possibly implemented as a smartphone application, that could help individuals self-identify and screen lesions suspicious for Lyme disease. An acute-onset rash, such as HZ, might prompt an individual to suspect Lyme disease and seek medical attention; this application is targeted toward such individuals, for whom it would provide a means of disambiguation.

Data

Because no annotated, publicly available dataset exists for the study of machine prescreening of Lyme disease and EM, and because there is a paucity of clinical images with the consent and approval required for use in this research, an image dataset was created using publicly available images extracted from the web. This strategy was motivated by a recent study [12] on skin cancer in which online images, after careful annotation, were also successfully leveraged for generating DL classification models of referable skin cancers. The online images of skin lesions leveraged in this study principally include EM, herpes zoster, other non-Lyme skin lesions, and normal skin. Such images were mined from online sources, after which clinicians (J.A., A.R., and E.N.) were tasked with carefully annotating them based on the visual appearance and the estimated size of the skin lesions. Clinicians first performed a whole-image classification using a high-level labeling of the pathology, followed by a fine-grained annotation that included the specific type of EM present (e.g. simple vs. diffuse). Additional curation steps included machine-based removal of full or near duplicates, followed by human assessment for remaining duplicates and removal of inappropriate images. Following this, a subset of images was selected to include images with a moderate to high probability of depicting EM or herpes zoster (and other confounding skin lesions); images with a low probability of an EM or HZ diagnosis were excluded from the dataset. Finally, a 2-class partitioning of those images into affected (C0) and unaffected (C1) classes was performed (Table 1).
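The machine-based near-duplicate removal step can be approximated with perceptual hashing, as in the minimal sketch below; the use of the ImageHash library, the hash type, and the distance threshold are illustrative assumptions rather than details reported here.

```python
from pathlib import Path
from PIL import Image
import imagehash

# Hypothetical near-duplicate filter: an image whose perceptual hash is within
# `max_distance` bits of an already-kept image is treated as a duplicate.
def deduplicate(image_dir, max_distance=5):
    kept, kept_hashes = [], []
    for path in sorted(Path(image_dir).glob("*.jpg")):
        h = imagehash.phash(Image.open(path))
        if all(h - prev > max_distance for prev in kept_hashes):
            kept_hashes.append(h)
            kept.append(path)
    return kept  # surviving images still undergo human review for duplicates
```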

Table 1. Class balance and dataset characteristics

DL Approach:

Recent advances in DL performance have been realized through a number of factors, including the development of large labeled datasets, the availability of markedly increased computational power via graphics processing units, and various algorithmic improvements. DCNNs, used here, form feature representations at increasing levels of abstraction via multiple layers of processing [1, 2] and solve discriminative problems (e.g. classification). Here, a DCNN takes a skin image as input and outputs probabilities that the image belongs to one of several specific classes of pathologies (EM vs. no EM here). Our study uses the ResNet50 [8] DCNN architecture. ResNet was originally conceived as a means of producing deeper networks and includes specific design patterns, such as bottleneck blocks and skip connections, that make the outputs of upstream layers directly available to downstream layers. Our implementation used the Keras and TensorFlow frameworks. We used transfer learning and fine-tuned the original ResNet50 weights on the skin classification problem addressed herein. We trained with stochastic gradient descent with Nesterov momentum of 0.9 and an initial learning rate of 1E-3. The training scheme used an early-stopping approach, which terminates training after 10 epochs with no improvement of the validation accuracy. We used a categorical cross-entropy loss function. Dynamic learning rate scheduling was also used, in which the learning rate was multiplied by 0.5 when the training loss did not improve for 10 epochs. A batch size of 32 was used. Data augmentation included horizontal flipping, blurring, sharpening, and changes to saturation, brightness, contrast, and color balance. We are making the DCNN model, with trained weights, available at https://github.com/neil454/lyme-1600-model.
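A minimal sketch of this fine-tuning setup, assuming a standard Keras/TensorFlow workflow, is shown below; the input size, classification head, and choice of built-in augmentation layers are our illustrative assumptions rather than the exact published configuration (blurring, sharpening, and color-balance changes would require an external augmentation library).

```python
import tensorflow as tf
from tensorflow.keras import layers, models, optimizers, callbacks
from tensorflow.keras.applications import ResNet50

NUM_CLASSES = 2            # EM vs. no EM
IMG_SHAPE = (224, 224, 3)  # assumed input size

# Light augmentation stand-in (active only during training).
augment = tf.keras.Sequential([
    layers.RandomFlip("horizontal"),
    layers.RandomContrast(0.2),
])

# ImageNet-pretrained ResNet50 backbone with a new 2-class softmax head.
# NOTE: ResNet50-specific pixel preprocessing is omitted for brevity.
base = ResNet50(weights="imagenet", include_top=False, pooling="avg",
                input_shape=IMG_SHAPE)
inputs = layers.Input(shape=IMG_SHAPE)
x = augment(inputs)
x = base(x)
outputs = layers.Dense(NUM_CLASSES, activation="softmax")(x)
model = models.Model(inputs, outputs)

model.compile(
    optimizer=optimizers.SGD(learning_rate=1e-3, momentum=0.9, nesterov=True),
    loss="categorical_crossentropy",
    metrics=["accuracy"],
)

cbs = [
    # stop after 10 epochs with no improvement in validation accuracy
    callbacks.EarlyStopping(monitor="val_accuracy", patience=10,
                            restore_best_weights=True),
    # halve the learning rate when the training loss plateaus for 10 epochs
    callbacks.ReduceLROnPlateau(monitor="loss", factor=0.5, patience=10),
]

# model.fit(train_ds, validation_data=val_ds, batch_size=32, callbacks=cbs)
```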

N-Fold Validation:

The datasets were further subdivided into training and testing subsets. We used K-fold cross-validation with K = 5, where four folds were employed for training and one fold was used for testing (with rotation of the folds over 5 runs). One training fold was further subdivided equally into two parts, one of which was used for validation and stopping conditions. In sum, the train/validation/test partition was 70%/10%/20%, respectively.
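One way to realize this partitioning, assuming scikit-learn and stratification by class label (the function and variable names below are hypothetical), is:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, train_test_split

# Hypothetical 5-fold partitioning yielding a 70%/10%/20% train/validation/test
# split per run, as described above.
def five_fold_splits(labels, seed=0):
    labels = np.asarray(labels)
    skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=seed)
    for train_idx, test_idx in skf.split(np.zeros(len(labels)), labels):
        # carve half of one fold (12.5% of the 80% training portion, i.e. 10%
        # overall) out of the training indices for validation / early stopping
        train_idx, val_idx = train_test_split(
            train_idx, test_size=0.125, stratify=labels[train_idx],
            random_state=seed)
        yield train_idx, val_idx, test_idx
```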

Performance Metrics:

The performance metrics used in this study included accuracy, F1, sensitivity, specificity, PPV (positive predictive value), NPV (negative predictive value), and the kappa score, which discounts chance agreement [7]. Since any classifier trades off sensitivity against specificity, we compared methods using ROC (receiver operating characteristic) curves, which plot detection probability (sensitivity) vs. false alarm rate (100% − specificity), and computed the AUC (area under the curve).
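These metrics can be computed from per-image predicted probabilities as sketched below; the use of scikit-learn and the function and variable names are assumptions made for illustration.

```python
import numpy as np
from sklearn.metrics import (accuracy_score, f1_score, cohen_kappa_score,
                             confusion_matrix, roc_curve, roc_auc_score)

# Illustrative computation of the reported metrics from predicted probabilities
# of the positive (EM) class; y_true holds 0/1 ground-truth labels.
def report_metrics(y_true, y_prob, threshold=0.5):
    y_pred = (np.asarray(y_prob) >= threshold).astype(int)
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    metrics = {
        "accuracy": accuracy_score(y_true, y_pred),
        "F1": f1_score(y_true, y_pred),
        "sensitivity": tp / (tp + fn),   # detection probability
        "specificity": tn / (tn + fp),   # 1 - false alarm rate
        "PPV": tp / (tp + fp),
        "NPV": tn / (tn + fn),
        "kappa": cohen_kappa_score(y_true, y_pred),
        "AUC": roc_auc_score(y_true, y_prob),
    }
    fpr, tpr, _ = roc_curve(y_true, y_prob)  # points for the ROC plot
    return metrics, (fpr, tpr)
```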

3 Results

Results of experiments are shown for applying the above method to the data partitioned using 5-fold cross-validation. Table 1 shows the class partitions, Table 2 the resulting metrics, and Fig. 2 the resulting ROC curve. The results show a promising accuracy of 93.04%. The ROC curve shows that one can operate at 90% sensitivity and above while maintaining specificity in the 75% to 85% range, a tradeoff that suggests potential for deployment as a pre-screener. A kappa score of 0.7549 also demonstrates substantial agreement with the human-annotated gold standard.
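The threshold corresponding to such an operating point can be read directly off the ROC curve; the helper below is an illustrative sketch (the function and variable names are our own, reusing y_true and y_prob from the metrics sketch above) rather than part of the published pipeline.

```python
import numpy as np
from sklearn.metrics import roc_curve

# Hypothetical helper for choosing a pre-screener operating point: find the
# first threshold reaching the target sensitivity and report the specificity
# obtained there.
def operating_point(y_true, y_prob, target_sensitivity=0.90):
    fpr, tpr, thresholds = roc_curve(y_true, y_prob)
    idx = np.argmax(tpr >= target_sensitivity)  # first ROC point meeting the target
    return thresholds[idx], tpr[idx], 1.0 - fpr[idx]  # threshold, sensitivity, specificity
```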

Table 2. Performance metrics for five-fold cross validation
Fig. 2. ROC curve for the proposed pre-screener.

4 Discussion

Datasets of EM rashes with annotations for research or teaching purposes are not currently widely available. Only one large study of EM rash characteristics in the United States has been conducted, dating from 2002. Physician review of images in that dataset reported an unexpected diversity in the appearance of EM lesions, with only 10% of lesions having the classic central clearing and ring-within-a-ring target appearance [17]. The photos of EM lesions from that study were not analyzed further using computerized approaches. To our knowledge, only one other study of computer-assisted detection of EM has been reported in the literature [16]. That study used machine learning methods including boosting, SVM, naïve Bayes, and neural networks (but not DL) applied to hand-designed image features and was tested on a smaller dataset of 143 EM rash images; reported accuracies ranged from 69.23% to 80.42%. These results attest to the difficulty of discerning among the varied presentations of EM lesions. By comparison, our results, obtained on a much larger dataset of images taken ‘in the wild’, show notable performance improvements.

Because publicly available labeled datasets for EM ML studies are lacking, the use of photographs from online image banks was necessary in this study to obtain an adequate number of images, particularly for a less common condition such as erythema migrans. In doing so, our work followed the approach of a recently published high-impact study investigating detection of skin cancer using DCNNs [12], which also exploited online images to produce a curated training dataset and corresponding model. While our dataset is still being extended with new types of confounding pathologies and lesions, such as tinea corporis, our goal is to release it once procurement of all examples of confounding lesions and all annotation are complete. In the meantime, we are making the classification model available online.

One limitation of the current dataset is that individuals with dark skin are underrepresented. In addition, certain characteristics inherent to online images, such as variability in viewpoint/angle, lighting, and photo resolution, made the problem more challenging. At annotation time, the inability to inspect the skin lesion at different angles or magnifications in order to estimate its size was an issue in some cases. However, images for which there was significant ambiguity or uncertainty in diagnosis due to these factors were excluded. We were also limited in our ability to verify diagnoses through corroborating clinical and laboratory data. This limitation is mitigated by the fact that the diagnosis of both EM and the principal confuser considered here, HZ, is primarily clinical; that is, the diagnosis of these conditions relies primarily on visual inspection and suspicion. There is no universally accepted “gold standard” diagnostic test for Lyme disease, given the variable reliability of serologic testing and the impracticality of culture identification of the organism in the clinical setting. Meanwhile, the gold standard for diagnosis of herpes zoster consists of PCR or culture detection of varicella zoster virus from skin lesions, but this is usually not performed given the characteristic clinical appearance and symptoms associated with the rash.

In sum, considering all of the elements above, our study substantially advances the state of the art in automated Lyme prescreening with DL models that hold significant promise for clinical deployment as pre-screeners. Such an application would be of great utility given the challenges of diagnosing Lyme disease at an early stage, when treatment is effective and can prevent the otherwise serious long-term complications associated with advanced Lyme disease. Based on our results, an application using DL is likely more sensitive than patient self-assessment and may even be more accurate than diagnosis by a general non-specialist physician, who would ordinarily serve as the screening gatekeeper for acute-onset rashes such as EM. Given the frequent under-diagnosis of EM, automated detection would be beneficial by increasing the number of patients who seek further medical assessment for EM rashes and by minimizing the number of cases that go unevaluated and undiagnosed, with an expected positive effect on patient morbidity. Future work will involve studying multi-class problems, such as separately identifying HZ and the other confounding classes, which may also lead to improved performance on the 2-class EM problem.

5 Conclusion

We make several contributions to automated EM and Lyme disease detection: we develop the first large, carefully clinician-annotated dataset for the study of ML-based diagnostics of Lyme disease, including affected, confuser, and control images. We also propose a pre-screener for EM using DCNNs that shows substantial agreement with expert human clinician gold-standard annotations, and we make this model publicly available.