1 Introduction

Colorectal cancer is a malignant tumour that develops on the inner wall of the large intestine (colon) or rectum [2]. The main reasons for studying this disease are the number of cases and its mortality. A study by the International Agency for Research on Cancer (IARC) identified colorectal cancer as the third most common cancer in men (746,000 cases) and the second most common in women (614,000 cases). The number of deaths was 694,000, and 52% of them occurred in less developed regions of the world [17]. Colorectal cancer can be diagnosed through sigmoidoscopy or colonoscopy. The diagnosis is confirmed by biopsies of tissue stained with hematoxylin and eosin (H&E) and analyzed microscopically by pathologists.

The main difficulty in medical diagnosis is evaluating the severity of abnormal findings when opinions differ, both between observers (inter-observer) and for the same observer (intra-observer) [7, 10]. This fact has motivated the development of systems known as computer-aided diagnosis (CAD) [36] to support specialists in research and decision-making. A common challenge in proposals of CAD systems is to indicate the best combination of selection and classification algorithms to achieve the highest success rates using the smallest number of features [10, 15]. In this context, solutions obtained from metaheuristic models have been relevant for different kinds of medical images. Techniques inspired by analogies found in nature or in evolutionary processes are worth mentioning, such as the methods based on genetic algorithms (GA) for the diagnosis of esophageal cancer [28], lung cancer, brain tumors, prostate cancer and leukemia [21].

A GA is a metaheuristic widely known in the literature. Its main advantage over other evolutionary strategies is a structure that can represent plausible new organizational forms (individuals) derived from a successful previous organizational construct (crossover) [5] without losing critical information about the problem [26, 34]. Despite the different strategies that use genetic algorithms for the study and development of CAD, such as the diagnosis of cardiac diseases [3] and lung cancer [24], the models available in the literature have not explored the method to determine the best combination of features, selection algorithms and classifiers [13, 22] in the context of histological images and the diagnosis of colorectal cancer. Therefore, in this work we present a GA-based method capable of analyzing a significant number of features obtained from fractal techniques, Haralick texture features and curvelet coefficients, as well as selection methods and classifiers, in order to indicate an acceptable solution for the diagnosis of colorectal cancer. This type of study contributes to the literature on the theme, especially to the development and improvement of CAD systems. The main contributions of the proposal are:

  1. A method based on a genetic algorithm capable of analyzing a significant number of features, selection methods and classifiers for the study and pattern recognition of colorectal cancer;

  2. An approach capable of indicating the best features to separate benign and malignant colorectal cancer groups;

  3. Information about methods and features that supports the development and enhancement of CAD systems.

2 Methodology

Each individual (the bearer of a genetic code) was defined as a chromosome structure composed of four genes, represented by integers. The information stored in the genes defines a specific combination: the individual’s identification (\(G_{id}\)), the selection method (\(G_{sel}\)), the classification method (\(G_{clf}\)) and the number of features considered in the classification process (\(G_{num}\)). The initial values assigned to the \(G_{sel}\), \(G_{clf}\) and \(G_{num}\) genes were random. This structure is illustrated in Fig. 1. A population was defined from this structure. Each combination is unique and associated with an identifier \(G_{id}\) to define an acceptable solution.

Fig. 1. Chromosome structure defined to represent an individual (\(G_{id}\))

It is important to emphasize that each gene \(G_{sel}\) identifies a method capable of ranking the most significant features for distinguishing the datasets under investigation. The explored methods were: T-statistics [11], information gain [9], relief [23], gain ratio [9] and chi-squared [37]. The features were evaluated by the classifier indicated in the gene \(G_{clf}\): decision tree [32], J48 [29], random tree [4], random forest [4], multilayer perceptron [12], support vector machine (SVM) [33], K-nearest neighbors (KNN) [14] and KStar (K*) [8]. These techniques were applied to each training set and tested using k-fold cross-validation, with \(k=10\).
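As an illustration, the chromosome and its random initialization could be encoded as in the minimal sketch below. The class and function names, the ordering of the method lists, the lower bound of 1 for \(G_{num}\) and the value `max_f=100` are assumptions made for the example; the sketch also does not enforce the uniqueness of each combination.

```python
import random
from dataclasses import dataclass

# Method lists taken from the text; their ordering here is an assumption.
SELECTION_METHODS = ["t-statistics", "information gain", "relief",
                     "gain ratio", "chi-squared"]
CLASSIFIERS = ["decision tree", "J48", "random tree", "random forest",
               "multilayer perceptron", "SVM", "KNN", "KStar"]


@dataclass
class Individual:
    g_id: int   # identifier of the individual
    g_sel: int  # index into SELECTION_METHODS
    g_clf: int  # index into CLASSIFIERS
    g_num: int  # number of top-ranked features used for classification


def random_individual(g_id, max_f, rng):
    """Draw G_sel, G_clf and G_num at random (assumes 1 <= G_num <= max_f)."""
    return Individual(
        g_id=g_id,
        g_sel=rng.randrange(len(SELECTION_METHODS)),
        g_clf=rng.randrange(len(CLASSIFIERS)),
        g_num=rng.randint(1, max_f),
    )


def initial_population(size, max_f, seed=None):
    """Build a population of `size` randomly initialized individuals."""
    rng = random.Random(seed)
    return [random_individual(i, max_f, rng) for i in range(size)]


# Hypothetical parameter values, for illustration only.
population = initial_population(size=50, max_f=100, seed=0)
```

Storing indexes rather than method names keeps every gene an integer, as in the chromosome described above.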

The model requires some input parameters to define the best association: population size (P), maximum number of generations or iterations (Iter), selection threshold (t), which determines who is selected for reproduction (crossover), genetic mutation probability (m) and maximum number of features (MaxF) defined from the initial set of features. The MaxF parameter limits the number of features that constitute an individual \(G_{id}\). Given the input parameters, the proposed method processes the information through population evaluation, selection of the fittest individuals, reproduction (crossover) and mutation.

2.1 Population Evaluation and Selection

Population evaluation consists of calculating the mean accuracy (the fitness function) produced by each individual, based on the selection and classification techniques indicated in its genes. For each individual, the selection method \(G_{sel}\) was applied to each of the k training folds, and the results were the indexes of the N best features, where N is the value drawn for the \(G_{num}\) gene.

The classifier indicated in \(G_{clf}\) was trained on the selected N features. This process was performed for each training file constructed by the cross-validation technique with \(k = 10\). The classification was then executed on each corresponding test file, composed of the chosen features. Therefore, for each individual \(G_{id}\), L accuracy values were obtained, one for each of the k training/test pairs (that is, \(L = k\)). The average accuracy \(MeanAcc(G_{id})\) was calculated by applying Eq. 1:

$$\begin{aligned} MeanAcc(G_{id}) = \frac{\sum _{i=1}^{L} Acc(G_{id})(i)}{L}. \end{aligned}$$
(1)
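The evaluation of one individual according to Eq. 1 can be sketched as follows. This is a minimal illustration, not the implementation used in this work: `rank_features` and `fit_predict` are hypothetical callables standing in for the selection method and the classifier, and the folds are built by simple interleaving, which is an assumption.

```python
def k_fold_indices(n_samples, k=10):
    """Split sample indexes into k (train, test) pairs by interleaving."""
    folds = [list(range(i, n_samples, k)) for i in range(k)]
    splits = []
    for test in folds:
        held_out = set(test)
        train = [j for j in range(n_samples) if j not in held_out]
        splits.append((train, test))
    return splits


def mean_accuracy(X, y, rank_features, fit_predict, n_features, k=10):
    """MeanAcc of Eq. 1: average test accuracy over the k folds."""
    accuracies = []
    for train, test in k_fold_indices(len(y), k):
        X_train = [X[i] for i in train]
        y_train = [y[i] for i in train]
        X_test = [X[i] for i in test]
        # Rank features on the training fold only and keep the N best.
        best = rank_features(X_train, y_train)[:n_features]
        X_train_sel = [[row[j] for j in best] for row in X_train]
        X_test_sel = [[row[j] for j in best] for row in X_test]
        predicted = fit_predict(X_train_sel, y_train, X_test_sel)
        hits = sum(p == y[i] for p, i in zip(predicted, test))
        accuracies.append(hits / len(test))
    return sum(accuracies) / len(accuracies)
```

Ranking the features inside each fold, rather than once on the whole dataset, keeps the test samples out of the selection step.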

The natural selection behavior proposed by Darwin was modeled by sorting the mean accuracy rates (\(MeanAcc(G_{id})\)) and selecting the individuals with values greater than the selection threshold t (\(t=0.7\)). This model follows the proposal described by Yang and Honavar [35]. The chosen individuals were defined as parents of the next generation, gathering genes (methods and features) capable of providing an acceptable solution: better combinations of features, selection methods and classifiers.
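The selection step can be sketched as below, assuming the mean accuracies are held in a dictionary keyed by \(G_{id}\). The function name and the scores are hypothetical and serve only to illustrate the threshold.

```python
def select_parents(mean_acc, t=0.7):
    """Return the ids of individuals with MeanAcc > t, best first."""
    ranked = sorted(mean_acc.items(), key=lambda item: item[1], reverse=True)
    return [g_id for g_id, acc in ranked if acc > t]


# Made-up accuracies for four individuals.
scores = {0: 0.65, 1: 0.88, 2: 0.70, 3: 0.91}
parents = select_parents(scores)
print(parents)  # [3, 1]
```

Individual 2 is not selected because its accuracy equals t, and the rule described above requires a value strictly greater than the threshold.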

2.2 Crossover and Mutation

Reproduction complements the population of the current generation with individuals produced from those selected in the previous step. This approach simulates the sexual reproduction found in several species in nature. The crossover operation was implemented using the two-point approach, which searches for the best solution by replacing both the selection and the classification methods. Two-point crossover consists of choosing two loci of a chromosome as points of exchange (or pivots) and alternately copying the parents’ genes to the two generated children. In the chromosome structure used in the proposed model, all genes except \(G_{id}\) were used to determine the next generations. The mutation operator was applied to the children to define the population of the next iteration. The mutation operation consisted of the following steps:

  • for each newborn individual, a random number \(\alpha \) is drawn to indicate whether the child should be mutated, considering \(\alpha \) \(\in \) \(\mathbb {R}\) \(\mid \) 0 \(\le \) \(\alpha \) \(\le \) 1;

  • if \(\alpha \) > m, where m is the mutation probability, defined as 0.05% [25], the individual is not mutated. Otherwise (\(\alpha \) \(\le \) m), the individual is mutated.
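The two-point crossover and the mutation decision described in this section can be sketched as follows. This is a minimal illustration under stated assumptions: parents are represented as tuples (\(G_{sel}\), \(G_{clf}\), \(G_{num}\)), the two pivots are drawn uniformly at random, and the function names are hypothetical.

```python
import random


def two_point_crossover(parent_a, parent_b, rng):
    """Exchange the genes lying between two pivots of the parents."""
    n = len(parent_a)
    # Two distinct cut positions in 0..n, sorted so that lo < hi.
    lo, hi = sorted(rng.sample(range(n + 1), 2))
    child_a = parent_a[:lo] + parent_b[lo:hi] + parent_a[hi:]
    child_b = parent_b[:lo] + parent_a[lo:hi] + parent_b[hi:]
    return child_a, child_b


def should_mutate(m, rng):
    """Draw alpha and mutate the child only when alpha <= m."""
    alpha = rng.random()  # uniform in [0, 1)
    return alpha <= m


rng = random.Random(42)
# Genes are (G_sel, G_clf, G_num); the values are made up for the example.
child_a, child_b = two_point_crossover((0, 3, 29), (2, 5, 40), rng)
```

At every gene position the two children carry the two parental values between them, so no gene value is created or lost by the crossover itself.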

When an individual undergoes mutation, an index \(\beta \) is drawn to indicate which gene must be mutated. The variable \(\beta \) can take the values 1, 2 or 3, which represent the \(G_{sel}\), \(G_{clf}\) and \(G_{num}\) genes, respectively. Flip mutation was applied to the genes representing lists (\(G_{sel}\) and \(G_{clf}\)), and creep mutation to the gene that holds a number (\(G_{num}\)). In flip mutation, a method is replaced by another of the same type. In creep mutation, a value is added to or subtracted from the gene. Considering these mutation processes, the following steps were performed:

  • if \(\beta \) = 1 or \(\beta \) = 2, the “take-the-next” strategy was applied to the list of selection methods or of classification methods, respectively;

  • if \(\beta \) = 3, the chosen gene is \(G_{num}\). In this case, a new random number \(\gamma \) is drawn, given by \(\gamma \) = 0 or \(\gamma \) = 1. If \(\gamma \) = 1, \(G_{num}\) is incremented by one. Otherwise (\(\gamma \) = 0), \(G_{num}\) is decremented by one. The maximum value for \(G_{num}\) is delimited by MaxF (maximum number of features).
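The flip and creep mutations above can be sketched as follows. The wrap-around at the end of each method list and the lower bound of 1 for \(G_{num}\) are assumptions, since the text only states the upper bound MaxF; the function and constant names are hypothetical.

```python
import random

N_SELECTION_METHODS = 5  # five selection methods are listed in the text
N_CLASSIFIERS = 8        # eight classifiers are listed in the text


def mutate(genes, max_f, rng):
    """Mutate one gene of [G_sel, G_clf, G_num] and return a new list."""
    g_sel, g_clf, g_num = genes
    beta = rng.choice([1, 2, 3])
    if beta == 1:
        # Flip mutation, "take-the-next": move to the next selection method.
        g_sel = (g_sel + 1) % N_SELECTION_METHODS
    elif beta == 2:
        # Flip mutation, "take-the-next": move to the next classifier.
        g_clf = (g_clf + 1) % N_CLASSIFIERS
    else:
        # Creep mutation: gamma = 1 increments G_num, gamma = 0 decrements it.
        gamma = rng.choice([0, 1])
        step = 1 if gamma == 1 else -1
        # Clamp to [1, max_f]; the lower bound of 1 is an assumption.
        g_num = min(max(g_num + step, 1), max_f)
    return [g_sel, g_clf, g_num]
```

With this clamping, a creep step at either bound leaves \(G_{num}\) unchanged instead of producing an invalid number of features.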

The procedures described above were repeated until the maximum number of generations was reached or until 99% (or more) of the individuals provided a MeanAcc rate equal to 100%. A summary of the proposed method is shown in Algorithm 1.
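The stopping rule can be expressed as a small predicate. This is a minimal sketch: the function name and the representation of accuracies as fractions in [0, 1] are assumptions.

```python
def should_stop(generation, max_generations, mean_accs, fraction=0.99):
    """Stop when the generation budget is spent or when at least
    `fraction` of the individuals reach a MeanAcc of 100%."""
    if generation >= max_generations:
        return True
    if not mean_accs:
        return False
    perfect = sum(1 for acc in mean_accs if acc >= 1.0)
    return perfect / len(mean_accs) >= fraction
```

For example, with 100 individuals the second condition is met once 99 of them reach a mean accuracy of 1.0.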

Algorithm 1. Summary of the proposed method.

2.3 Colorectal Database and Feature Set

The tests were performed on features extracted from a dataset of histological colorectal cancer images. They were defined by the method described in [30]. The dataset consists of samples derived from 16 H&E colon histology sections from stage T3 or T4 colorectal adenocarcinoma. Each section belongs to a patient. Areas with different histological architectures were extracted from the sections, and the samples were stained with H&E.

For each input image, features were defined by two fractal dimension values, \(DF_p\) [18] and \(DF_f\) [27]; five lacunarity values (Lac), obtained by the metrics area under the curve (ARC), skewness (SKW), area ratio (AR), maximum point (MP) and scale of the maximum point (SMP), represented by Lac(1) to Lac(5); 14 Haralick texture features (Har) [16], represented by Har(1) to Har(14), such as angular second moment, correlation and sum of squares; and 15 percolation features (Perc(1) to Perc(15)) [31], in which the ARC, SKW, AR, MP and SMP metrics were also applied to each percolation function, given by cluster average (C), percolating box ratio (Q) and average coverage ratio of the largest cluster (\(\varGamma \)). These features were also calculated for the curvelet subimages [6]. The curvelets were calculated on 4 levels of resolution and a sequence of 8 rotation angles. This approach resulted in 41 curvelet subimages for each colorectal image and a feature set composed of 1,512 features.
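The size of the feature set can be checked with the short calculation below. It assumes that the 36 descriptors are computed once for the original image and once for each of the 41 curvelet subimages, an interpretation that reproduces the reported total; the variable names are hypothetical.

```python
N_FRACTAL_DIMENSION = 2  # DF_p and DF_f
N_LACUNARITY = 5         # Lac(1) to Lac(5)
N_HARALICK = 14          # Har(1) to Har(14)
N_PERCOLATION = 3 * 5    # 3 percolation functions x 5 metrics, Perc(1) to Perc(15)

features_per_image = (N_FRACTAL_DIMENSION + N_LACUNARITY
                      + N_HARALICK + N_PERCOLATION)

N_CURVELET_SUBIMAGES = 41
# Assumption: the original image contributes one descriptor set
# in addition to the 41 curvelet subimages.
total_features = features_per_image * (1 + N_CURVELET_SUBIMAGES)

print(features_per_image)  # 36
print(total_features)      # 1512
```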

3 Results

The feature set was given as input to our method and analyzed in tests defined with different values for P (number of individuals) and Iter (iterations or generations). The purpose was to verify the method’s behavior under different situations, as well as to identify possible patterns in the context of colorectal images. The results provided by the method are presented in Tables 1 and 2, obtained with population values of \(P=50\) and \(P=500\). These values were defined considering works available in the literature [1, 5] and in order to indicate the best combinations in each scenario. The results represent a set of (random) possible solutions involving: selection method, classifier, number of features (NumF) and average accuracy rate (MeanAcc). The area under the ROC curve (AUC) was also collected in each test to complement the performance comparisons of our proposal.

Table 1. Best combination obtained by P = 50 for iterations defined as 50, 100 and 500.
Table 2. Best combination obtained by P = 500 for iterations defined as 50, 100 and 500.

The results show that, for P = 50, the best result was found with 100 iterations. The solution was given by Relief (selection method) and random forest (classifier), with an accuracy rate of 87.97%. The best overall case was obtained with the larger population (P = 500). The highest accuracy rate with the lowest number of features was obtained with 50 iterations: the solution defined by our method provided an accuracy of 90.82%, computed with 29 features, Relief (selection method) and random forest (classifier). In this case, the AUC was 0.967. It is important to mention that, although the tests performed with 100 and 500 iterations reached the same accuracy and indicated the same selection method and classifier, the difference in the number of features is significant.

Colorectal cancer classification from histological images is the subject of several papers available in the literature, such as those described in [20] and [19]. Therefore, a performance overview obtained with our proposal is presented in Table 3, based on AUC rate (which was measured by all the works used for this verification), total features and classification methods.

Table 3. AUC performance provided by related works developed for the study and classification of colorectal cancer from histological images.

It is important to observe that direct comparisons cannot be performed to indicate the best approach, since different methodologies and databases were used. Nevertheless, considering the rate provided by our proposal and what was found in the literature, we believe the method is promising and capable of providing an acceptable solution (the highest distinction rate with the smallest possible number of features). Our solution reached an AUC of 0.967 with 29 of 1,512 features, values compatible with important works in the literature directed to the development of CAD systems and colorectal cancer.

One of the advantages of our proposal is that it identifies the most relevant features and their values (Table 4). Most of the selected features came from the association of percolation descriptors and subimages, totaling 16 features. Lacunarity was the second most selected type, totaling 9 features, most of them (8) obtained from curvelet subimages. Lastly, Haralick’s measures contributed four metrics. On the other hand, the multiscale and multidimensional fractal dimension measurements (\(DF_p\) and \(DF_f\)) were not selected by our strategy to classify colorectal cancer from the H&E images. We believe that this information is important for the development of CAD systems. Area under the curve (ARC), skewness (SKW), area ratio (AR), maximum point (MP) and scale of the maximum point (SMP) were the most used metrics for the lacunarity and percolation attributes. These metrics were obtained mainly from the combination with curvelet subimages.

Table 4. Discrimination of the selected features obtained in the best result.

4 Conclusion

In this work, a GA-based method capable of finding the best combination of features, selection methods and classifiers was proposed in order to provide information for the diagnosis of colorectal cancer from H&E images. The methodology was built from a structured model of evaluation, selection, crossover and mutation. The method presented relevant results. The best solution was obtained with 500 individuals and 50 iterations, resulting in 29 features, the Relief selection algorithm and the random forest classifier. The accuracy rate obtained was 90.82% and the AUC was 0.967. Performance was compared with important works available in the literature. The results were relevant, especially when considering comparisons under similar conditions and the number of features used. In the overview of studies developed for colorectal cancer classification, the performance was similar to that available in the literature, with the added benefit of discriminating and detailing possible patterns of features indicated for the separation of benign and malignant groups of colorectal cancer. In future work we intend to explore different values for the parameters required by our model and other types of images. Finally, we intend to test our model for pattern recognition in H&E images of lymphomas and breast cancer, with or without normalization of the dyes present in the slides.