Mining the breast cancer pattern using artificial neural networks and multivariate adaptive regression splines
Introduction
Modern medical facilities are equipped with monitoring and data-collection devices that provide inexpensive ways to gather and store data in their information systems. The huge amounts of data stored in these databases require special techniques for processing and analysis before they can support medical decision-making. Data mining (DM), sometimes referred to as knowledge discovery in databases (KDD), is a systematic approach to finding the underlying patterns, trends, and relationships buried in data. According to Curt (1995), its methodologies consist of data visualization, machine learning, statistical techniques, and deductive databases. The related applications of these methodologies can be summarized as classification, prediction, clustering, summarization, dependency modeling, linkage analysis, and sequential analysis (Fayyad, Piatetsky-Shapiro, & Smyth, 1996). Data mining has drawn serious attention from both researchers and practitioners because of its applications in decision support, financial forecasting, fraud detection, marketing strategy, process control, medical research, and other related fields (Cabena et al., 1998, Chen et al., 1996, Fayyad et al., 1996, Lee et al., 1999, Ngan et al., 1999, Pendharkar et al., 1999).
Breast cancer, a very common and serious cancer for women, affects almost one in every seven women in the United States (Wingo, Tong, & Bolden, 1995). One of the most commonly used methods of detecting breast cancer is mammography. However, the literature reports that radiologists show considerable variation in interpreting mammograms (Elmore et al., 1994). Fine needle aspiration cytology (FNAC) is also widely adopted in the diagnosis of breast cancer but, according to Fentiman (1998), its average correct identification rate is only about 90%. Better tools for identifying breast cancer are therefore needed. To meet this need, several researchers have used statistical and artificial intelligence techniques to ‘predict’ breast cancer successfully (Kovalerchuck et al., 1997, Pendharkar et al., 1999). The objective of these identification techniques is to assign patients either to a ‘benign’ group that does not have breast cancer or to a ‘malignant’ group that shows strong evidence of having breast cancer. Breast cancer diagnosis therefore falls within the scope of the more general and widely discussed classification problem (Hand, 1981, Anderson, 1984, Dillon and Goldstein, 1984, Johnson and Wichern, 2002).
Discriminant analysis and logistic regression are the two most commonly used data mining techniques for constructing classification models. However, linear discriminant analysis (LDA) has often been criticized because of its assumption about the categorical nature of the data and because the covariance matrices of different classes are unlikely to be equal (Reichert, Cho, & Wagner, 1983). Logistic regression is an alternative for classification tasks; it emerged as the technique for predicting dichotomous outcomes. Harrell and Lee (1985) found that logistic regression is as efficient as LDA. However, it has also been criticized for strong model assumptions such as variation homogeneity, which limit its application. Theoretically, both LDA and logistic regression are appropriate modeling tools when the relationship among variables is linear. Artificial neural networks have become an efficient alternative for modeling classification problems because of their capability to capture complex nonlinear relationships among variables. Although neural networks have been reported to have better classification capability than LDA and logistic regression (Desai et al., 1996, Jensen, 1992, Lee et al., 2002, Piramuthu, 1999, West, 2000), they have been criticized for the long training process needed to design the optimal network topology and for the difficulty of identifying the relative importance of potential input variables, which limits their applicability to classification problems (Chung et al., 1999, Craven and Shavlik, 1997, Lee et al., 2002).
In addition to the above-mentioned techniques, multivariate adaptive regression splines (MARS) is another commonly discussed data mining technique. MARS is widely accepted by data mining practitioners for the following reasons. First, unlike LDA and logistic regression, MARS can model complex relationships among variables without strong model assumptions. Second, unlike neural networks, MARS can identify ‘important’ independent variables through the basis functions it builds (more details will be discussed in Section 2) when many potential variables are considered. Third, MARS does not need a long training process and hence can save considerable modeling time when the data set is huge. Finally, one strong advantage of MARS over other classification techniques is that the resulting model can be easily interpreted. It not only points out which variables are important in classifying objects/observations, but also indicates that a particular object/observation belongs to a specific class when the built rules are satisfied. This last property has important implications and can help professionals make appropriate decisions.
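The basis functions mentioned above can be illustrated with a minimal sketch. In the standard formulation of MARS, the model is a weighted sum of mirrored hinge functions, max(0, x - t) and max(0, t - x), placed at a knot t. The function names, the knot, and the coefficients below are made-up values for illustration only and are not taken from this study.

```python
def hinge_pair(x, knot):
    """Return the two mirrored hinge basis functions evaluated at x."""
    return max(0.0, x - knot), max(0.0, knot - x)


def mars_predict(x, intercept, terms):
    """Evaluate an additive MARS-style model for a single predictor.

    terms is a list of (knot, coef_right, coef_left) tuples; every value
    used in this example is hypothetical.
    """
    total = intercept
    for knot, coef_right, coef_left in terms:
        right, left = hinge_pair(x, knot)
        total += coef_right * right + coef_left * left
    return total


# Hypothetical model: one knot at 2.0, slope 0.5 to its right, -0.25 to its left.
example_terms = [(2.0, 0.5, -0.25)]
score_high = mars_predict(4.0, 1.0, example_terms)  # 1.0 + 0.5 * 2.0 = 2.0
score_low = mars_predict(0.0, 1.0, example_terms)   # 1.0 - 0.25 * 2.0 = 0.5
```

Because each term is zero on one side of its knot, a variable that never appears in a retained term contributes nothing to the prediction, which is what makes the fitted model readable as a set of threshold rules.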
Aiming to address the above-mentioned drawbacks of neural networks and to increase the classification accuracy of existing approaches, this study explores the performance of breast cancer diagnosis using a two-stage hybrid modeling procedure that integrates the multivariate adaptive regression splines approach with the neural network technique. The rationale is first to use MARS to model the breast cancer diagnostic problem. The significant predictor variables obtained are then used as the input variables of the designed neural network model. Using MARS as a supporting tool for designing the topology of neural networks is valuable because it reveals more about the network's inner workings. In addition, because there is no theoretical method for determining the best input variables of a neural network model, MARS can serve as a generally accepted method for determining a good subset of input variables when many potential variables are considered for the input vector. To demonstrate that including the predictor variables obtained from MARS improves the classification accuracy of the neural network model, breast cancer diagnostic tasks are performed on one FNAC dataset. As for the structure of the designed neural network model, sensitivity analysis is first employed to find an appropriate setup for the network's topology. The analytic results demonstrate that the proposed hybrid model provides a better initial solution and hence converges much faster than the conventional neural network model. The proposed hybrid methodology also achieves higher classification accuracy than the traditional neural network approach.
Moreover, the superior classification capability of the proposed technique can be observed by comparing its results with those obtained using linear discriminant analysis and MARS alone.
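The two-stage idea described above can be sketched as follows. This is only an illustration under stated assumptions: stage 1 is a simplified stand-in that ranks each variable by the gain of its best single hinge term, not the full MARS forward and backward procedure, and stage 2 is a small one-hidden-layer network trained by stochastic gradient descent. All function names, hyperparameters, and the synthetic data are hypothetical and are not the authors' setup.

```python
import math
import random


def hinge_gain(xs, ys, knot):
    """Squared-error reduction from adding max(0, x - knot) to a constant model."""
    hs = [max(0.0, x - knot) for x in xs]
    n = len(xs)
    h_mean, y_mean = sum(hs) / n, sum(ys) / n
    s_hh = sum((h - h_mean) ** 2 for h in hs)
    if s_hh == 0.0:
        return 0.0
    s_hy = sum((h - h_mean) * (y - y_mean) for h, y in zip(hs, ys))
    return s_hy ** 2 / s_hh


def select_variables(rows, ys, keep):
    """Stage 1 stand-in: keep the variables with the largest single-hinge gain."""
    scores = []
    for j in range(len(rows[0])):
        column = [row[j] for row in rows]
        best = max(hinge_gain(column, ys, knot) for knot in sorted(set(column)))
        scores.append((best, j))
    scores.sort(reverse=True)
    return sorted(j for _, j in scores[:keep])


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def forward(network, x):
    """Return hidden activations and output; the last weight in each list is a bias."""
    w1, w2 = network
    h = [sigmoid(w[-1] + sum(wi * xi for wi, xi in zip(w, x))) for w in w1]
    return h, sigmoid(w2[-1] + sum(wi * hi for wi, hi in zip(w2, h)))


def train_network(rows, ys, hidden=3, epochs=500, rate=0.5, seed=0):
    """Stage 2: one-hidden-layer network fitted with per-sample gradient steps."""
    rng = random.Random(seed)
    n_in = len(rows[0])
    w1 = [[rng.uniform(-0.5, 0.5) for _ in range(n_in + 1)] for _ in range(hidden)]
    w2 = [rng.uniform(-0.5, 0.5) for _ in range(hidden + 1)]
    for _ in range(epochs):
        for x, y in zip(rows, ys):
            h, out = forward((w1, w2), x)
            delta_out = out - y  # cross-entropy gradient at the output unit
            for k in range(hidden):
                delta_h = delta_out * w2[k] * h[k] * (1.0 - h[k])
                w2[k] -= rate * delta_out * h[k]
                for i in range(n_in):
                    w1[k][i] -= rate * delta_h * x[i]
                w1[k][-1] -= rate * delta_h
            w2[-1] -= rate * delta_out
    return w1, w2


def predict(network, x):
    return 1 if forward(network, x)[1] >= 0.5 else 0


# Synthetic data (hypothetical): variable 0 carries the class signal, 1 and 2 are noise.
rng = random.Random(1)
rows, labels = [], []
for _ in range(60):
    label = rng.randint(0, 1)
    rows.append([0.2 + 0.6 * label + rng.uniform(-0.1, 0.1), rng.random(), rng.random()])
    labels.append(label)

selected = select_variables(rows, labels, keep=1)
reduced = [[row[j] for j in selected] for row in rows]
network = train_network(reduced, labels)
accuracy = sum(predict(network, x) == y for x, y in zip(reduced, labels)) / len(labels)
```

The network in stage 2 only ever sees the columns chosen in stage 1, so its input layer is smaller than it would be with all candidate variables. On real measurements the inputs would normally be rescaled to a common range before training.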
The rest of the paper is organized as follows. Section 2 gives a brief review of neural networks and multivariate adaptive regression splines and of the related literature. Section 3 presents the development and empirical results of breast cancer diagnostic models using linear discriminant analysis, MARS, neural networks, and the hybrid model integrating MARS and neural networks. Finally, Section 4 presents the conclusions and discusses possible areas of future research.
Section snippets
Artificial neural networks
Neural networks, originally derived from neurobiological models, are massively parallel, computer-intensive, and data-driven algorithmic systems composed of a multitude of highly interconnected nodes, also known as neurons. Mimicking human neurobiological information-processing activities, each elementary node of a neural network is able to receive an input signal from external sources or other nodes and the algorithmic procedure equipped in each node is sequentially activated to locally
Empirical study
To verify the feasibility and effectiveness of the proposed two-stage hybrid modeling procedure, one FNAC dataset provided by the departments of surgery, human oncology, and computer sciences, University of Wisconsin at Madison, is used in this study (Mangasarian et al., 1990, Bennett and Mangasarian, 1992). The data set consists of 569 patients' records. Among them, 212 are reported to have breast cancer while the remaining 357 are not. The diagnostic results of each patient consist of 30
Conclusions and areas of future research
Breast cancer is a very common and serious cancer for women throughout the world. The commonly used diagnostic techniques, such as mammography and FNAC, are reported to lack high diagnostic capability. Better diagnostic techniques are therefore needed. The objective of these identification techniques is to assign patients either to a ‘benign’ group that does not have breast cancer or to a ‘malignant’ group that shows strong evidence of having breast
References (62)
- et al. Forecasting exchange rates using TSMARS. Journal of International Money and Finance (1998)
- et al. A comparison of neural networks and linear scoring models in the credit union environment. European Journal of Operational Research (1996)
- et al. Statistical techniques for the classification of chromites in diamond exploration samples. Journal of Geochemical Exploration (1997)
- et al. Combining non-parametric models with logistic regression: an application to motor vehicle injury data. Computational Statistics and Data Analysis (2000)
- et al. Investigating the information content of non-cash-trading index futures using neural networks. Expert Systems with Applications (2002)
- et al. Credit scoring using the hybrid neural discriminant technique. Expert Systems with Applications (2002)
- et al. Using multivariate adaptive regression splines to QSAR studies of dihydroartemisinin derivatives. European Journal of Medicinal Chemistry (1996)
- et al. Evaluation of automatic knowledge acquisition techniques in the diagnosis of acute abdominal pain. Artificial Intelligence in Medicine (1996)
- et al. Association, statistical, mathematical and neural approaches for mining breast cancer patterns. Expert Systems with Applications (1999)
- Financial credit-risk evaluation with neural and neurofuzzy systems. European Journal of Operational Research (1999)