Mining the breast cancer pattern using artificial neural networks and multivariate adaptive regression splines

https://doi.org/10.1016/j.eswa.2003.12.013

Abstract

Data mining is a popular technique that has been widely applied in many areas. The artificial neural network has become a popular alternative for prediction and classification tasks owing to its associative memory characteristics and generalization capability. However, the difficulty of identifying the relative importance of potential input variables and the long training process have often been criticized and have limited its application to classification problems. The objective of this study is to explore the performance of data classification by integrating artificial neural networks with the multivariate adaptive regression splines (MARS) approach. The rationale of the analysis is first to use MARS to model the classification problem, and then to use the significant variables it identifies as the input variables of the designed neural network model. To demonstrate that including the important variables obtained from MARS improves the classification accuracy of the network, diagnostic tasks are performed on one fine needle aspiration cytology breast cancer data set. The results show that the proposed integrated approach outperforms discriminant analysis, artificial neural networks, and multivariate adaptive regression splines used alone, and hence provides an efficient alternative for handling breast cancer diagnostic problems.

Introduction

Modern medical facilities are equipped with monitoring and data collection devices that provide inexpensive ways to collect and store data in their information systems. The huge amounts of data stored in these databases require special techniques for processing, analysis, and effective use before they can support medical decision-making. Data mining (DM), sometimes referred to as knowledge discovery in databases (KDD), is a systematic approach to finding underlying patterns, trends, and relationships buried in data. According to Curt (1995), the methodologies consist of data visualization, machine learning, statistical techniques, and deductive databases. The related applications of these methodologies can be summarized as classification, prediction, clustering, summarization, dependency modeling, linkage analysis, and sequential analysis (Fayyad, Piatetsky-Shapiro, & Smyth, 1996). Data mining has drawn serious attention from both researchers and practitioners due to its applications in decision support, financial forecasting, fraud detection, marketing strategy, process control, medical research and other related fields (Cabena et al., 1998, Chen et al., 1996, Fayyad et al., 1996, Lee et al., 1999, Ngan et al., 1999, Pendharkar et al., 1999).

Breast cancer, a very common and serious cancer for women, affects almost one in every seven women in the United States (Wingo, Tong, & Bolden, 1995). One of the most commonly used methods for detecting breast cancer is mammography. However, the literature reports that radiologists show considerable variation in interpreting mammograms (Elmore et al., 1994). Fine needle aspiration cytology (FNAC) is also widely adopted in the diagnosis of breast cancer, but according to Fentiman (1998), the average correct identification rate of FNAC is only about 90%. It is therefore necessary to develop better identification tools for recognizing breast cancer. In response to this need, several researchers have used statistical and artificial intelligence techniques to successfully ‘predict’ breast cancer (Kovalerchuck et al., 1997, Pendharkar et al., 1999). The objective of these identification techniques is to assign patients to either a ‘benign’ group that does not have breast cancer or a ‘malignant’ group that shows strong evidence of having breast cancer. Breast cancer diagnostic problems therefore fall within the scope of the more general and widely discussed classification problems (Hand, 1981, Anderson, 1984, Dillon and Goldstein, 1984, Johnson and Wichern, 2002).

Generally, discriminant analysis and logistic regression are the two most commonly used data mining techniques for constructing classification models. However, linear discriminant analysis (LDA) has often been criticized for its assumptions, given the categorical nature of the data and the fact that the covariance matrices of different classes are unlikely to be equal (Reichert, Cho, & Wagner, 1983). Logistic regression is an alternative for classification tasks, and it emerged as the technique of choice for predicting dichotomous outcomes. Harrell and Lee (1985) found that logistic regression is as efficient as LDA. However, it has also been criticized for strong model assumptions such as variation homogeneity, which limit its application. Theoretically, both LDA and logistic regression are appropriate modeling tools when the relationship among variables is linear. Artificial neural networks have become an efficient alternative for modeling classification problems because they can capture complex nonlinear relationships among variables. Even though neural networks have been reported to have better classification capability than LDA and logistic regression (Desai et al., 1996, Jensen, 1992, Lee et al., 2002, Piramuthu, 1999, West, 2000), they have also been criticized for the long training process needed to design the optimal network topology and for the difficulty of identifying the relative importance of potential input variables, which limit their applicability to classification problems (Chung et al., 1999, Craven and Shavlik, 1997, Lee et al., 2002).

In addition to the above-mentioned techniques, multivariate adaptive regression splines (MARS) is another commonly discussed data mining technique. MARS is widely accepted by data mining practitioners for the following reasons. First, unlike LDA and logistic regression, MARS can model complex relationships among variables without strong model assumptions. Second, unlike neural networks, MARS can identify ‘important’ independent variables through the basis functions it builds (more details will be discussed in Section 2) when many potential variables are considered. Third, MARS does not need a long training process and hence can save considerable modeling time when the data set is huge. Finally, one strong advantage of MARS over other classification techniques is that the resulting model can be easily interpreted. It not only points out which variables are important in classifying objects/observations, but also indicates that a particular object/observation belongs to a specific class when the built rules are satisfied. This last property has important implications and can help professionals make appropriate decisions.
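To make the idea of ‘important’ variables concrete, the following is a minimal sketch of how a fitted MARS-style model could be represented and evaluated. It assumes the usual hinge form max(0, ±(x − t)) for the basis functions; the class and function names, the coefficients, and the knots are all hypothetical and are not taken from the paper. A variable counts as used only if it appears in a basis function with a nonzero coefficient.

```python
from dataclasses import dataclass
from typing import Sequence, Tuple


@dataclass(frozen=True)
class Hinge:
    """One hinge factor: max(0, sign * (x[var] - knot))."""
    var: int     # index of the input variable
    knot: float  # knot location t
    sign: int    # +1 for (x - t)+, -1 for (t - x)+

    def __call__(self, x: Sequence[float]) -> float:
        return max(0.0, self.sign * (x[self.var] - self.knot))


def basis_value(factors: Tuple[Hinge, ...], x: Sequence[float]) -> float:
    """A basis function is a product of hinge factors."""
    value = 1.0
    for factor in factors:
        value *= factor(x)
    return value


def mars_predict(intercept, terms, x):
    """Weighted sum of basis functions; terms is a list of (coef, factors)."""
    return intercept + sum(coef * basis_value(factors, x) for coef, factors in terms)


def used_variables(terms):
    """Indices of variables that appear in a term with a nonzero coefficient."""
    return sorted({f.var for coef, factors in terms if coef != 0.0 for f in factors})


# Hypothetical fitted model over three candidate variables (illustrative only).
terms = [
    (0.5, (Hinge(0, 1.0, +1),)),                     # 0.5 * (x0 - 1)+
    (-0.3, (Hinge(2, 2.0, -1),)),                    # -0.3 * (2 - x2)+
    (0.1, (Hinge(0, 1.0, +1), Hinge(2, 0.5, +1))),   # interaction term
]
x = [3.0, 99.0, 1.0]
score = mars_predict(0.2, terms, x)
print("score:", score)
print("variables used:", used_variables(terms))  # variable 1 never appears
```

In this toy model variable 1 never enters a basis function, so its value has no effect on the score. That is the sense in which a MARS fit can double as a variable screen.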

To address the above-mentioned drawbacks of neural networks and to increase the classification accuracy of existing approaches, this study explores the performance of breast cancer diagnosis using a two-stage hybrid modeling procedure that integrates multivariate adaptive regression splines with neural networks. The rationale of the analysis is first to use MARS to model the breast cancer diagnostic problem, and then to use the significant predictor variables it identifies as the input variables of the designed neural network model. Using MARS as a supporting tool for designing the topology of neural networks is valuable because it reveals more about the inner workings of the model. Moreover, as there is no theoretical method for determining the best input variables of a neural network model, MARS can serve as a method for determining a good subset of input variables when many potential variables are candidates for the input vector. To demonstrate that including the predictor variables obtained from MARS improves the classification accuracy of the neural network model, breast cancer diagnostic tasks are performed on one FNAC dataset. As to the structure of the designed neural network model, sensitivity analysis is first employed to find an appropriate setup of the network's topology. The analytic results demonstrate that the proposed hybrid model provides a better initial solution and hence converges much faster than the conventional neural network model. In addition, the proposed hybrid methodology achieves higher classification accuracy than the traditional neural network approach. Its superior classification capability can also be observed by comparing the results with those of linear discriminant analysis and of MARS used alone.
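The two-stage workflow described above (screen the variables, then train a classifier on the survivors) can be sketched as follows. This is only an illustration under loud assumptions. Stage 1 uses a single-hinge correlation score as a stand-in for a full MARS fit. Stage 2 uses one sigmoid unit trained by gradient descent as a stand-in for the designed neural network. The toy data, the threshold, and all function names are hypothetical and do not come from the paper.

```python
import math
import random


def correlation(a, b):
    """Pearson correlation; returns 0.0 if either input is constant."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b) if var_a > 0 and var_b > 0 else 0.0


def hinge_score(xs, ys, n_knots=9):
    """Stage 1 stand-in (NOT full MARS): best |correlation| between the
    label and a single hinge max(0, x - knot) over a few quantile knots."""
    ordered = sorted(xs)
    best = 0.0
    for k in range(n_knots):
        knot = ordered[k * len(ordered) // (n_knots + 1)]
        hinge = [max(0.0, x - knot) for x in xs]
        best = max(best, abs(correlation(hinge, ys)))
    return best


def select_variables(rows, ys, threshold=0.3):
    """Keep the indices of variables whose hinge score exceeds the threshold."""
    return [j for j in range(len(rows[0]))
            if hinge_score([r[j] for r in rows], ys) > threshold]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic_unit(rows, ys, epochs=500, lr=1.0):
    """Stage 2 stand-in: one sigmoid unit trained by batch gradient descent."""
    w, b, n = [0.0] * len(rows[0]), 0.0, len(rows)
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * len(w), 0.0
        for row, y in zip(rows, ys):
            err = sigmoid(b + sum(wi * xi for wi, xi in zip(w, row))) - y
            grad_b += err
            for j, xj in enumerate(row):
                grad_w[j] += err * xj
        b -= lr * grad_b / n
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
    return w, b


def accuracy(w, b, rows, ys):
    """Fraction of rows whose thresholded output matches the label."""
    hits = 0
    for row, y in zip(rows, ys):
        predicted = 1 if sigmoid(b + sum(wi * xi for wi, xi in zip(w, row))) > 0.5 else 0
        hits += predicted == y
    return hits / len(rows)


def project(rows, columns):
    """Restrict each row to the selected columns."""
    return [[r[j] for j in columns] for r in rows]


# Hypothetical toy data: the label depends on variables 0 and 2; variable 1 is noise.
rng = random.Random(0)
data = [[rng.random() for _ in range(3)] for _ in range(600)]
labels = [1 if row[0] + row[2] > 1.0 else 0 for row in data]
train_x, test_x = data[:450], data[450:]
train_y, test_y = labels[:450], labels[450:]

selected = select_variables(train_x, train_y)                     # stage 1
w, b = train_logistic_unit(project(train_x, selected), train_y)   # stage 2
test_accuracy = accuracy(w, b, project(test_x, selected), test_y)
print("selected variables:", selected)
print("held-out accuracy:", round(test_accuracy, 3))
```

The design point is that the second stage never sees the discarded columns, so the classifier has fewer weights to fit. Whether this improves accuracy or convergence on real data is an empirical question that the toy example does not answer.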

The rest of the paper is organized as follows. Section 2 gives a brief review of neural networks and multivariate adaptive regression splines and of the related literature. Section 3 presents the development and the empirical results of the breast cancer diagnostic models using linear discriminant analysis, MARS, neural networks, and the hybrid model that integrates MARS and neural networks. Finally, Section 4 presents the conclusions and discusses possible areas of future research.

Artificial neural networks

Neural networks, originally derived from neurobiological models, are massively parallel, computer-intensive, and data-driven algorithmic systems composed of a multitude of highly interconnected nodes, also known as neurons. Mimicking human neurobiological information-processing activities, each elementary node of a neural network is able to receive an input signal from external sources or other nodes and the algorithmic procedure equipped in each node is sequentially activated to locally

Empirical study

To verify the feasibility and effectiveness of the proposed two-stage hybrid modeling procedure, one FNAC dataset provided by the Departments of Surgery, Human Oncology, and Computer Sciences at the University of Wisconsin at Madison is used in this study (Mangasarian et al., 1990, Bennett and Mangasarian, 1992). The data set consists of 569 patients' records. Among them, 212 are reported to have breast cancer while the remaining 357 are not. The diagnostic results of each patient consist of 30

Conclusions and areas of future research

Breast cancer is a very common and serious cancer for women throughout the world. The commonly used diagnostic techniques, like mammography and FNAC, are reported to lack high diagnostic capability. Therefore, there is a clear need to develop better diagnostic techniques. The objective of these identification techniques is to assign patients to either a ‘benign’ group that does not have breast cancer or a ‘malignant’ group that shows strong evidence of having breast

References (62)

  • M.S. Sanchez et al. Efficiency of multi-layered feedforward neural networks on classification in relation to linear discriminant analysis, quadratic discriminant analysis and regularized discriminant analysis. Chemometrics and Intelligent Laboratory Systems (1995)
  • A. Vellido et al. Neural networks in business: a survey of applications (1992–1998). Expert Systems with Applications (1999)
  • D. West. Neural network credit scoring models. Computers and Operations Research (2000)
  • F.S. Wong. Time series forecasting using backpropagation neural networks. Neurocomputing (1991)
  • G. Zhang et al. Forecasting with artificial neural networks: the state of the art. International Journal of Forecasting (1998)
  • T.W. Anderson. An introduction to multivariate statistical analysis (1984)
  • J.A. Anderson et al. Neurocomputing: foundations of research (1988)
  • K.P. Bennett et al. Robust linear programming discrimination of two linearly inseparable sets. Optimization Methods and Software (1992)
  • L. Breiman et al. Classification and regression trees (1984)
  • P. Cabena et al. Discovering data mining from concept to implementation (1998)
  • M.S. Chen et al. Data mining: an overview from a database perspective. IEEE Transactions on Knowledge and Data Engineering (1996)
  • B. Cheng et al. Neural networks: a review from a statistical perspective (with discussion). Statistical Science (1994)
  • C.C. Chiu et al. Identification of process disturbance using SPC/EPC and neural networks. Journal of Intelligent Manufacturing (2003)
  • H.M. Chung et al. Special section: data mining. Journal of Management Information Systems (1999)
  • M.W. Craven et al. Using neural networks for data mining. Future Generation Computer Systems (1997)
  • P. Craven et al. Smoothing noisy data with spline functions: estimating the correct degree of smoothing by the method of generalized cross-validation. Numerische Mathematik (1979)
  • G. Cybenko. Approximation by superpositions of a sigmoidal function. Mathematics of Control, Signals, and Systems (1989)
  • H. Curt. The devil's in the detail: techniques, tools, and applications for database mining and knowledge discovery-Part. Intelligent Software Strategies (1995)
  • P.C. Davies. Design issues in neural network development. Neurovest (1994)
  • W.R. Dillon et al. Multivariate analysis methods and applications (1984)
  • J. Elmore et al. Variability in radiologists' interpretations of mammograms. New England Journal of Medicine (1994)