Analysis on risk factors for cervical cancer using induction technique

https://doi.org/10.1016/j.eswa.2003.12.005

Abstract

Cervical cancer is a leading cause of cancer deaths in women worldwide. This study discusses a new approach to the analysis of risk factors and the management of cervical cancer. We identified the combined patterns of cervical cancer risk factors, including demographic, environmental and genetic factors, using an induction technique. We compared logistic regression and a decision tree algorithm, CHAID (Chi-squared Automatic Interaction Detection), using a training set of 577 participants and a test set of 133 participants. CHAID had a better predictive rate and sensitivity (72.96 and 64.00%, respectively) than logistic regression (71.83 and 40.80%, respectively). However, CHAID had lower specificity (77.83%) than logistic regression (88.70%). In addition, we demonstrated how the decision tree algorithm could be used in risk analysis and target segmentation for cervical cancer management. This is the first study to use an induction technique for the analysis of risk factors for cervical cancer, and its results will contribute to developing a clinical practice guideline for cervical cancer.

Introduction

Cervical cancer is a major worldwide health problem with incidence and mortality rates second only to breast cancer (Herzog, 2003). It is the number one killer of young women in underdeveloped countries (Waggoner, 2003). In Korean women, cervical cancer is the third most common form of cancer, after stomach and breast cancer (National Statistical Office, 2001).

The risk factors for cervical cancer include early age at first intercourse, menarche, number of children, education level, smoking, family history and human papillomavirus (HPV) (Ferenczy and Franco, 2002, Janicek and Averette, 2001). HPV is understood to be a necessary but not sufficient cause of cervical cancer (Cox, 1995), with various other risk factors being potentially important. However, it is still unclear what these risk factors are, or how they act together in the development of cervical cancer. Even people exposed to the same environmental factors exhibit varied susceptibility to any specific disease, especially because of genetic background.

In cervical cancer, the tumor suppressor gene p53 has a single nucleotide polymorphism (SNP) in codon 72 that has been the subject of heated controversy. Some studies show an over-expression of homozygous p53Arg in cervical cancer compared to heterozygous or homozygous p53Pro (Storey et al., 1998, Anderson et al., 2001, Yang et al., 2001). However, contrary reports show no relationship between the p53 polymorphism and cervical cancer (Minoguchi et al., 1998, Dybikowska et al., 2000, Wong et al., 2000). The codon 31 Ser/Ser homozygote of the p21 gene, one of the p53 effector proteins, could also be a risk factor for the development of cervical cancer (Roh et al., 2001).

Interferon regulatory factor-1 (IRF-1) binds to the genes of interferons, a family of cytokines that have antiviral as well as tumor-suppressing activities (Harada et al., 1989, Harada et al., 1993). Inactivation of IRF-1 by the HPV E7 protein may influence the immune mechanism in cervical carcinogenesis (Park et al., 2000). The Fragile Histidine Triad (FHIT) gene, a putative tumor suppressor gene located in chromosome region 3p14.2, appears to be particularly susceptible to carcinogens from cigarette smoking, which is a well-known epidemiologic risk factor for cervical cancer. However, no study has been conducted on whether FHIT is a potential genetic link between environmental factors, including cigarette smoking, and cervical cancer. These four genes were chosen because cervical cancer is an HPV-associated disease in whose carcinogenesis they play an important role.

The relationship between gene polymorphism and cervical carcinogenesis has been studied by many groups. However, the results from these previous studies have failed to produce clear conclusions because most of them dealt mainly with only one type of polymorphism and tended to over-focus on only the biomolecular bases without taking environmental factors into consideration.

Most past studies on cervical cancer have focused on describing individual patient characteristics and genetic polymorphisms separately using statistical methods. Therefore, the relative importance of the risk factors could not be compared and their relative statistical significance could not be established. In addition, no study has attempted to collectively analyze the relationship among demographic, environmental, and genetic factors.

To overcome the limitations of past studies, we tried to discover the combined patterns of cervical cancer risk factors, including demographic, environmental and genetic factors, using data mining. Data mining, which is also known as knowledge discovery in databases, is a process of nontrivial extraction of implicit, previously unknown and potentially useful information from a database (Bose & Mahapatra, 2001). In general, data mining is an essential process in knowledge discovery where intelligent methods are applied in order to extract data patterns.

This paper presents an induction technique as a data mining approach for discovering knowledge that predicts cervical cancer from demographic, environmental and genetic factors. We examined the risk factors for cervical cancer using logistic regression and a decision tree algorithm, and then analyzed the relationship among demographic, environmental and genetic factors for cervical cancer using the induction technique. In addition, we compared logistic regression and the decision tree algorithm, and present the practical use of rule induction for the management of cervical cancer.
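The measures used to compare the two models can be computed from a confusion table of predicted versus true labels. The following is a minimal sketch, assuming that "predictive rate" means overall accuracy and that labels are coded 1 for a cervical cancer case and 0 for a control. The class name, function names and toy labels are illustrative assumptions and are not data from this study.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfusionCounts:
    """Counts obtained by comparing predicted labels with true labels."""
    tp: int  # cases correctly predicted as cases
    fn: int  # cases wrongly predicted as controls
    tn: int  # controls correctly predicted as controls
    fp: int  # controls wrongly predicted as cases


def count_outcomes(y_true, y_pred):
    """Tally the confusion counts; 1 = case, 0 = control."""
    tp = fn = tn = fp = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == 1 and pred == 1:
            tp += 1
        elif truth == 1:
            fn += 1
        elif pred == 0:
            tn += 1
        else:
            fp += 1
    return ConfusionCounts(tp, fn, tn, fp)


def predictive_rate(c):
    """Share of all participants classified correctly (assumed meaning)."""
    return (c.tp + c.tn) / (c.tp + c.fn + c.tn + c.fp)


def sensitivity(c):
    """Share of true cases that the model identifies as cases."""
    return c.tp / (c.tp + c.fn)


def specificity(c):
    """Share of true controls that the model identifies as controls."""
    return c.tn / (c.tn + c.fp)


# Invented toy labels for ten hypothetical test-set participants.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]
counts = count_outcomes(y_true, y_pred)
print(predictive_rate(counts), sensitivity(counts), specificity(counts))
```

A model can have a higher predictive rate and sensitivity but a lower specificity than another, because sensitivity depends only on the cases and specificity only on the controls.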

Section snippets

Risk factors for cervical cancer

Cervical cancer is one of the most common cancers for women worldwide, and many risk factors for its development have been identified (Braun & Gavey, 1999). Carcinogenesis is a complex multistep process involving a number of genetic and epigenetic events. The primary cause in the development of cervical cancer is HPV. HPV is considered the main etiologic agent of cervical cancer and its high-grade precursor lesions (Ferenczy et al., 2002). Although many HPV types have been associated with

Research model

We analyzed the simultaneous relationship among demographic, environmental and genetic factors for cervical cancer using an induction technique (Fig. 1). This study compared the relative effects of each risk factor for cervical cancer in a multivariate analysis model. We tried to discover significant patterns and relationships among the risk factors and to derive decision rules for the management of cervical cancer.
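The core step of a chi-squared-based tree such as CHAID is to test each candidate predictor against the outcome and split on the most significant one. The following is a simplified sketch of that single step, restricted to binary predictors and a binary outcome so that every test has one degree of freedom. Full CHAID also merges predictor categories and adjusts the p-values for multiple comparisons, which this sketch omits. The variable names `smoker`, `family_history` and `case`, the threshold `alpha` and the records are hypothetical and are not taken from the study.

```python
import math
from collections import Counter


def chi_squared_2x2(xs, ys):
    """Pearson chi-squared statistic for two binary (0/1) variables."""
    n = len(xs)
    observed = Counter(zip(xs, ys))
    row = Counter(xs)
    col = Counter(ys)
    stat = 0.0
    for x in (0, 1):
        for y in (0, 1):
            expected = row[x] * col[y] / n
            if expected > 0:
                stat += (observed[(x, y)] - expected) ** 2 / expected
    return stat


def p_value_df1(stat):
    """Upper-tail probability of a chi-squared variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(stat / 2))


def best_split(records, outcome, predictors, alpha=0.05):
    """Return (predictor, p_value) for the most significant predictor,
    or None when no predictor reaches significance at level alpha."""
    ys = [r[outcome] for r in records]
    scored = []
    for name in predictors:
        xs = [r[name] for r in records]
        scored.append((p_value_df1(chi_squared_2x2(xs, ys)), name))
    p, name = min(scored)
    return (name, p) if p < alpha else None


# Invented records: (smoker, case, number of participants).
groups = [(1, 1, 15), (1, 0, 5), (0, 1, 5), (0, 0, 15)]
records = [
    {"smoker": s, "case": c, "family_history": i % 2}
    for s, c, count in groups
    for i in range(count)
]
result = best_split(records, "case", ["smoker", "family_history"])
print(result)  # the hypothetical smoker variable is selected
```

Applying the same step recursively to each resulting subgroup yields the tree, and each path from the root to a terminal node can be read as a decision rule describing one segment of participants.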

Participants

The inpatients and outpatients visiting Kangnam St Mary's hospital from October

Comparison of characteristics between cervical cancer cases and controls

The characteristics of the study population are shown in Table 1. The mean age was 49.97 years for the cervical cancer cases and 45.62 years for the controls, with a wide range from 20 to 74 years. We performed classical statistical analysis to examine the difference in the distribution of variables between the cervical cancer cases and controls. Table 1 shows the variables that were significantly different between the two groups based on the t-test and chi-square test at the 5% level for

Discussion

Cervical cancer is a very common but preventable cancer. From an epidemiologic and cancer prevention perspective, great strides have been made in the developed world towards dramatically reducing the incidence and mortality from this disease (Janicek & Averette, 2001). The new approach to the analysis of risk factors and management of cervical cancer was discussed in this paper. This study examined the characteristics of induction technique to demonstrate how they can be used to predict

Acknowledgments

This work was supported by grant number FG01-0203-001 of the 21C Frontier Functional Human Genome Project from the Ministry of Science and Technology of Korea. In addition, the authors wish to thank Professor Ingoo Han and Dr Tae Hyup Roh for their advice during manuscript preparation.

References (34)

  • J.S. Park et al. Inactivation of interferon regulatory factor-1 tumor suppressor protein by HPV E7 protein; Implication for the E7-mediated immune evasion mechanism in cervical carcinogenesis. Journal of Biological Chemistry (2000)
  • P.C. Pendharkar et al. Association, statistical, mathematical and neural approaches for mining breast cancer patterns. Expert Systems with Applications (1999)
  • J.W. Roh et al. Polymorphism in codon 31 of p21 and cervical cancer susceptibility in Korean women. Cancer Letters (2001)
  • S.E. Waggoner. Cervical cancer. Lancet (2003)
  • S. Anderson et al. The significance of p53 codon 72 polymorphism for the development of cervical adenocarcinomas. British Journal of Cancer (2001)
  • M. Burger et al. Cigarette smoking and human papillomavirus in patients with reported cervical cytological abnormality. British Medical Journal (1993)
  • A. Dybikowska et al. p53 codon 72 polymorphism in cervical cancer patients and healthy women from Poland. Acta Biochemistry (2000)