Analysis on risk factors for cervical cancer using induction technique
Introduction
Cervical cancer is a major worldwide health problem with incidence and mortality rates second only to breast cancer (Herzog, 2003). It is the number one killer of young women in underdeveloped countries (Waggoner, 2003). In Korean women, cervical cancer is the third most common form of cancer, after stomach and breast cancer (National Statistical Office, 2001).
Reported risk factors for cervical cancer include early age at first intercourse, menarche, number of children, education level, smoking, family history and human papillomavirus (HPV) (Ferenczy and Franco, 2002, Janicek and Averette, 2001). HPV is understood to be a necessary but not sufficient cause of cervical cancer (Cox, 1995), with various other risk factors being potentially important. However, it is still unclear what these risk factors are, or how they act together in the development of cervical cancer. Even people exposed to the same environmental factors vary in their susceptibility to a specific disease, particularly because of differences in genetic background.
In cervical cancer, a single nucleotide polymorphism (SNP) in codon 72 of the tumor suppressor gene p53 has been the subject of considerable controversy. Some studies show an over-representation of homozygous p53Arg in cervical cancer compared with heterozygous or homozygous p53Pro (Storey et al., 1998, Anderson et al., 2001, Yang et al., 2001). However, other reports show no relationship between the p53 polymorphism and cervical cancer (Minoguchi et al., 1998, Dybikowska et al., 2000, Wong et al., 2000). The codon 31 Ser/Ser homozygote of the p21 gene, which encodes one of the p53 effector proteins, could also be a risk factor for the development of cervical cancer (Roh et al., 2001).
Interferon regulatory factor-1 (IRF-1) binds to regulatory elements of the genes encoding interferons, a family of cytokines with antiviral as well as tumor-suppressing activities (Harada et al., 1989, Harada et al., 1993). Inactivation of IRF-1 by the HPV E7 protein may influence the immune mechanism in cervical carcinogenesis (Park et al., 2000). The Fragile Histidine Triad (FHIT) gene, a putative tumor suppressor gene located in chromosome region 3p14.2, appears to be particularly susceptible to carcinogens from cigarette smoking, which is a well-known epidemiologic risk factor for cervical cancer. However, no study has examined whether FHIT is a potential genetic link between environmental factors, including cigarette smoking, and cervical cancer. These four genes (p53, p21, IRF-1 and FHIT) were chosen because cervical cancer is an HPV-associated disease in whose carcinogenesis they play an important role.
The relationship between gene polymorphism and cervical carcinogenesis has been studied by many groups. However, these previous studies have failed to produce clear conclusions, because most of them dealt with only one type of polymorphism and focused on the biomolecular basis without taking environmental factors into consideration.
Most past studies on cervical cancer have focused on describing individual patient characteristics and genetic polymorphisms separately, using statistical methods. As a result, the relative importance of the risk factors could not be compared and their relative statistical significance could not be established. In addition, no study has attempted to analyze the relationships among demographic, environmental, and genetic factors collectively.
To overcome the limitations of past studies, we used data mining to discover combined patterns of cervical cancer risk factors, including demographic, environmental and genetic factors. Data mining, which is also known as knowledge discovery in databases, is the nontrivial extraction of implicit, previously unknown and potentially useful information from a database (Bose & Mahapatra, 2001). In general, data mining is an essential step in knowledge discovery in which intelligent methods are applied in order to extract data patterns.
This paper presents an induction technique as a data mining approach to discovering knowledge for predicting cervical cancer from demographic, environmental and genetic factors. We examined the risk factors for cervical cancer using logistic regression and a decision tree algorithm, and then analyzed the relationships among demographic, environmental and genetic factors for cervical cancer using the induction technique. In addition, we compared logistic regression with the decision tree algorithm, and present the practical use of rule induction for the management of cervical cancer.
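As a rough illustration of how a decision tree algorithm chooses which risk factor to split on, the sketch below ranks attributes by information gain. This is an assumption made for illustration: the text here does not state which splitting criterion the study used, and the records, attribute names (`hpv`, `smoker`, `status`) and helper functions are hypothetical toy values, not study data.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())

def information_gain(records, attribute, target):
    """Entropy reduction in `target` obtained by splitting on `attribute`."""
    base = entropy([r[target] for r in records])
    groups = {}
    for r in records:
        groups.setdefault(r[attribute], []).append(r[target])
    remainder = sum(len(g) / len(records) * entropy(g)
                    for g in groups.values())
    return base - remainder

def best_split(records, attributes, target):
    """Attribute with the highest information gain."""
    return max(attributes, key=lambda a: information_gain(records, a, target))

# Hypothetical toy records, invented purely to exercise the functions.
records = [
    {"hpv": "pos", "smoker": "yes", "status": "case"},
    {"hpv": "pos", "smoker": "no",  "status": "case"},
    {"hpv": "pos", "smoker": "yes", "status": "case"},
    {"hpv": "pos", "smoker": "yes", "status": "control"},
    {"hpv": "neg", "smoker": "yes", "status": "control"},
    {"hpv": "neg", "smoker": "no",  "status": "control"},
    {"hpv": "neg", "smoker": "no",  "status": "case"},
    {"hpv": "neg", "smoker": "no",  "status": "control"},
]

chosen = best_split(records, ["hpv", "smoker"], "status")
print(chosen)
```

In this toy set the `smoker` attribute leaves cases and controls evenly mixed in both branches, so it has no gain, while `hpv` separates them partially and is selected. A full tree would repeat the same selection recursively within each branch.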
Risk factors for cervical cancer
Cervical cancer is one of the most common cancers in women worldwide, and many risk factors for its development have been identified (Braun & Gavey, 1999). Carcinogenesis is a complex multistep process involving a number of genetic and epigenetic events. The primary cause of cervical cancer is HPV, which is considered the main etiologic agent of cervical cancer and its high-grade precursor lesions (Ferenczy et al., 2002). Although many HPV types have been associated with
Research model
We analyzed the simultaneous relationships among demographic, environmental and genetic factors for cervical cancer using the induction technique (Fig. 1). This study compared the relative effects of each risk factor for cervical cancer in a multivariate analysis model. We sought to discover significant patterns and relationships among the risk factors and to derive decision rules for the management of cervical cancer.
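A decision rule of the kind described above can be written as an IF-THEN statement over several factors and then checked against the data. The sketch below is a minimal example under stated assumptions: the rule, the attribute names (`age_group`, `smoker`, `p53`, `status`) and the six records are hypothetical and invented for illustration; they are not results or rules from this study.

```python
def rule_stats(records, conditions, predicted, target="status"):
    """Return (coverage, confidence) for IF conditions THEN target == predicted.

    coverage   = number of records matching every condition
    confidence = share of covered records whose target equals `predicted`
                 (None when no record is covered)
    """
    covered = [r for r in records
               if all(r[key] == value for key, value in conditions.items())]
    if not covered:
        return 0, None
    correct = sum(1 for r in covered if r[target] == predicted)
    return len(covered), correct / len(covered)

# Hypothetical records mixing a demographic, an environmental and a genetic factor.
records = [
    {"age_group": "40+", "smoker": "yes", "p53": "Arg/Arg", "status": "case"},
    {"age_group": "40+", "smoker": "yes", "p53": "Arg/Arg", "status": "case"},
    {"age_group": "<40", "smoker": "yes", "p53": "Arg/Arg", "status": "control"},
    {"age_group": "40+", "smoker": "no",  "p53": "Arg/Arg", "status": "control"},
    {"age_group": "<40", "smoker": "no",  "p53": "Arg/Pro", "status": "control"},
    {"age_group": "40+", "smoker": "yes", "p53": "Arg/Pro", "status": "case"},
]

# Hypothetical rule: IF smoker = yes AND p53 = Arg/Arg THEN status = case
coverage, confidence = rule_stats(
    records, {"smoker": "yes", "p53": "Arg/Arg"}, predicted="case")
print(coverage, confidence)
```

Reporting coverage alongside confidence matters because a rule that is always right but matches very few patients is of limited use for management decisions.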
Participants
The inpatients and outpatients visiting Kangnam St Mary's hospital from October
Comparison of characteristics between cervical cancer cases and controls
The characteristics of the study population are shown in Table 1. The mean age was 49.97 years for the cervical cancer cases and 45.62 years for the controls, with a wide range from 20 to 74 years. We performed the classical statistical analysis to examine the difference in the distribution of variables between the cervical cancer cases and controls. Table 1 shows the variables that were significantly different between the two groups based on the t-test and chi-square test at 5% level for
Discussion
Cervical cancer is a very common but preventable cancer. From an epidemiologic and cancer prevention perspective, great strides have been made in the developed world towards dramatically reducing the incidence of and mortality from this disease (Janicek & Averette, 2001). This paper discussed a new approach to the analysis of risk factors and the management of cervical cancer. This study examined the characteristics of the induction technique to demonstrate how they can be used to predict
Acknowledgments
This work was supported by grant number FG01-0203-001 of the 21C Frontier Functional Human Genome Project from the Ministry of Science and Technology of Korea. In addition, the authors wish to thank Professor Ingoo Han and Dr Tae Hyup Roh for their advice during manuscript preparation.
References (34)
- et al. Business data mining: a machine learning perspective. Information and Management (2001)
- et al. With the best of reasons: cervical cancer prevention policy and the suppression of sexual risk factor information. Social Science and Medicine (1999)
- et al. Data mining approach to policy analysis in a health insurance domain. International Journal of Medical Informatics (2001)
- Epidemiology of cervical intraepithelial neoplasia: the role of human papillomavirus. Clinical Obstetrics and Gynecology (1995)
- Contemporary theories of cervical carcinogenesis: the virus, the host, and the stem cell. Molecular Pathology (2000)
- et al. Persistent human papillomavirus infection and cervical neoplasia. Lancet (2002)
- et al. Structurally similar but functionally distinct factors, IRF-1 and IRF-2, bind to the same regulatory elements of IFN and IFN-inducible genes. Cell (1989)
- New approaches for the management of cervical cancer. Gynecologic Oncology (2003)
- et al. A comparison of logistic regression to decision-tree induction in a medical domain. Computers and Biomedical Research (1993)
- et al. Identification of benzopyrene metabolites in cervical mucus and DNA adducts in cervical tissues in humans by gas chromatography-mass spectrometry. Cancer Letters (1999)
- Inactivation of interferon regulatory factor-1 tumor suppressor protein by HPV E7 protein; Implication for the E7-mediated immune evasion mechanism in cervical carcinogenesis. Journal of Biological Chemistry
- Association, statistical, mathematical and neural approaches for mining breast cancer patterns. Expert Systems with Applications
- Polymorphism in codon 31 of p21 and cervical cancer susceptibility in Korean women. Cancer Letters
- Cervical cancer. Lancet
- The significance of p53 codon 72 polymorphism for the development of cervical adenocarcinomas. British Journal of Cancer
- Cigarette smoking and human papillomavirus in patients with reported cervical cytological abnormality. British Medical Journal
- p53 codon 72 polymorphism in cervical cancer patients and healthy women from Poland. Acta Biochemistry