Abstract
Single nucleotide polymorphisms (SNPs) are believed to underlie differences among humans and, to some degree, allow biomedical researchers to predict the risk of some diseases and to explain patients’ different responses to drug regimens. The HapMap Project makes millions of SNPs available, but the sheer size of the data also poses a major challenge for SNP research. Inspired by the recent work on population classification by Park et al. (2006), we attempt to find as few SNPs as possible, out of the original nearly 4 million, that classify the 3 populations in the HapMap genotype data. In this paper, we propose to first rank SNPs with a modified t-test measure and then combine the ranking with a classifier, e.g., a support vector machine, to find the optimal SNP subset. Compared with Park et al.’s result, our proposed method is more efficient in ranking features and classifying the three populations: we obtained perfect classification using only 11 SNPs, compared with the 82 SNPs used by Park et al.
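The rank-then-classify procedure described in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact form of the modified t-test, so the score below is an assumption: a commonly used multi-class t-score (class mean versus overall mean, scaled by a pooled within-class standard deviation). To keep the sketch dependency-free, a nearest-centroid classifier with leave-one-out validation stands in for the support vector machine; the function names `t_scores`, `loo_accuracy`, and `select_snps`, the 0/1/2 genotype coding, and the toy data are all hypothetical.

```python
from math import sqrt

def t_scores(X, y):
    """Assumed multi-class t-score per SNP: max over classes of
    |class mean - overall mean| / (m_k * pooled within-class std)."""
    n, classes = len(X), sorted(set(y))
    scores = []
    for i in range(len(X[0])):
        col = [row[i] for row in X]
        overall = sum(col) / n
        means, counts, ss_within = {}, {}, 0.0
        for k in classes:
            vals = [v for v, c in zip(col, y) if c == k]
            means[k], counts[k] = sum(vals) / len(vals), len(vals)
            ss_within += sum((v - means[k]) ** 2 for v in vals)
        # Small constant guards against zero within-class variance.
        s = sqrt(ss_within / (n - len(classes))) + 1e-9
        scores.append(max(abs(means[k] - overall) / (sqrt(1 / counts[k] + 1 / n) * s)
                          for k in classes))
    return scores

def loo_accuracy(X, y, feats):
    """Leave-one-out accuracy of a nearest-centroid classifier
    (a simple stand-in for the SVM named in the abstract)."""
    correct = 0
    for t in range(len(X)):
        sums, counts = {}, {}
        for j, (row, c) in enumerate(zip(X, y)):
            if j == t:
                continue  # hold out sample t
            acc = sums.setdefault(c, [0.0] * len(feats))
            for p, f in enumerate(feats):
                acc[p] += row[f]
            counts[c] = counts.get(c, 0) + 1
        def dist(c):
            return sum((X[t][f] - sums[c][p] / counts[c]) ** 2
                       for p, f in enumerate(feats))
        correct += min(sorted(sums), key=dist) == y[t]
    return correct / len(X)

def select_snps(X, y, target=1.0):
    """Add SNPs in ranked order until the target accuracy is reached."""
    scores = t_scores(X, y)
    ranked = sorted(range(len(scores)), key=lambda i: -scores[i])
    best_acc, best_feats = 0.0, []
    for m in range(1, len(ranked) + 1):
        acc = loo_accuracy(X, y, ranked[:m])
        if acc > best_acc:
            best_acc, best_feats = acc, ranked[:m]
        if acc >= target:
            break
    return best_feats, best_acc

# Toy genotypes (0/1/2 minor-allele counts): 9 individuals, 4 SNPs, 3 populations.
# SNP 1 is a fixed difference between populations; the others are noise.
X = [[0, 2, 1, 0], [1, 2, 0, 1], [0, 2, 1, 1],
     [1, 1, 1, 0], [0, 1, 0, 1], [1, 1, 1, 1],
     [0, 0, 0, 0], [1, 0, 1, 1], [0, 0, 1, 0]]
y = ["A", "A", "A", "B", "B", "B", "C", "C", "C"]

selected, accuracy = select_snps(X, y)
print(selected, accuracy)
```

On the toy data the search stops after the first ranked SNP, since that SNP alone separates the three groups. On real HapMap-scale data the ranking step would need a vectorized implementation, and the stopping rule and validation scheme would have to follow the paper itself rather than this sketch.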
References
Bafna, V., Halldorsson, B., Schwartz, R., Clark, A., Istrail, S.: Haplotypes and Informative SNP selection: Don’t block out information. In: Proc. of RECOMB, pp. 19–27 (2003)
Celedon, J.C.: Candidate genes, SNPs, Haplotypes and linkage disequilibrium. Powerpoint presentation (2004), http://innateimmunity.net/files/CANDGENES/siframes.html
Devore, J., Peck, R.: Statistics: The Exploration and Analysis of Data, 3rd edn. Duxbury Press, Pacific Grove (1997)
Duerinck, K.F.: (2001), http://www.duerinck.com/snp.html
Francois, R., Langrognet, F.: Double Cross Validation for Model Based Classification, User (2006), http://www.r-project.org/user-2006/Abstracts/Francois+Langrognet.pdf
Guyon, I., Elisseeff, A.: An Introduction to Variable and Feature Selection. Journal of Machine Learning Research 3, 1157–1182 (2003)
Halldórsson, B., Bafna, V., Lippert, R., Schwartz, R., de la Vega, F., Clark, A., Istrail, S.: Optimal haplotype block-free selection of tagging SNPs for genome-wide association studies. Genome Research 14, 1633–1640 (2004)
Halperin, E., Kimmel, G., Shamir, R.: Tag SNP selection in genotype data for maximizing SNP prediction accuracy. Bioinformatics 199, 195–203 (2005)
Hsu, C.W., Chang, C.C., Lin, C.J.: A practical guide to support vector classification. Technical report, Department of Computer Science and Information Engineering, National Taiwan University, Taipei (2003)
Human genome project information (2006), http://www.ornl.gov/sci/techresources/Human_Genome/faq/snps.html
Jaeger, J., Sengupta, R., Ruzzo, W.L.: Improved Gene Selection For Classification Of Microarrays. Pac. Symp. Biocomput., 53–64 (2003)
Keerthi, S.S., Lin, C.-J.: Asymptotic behaviors of support vector machines with Gaussian kernel. Neural Computation 15, 1667–1689 (2003)
Levner, I.: Feature selection and nearest centroid classification for protein mass spectrometry. BMC Bioinformatics 6, 68 (2005)
Liu, B., Wan, C.R., Wang, L.P.: An efficient semi-unsupervised gene selection method via spectral biclustering. IEEE Trans. on Nano-Bioscience 5, 110–114 (2006)
Mitra, P., Murthy, C.A., Pal, S.K.: Unsupervised feature selection using feature similarity. IEEE Trans. on Pattern Analysis and Machine Intelligence 3, 301–312 (2002)
Park, J.S., Hwang, S.H., Lee, Y.S., Kim, S.C.: SNP@Ethnos: a database of ethnically variant single-nucleotide polymorphisms. Nucleic Acids Research 0, D1–D5 (2006)
Phuong, T.M., Lin, Z., Altman, R.B.: Choosing SNPs using Feature Selection. In: Proc IEEE Comput Syst Bioinform Conf. 2005 (CSB 2005), pp. 301–309 (2005)
Pritchard, J.K., Przeworski, M.: Linkage disequilibrium in humans: models and data. Am. J. Hum. Genet. 69, 1–14 (2001)
Rosenberg, N.A., et al.: Informativeness of genetic markers for inference of ancestry. Am. J. Hum. Genet. 73, 1402–1422 (2003)
Rosenberg, N.A.: Algorithms for selecting informative marker panels for population assignment. Journal of Computational Biology 9, 1183–1201 (2005)
Su, Y., Murali, T.M., Pavlovic, V., Schaffer, M., Kasif, S.: RankGene: Identification of Diagnostic Genes Based on Expression Data. Bioinformatics 19, 1578–1579 (2003)
The International HapMap Consortium: The International HapMap Project. Nature 426, 789–796 (2003), www.hapmap.org/genotypes
Tibshirani, R., Hastie, T., Narasimhan, B., Chu, G.: Diagnosis of multiple cancer types by shrunken centroids of gene expression. PNAS 99, 6567–6572 (2002)
Trochim, W.M.: The Research Methods Knowledge Base, 2nd edn. Atomic Dog Publishing (2004), http://www.socialresearchmethods.net/kb/
Vapnik, V.: Statistical Learning Theory. Wiley, New York (1998)
Wang, L.P.: Support Vector Machines: Theory and Applications. Springer, Heidelberg (2005)
Wang, L.P., Chu, F., Xie, W.: Accurate cancer classification using expressions of very few genes. IEEE Transactions on Bioinformatics and Computational Biology 4, 40–53 (2007)
Wang, L.P., Fu, X.J.: Data Mining with Computational Intelligence. Springer, Berlin (2005)
Welch, B.L.: The generalization of ‘Student’s’ problem when several different population variances are involved. Biometrika 34, 28–35 (1947)
Wright, S.: The interpretation of population structure by F-statistics with special regard to systems of mating. Evolution 19, 395–420 (1965)
Wu, B., Abbott, T., Fishman, D., McMurray, W., Mor, G., Stone, K., Ward, D., Williams, K., Zhao, H.: Comparison of statistical methods for classification of ovarian cancer using mass spectrometry data. Bioinformatics 19, 1636–1643 (2003)
Zhen, L., Altman, R.B.: Finding Haplotype Tagging SNPs by Use of Principal Components Analysis. Am. J. Hum. Genet. 75, 850–861 (2004)
© 2008 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Zhou, N., Wang, L. (2008). Perfect Population Classification on Hapmap Data with a Small Number of SNPs. In: Ishikawa, M., Doya, K., Miyamoto, H., Yamakawa, T. (eds) Neural Information Processing. ICONIP 2007. Lecture Notes in Computer Science, vol 4985. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-69162-4_82
DOI: https://doi.org/10.1007/978-3-540-69162-4_82
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-69159-4
Online ISBN: 978-3-540-69162-4
eBook Packages: Computer Science, Computer Science (R0)