ABSTRACT
Many learning techniques have been applied to identify disease-associated genes. Early studies usually treated the problem as binary classification, in which the training set comprises positive and negative samples. Positive samples are constructed from known disease genes, whereas negative samples are the remaining genes, which are not known to be associated with diseases. This is a limitation of binary classification-based solutions: the negative training set should consist of actual non-disease genes, yet constructing such a set is nearly impossible in biomedical research. To reduce this uncertainty, more realistic classification-based methods have been proposed. For instance, a unary classification technique based on one-class SVM learns from positive samples only. In addition, because the remaining set may contain unknown disease genes, semi-supervised methods such as binary semi-supervised classification and positive and unlabeled (PU) learning have been proposed. In particular, PU learning methods, which learn from both known disease genes and the remaining genes, were shown to outperform the others. In these studies, data sources are usually represented in vectorial form for binary classifiers, whereas they are represented as kernel matrices for unary and PU learning classifiers. Kernel-based data fusion is only suitable for data of different types, and a comparison based on different data representations seems unfair. Therefore, in this study, we compared different classification techniques for disease gene prediction based on a vectorial representation of samples. The simulation results showed that the unary classification technique that combines density and class probability estimation strategies achieved the best performance, whereas the one-class SVM-based method performed worst. Interestingly, the performance of the best binary classification technique is comparable with that of the biased SVM-based PU learning and binary semi-supervised classification methods, and all of them are better than the multi-level SVM-based method.
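The three problem framings described above differ mainly in how the training set is assembled from the same vectorial data. The following is a minimal sketch of that difference, not the authors' pipeline: the function name `build_training_sets`, the gene identifiers, and the two-dimensional feature vectors are all hypothetical and chosen only for illustration.

```python
from typing import Dict, List, Optional, Set, Tuple

Vector = List[float]
Sample = Tuple[Vector, Optional[int]]  # (feature vector, label); None = unlabeled


def build_training_sets(
    features: Dict[str, Vector], known_disease_genes: Set[str]
) -> Dict[str, List[Sample]]:
    """Assemble training sets for the binary, unary, and PU framings."""
    positives = [v for g, v in features.items() if g in known_disease_genes]
    remaining = [v for g, v in features.items() if g not in known_disease_genes]

    return {
        # Binary: the remaining genes are (questionably) assumed to be negatives.
        "binary": [(v, 1) for v in positives] + [(v, 0) for v in remaining],
        # Unary (one-class): only the known disease genes are used.
        "unary": [(v, 1) for v in positives],
        # PU learning: the remaining genes are kept, but as unlabeled samples.
        "pu": [(v, 1) for v in positives] + [(v, None) for v in remaining],
    }


# Hypothetical toy data: four genes, two of them known disease genes.
toy_features = {
    "GENE_A": [0.9, 0.1],
    "GENE_B": [0.8, 0.3],
    "GENE_C": [0.2, 0.7],
    "GENE_D": [0.1, 0.9],
}
training_sets = build_training_sets(toy_features, {"GENE_A", "GENE_B"})
```

In this sketch the binary and PU sets contain the same vectors; only the labels attached to the remaining genes change, which is the source of the uncertainty discussed in the abstract.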