DOI: 10.1145/2833258.2833269
research-article

Towards more realistic machine learning techniques for prediction of disease-associated genes

Published: 03 December 2015

ABSTRACT

Many machine learning techniques have been applied to identify disease-associated genes. Early approaches usually treated the problem as binary classification, in which the training set comprises positive and negative samples. Positive samples are constructed from known disease genes, whereas negative samples are the remaining genes, which are not known to be associated with diseases. This is a limitation of binary classification-based solutions: the negative training set should consist of actual non-disease genes, but constructing such a set is nearly impossible in biomedical research. To reduce this uncertainty, more realistic classification-based methods have been proposed. For instance, a unary classification technique based on the one-class SVM learns from positive samples only. In addition, because the remaining set may contain unknown disease genes, semi-supervised methods such as binary semi-supervised classification and positive and unlabeled (PU) learning have been proposed. In particular, PU learning methods, which learn from both known disease genes and the remaining genes, were shown to outperform the others. In these studies, data sources are usually represented in vectorial form for binary classifiers but as kernel matrices for unary and PU learning classifiers. Kernel-based data fusion is suitable only for data of different types, and a comparison across different data representations seems unfair. Therefore, in this study, we compared different classification techniques for disease gene prediction using a vectorial representation of samples. The simulation results showed that the unary classification technique that combines density and class probability estimation achieved the best performance, whereas the one-class SVM-based method performed worst. Interestingly, the performance of the best binary classification technique is comparable to that of the biased SVM-based PU learning and binary semi-supervised classification methods, and all of these outperform the multi-level SVM-based method.
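
The framings compared in the abstract differ mainly in how a training set is built from the same vectorial features. The following is a minimal sketch of that difference only, not the authors' implementation. It assumes toy gene identifiers (G1 to G5), made-up two-dimensional feature vectors, and hypothetical penalty values `c_pos` and `c_unl`; none of these come from the paper, and no classifier is trained here.

```python
from typing import Dict, List, Set, Tuple

Vector = List[float]

# Toy data (assumption, for illustration only): gene id -> feature vector.
features: Dict[str, Vector] = {
    "G1": [0.9, 0.1],
    "G2": [0.8, 0.3],
    "G3": [0.2, 0.7],
    "G4": [0.85, 0.2],
    "G5": [0.1, 0.9],
}
# Genes already known to be disease-associated (assumption).
known: Set[str] = {"G1", "G2"}


def binary_set(feats: Dict[str, Vector], pos: Set[str]) -> Tuple[List[Vector], List[int]]:
    """Binary framing: known disease genes are positive (1);
    every remaining gene is treated as negative (0)."""
    genes = sorted(feats)
    return [feats[g] for g in genes], [1 if g in pos else 0 for g in genes]


def unary_set(feats: Dict[str, Vector], pos: Set[str]) -> List[Vector]:
    """Unary (one-class) framing: learn from positive samples only."""
    return [feats[g] for g in sorted(feats) if g in pos]


def pu_set(
    feats: Dict[str, Vector],
    pos: Set[str],
    c_pos: float = 10.0,  # hypothetical penalty for errors on known positives
    c_unl: float = 1.0,   # hypothetical, smaller penalty for unlabeled genes
) -> Tuple[List[Vector], List[int], List[float]]:
    """PU framing in the biased-SVM style: remaining genes are kept as
    unlabeled (-1) and given a lower misclassification cost, because some
    of them may be undiscovered disease genes."""
    genes = sorted(feats)
    X = [feats[g] for g in genes]
    y = [1 if g in pos else -1 for g in genes]
    w = [c_pos if g in pos else c_unl for g in genes]
    return X, y, w


X_bin, y_bin = binary_set(features, known)
X_one = unary_set(features, known)
X_pu, y_pu, w_pu = pu_set(features, known)
```

The per-sample costs returned by `pu_set` would then be handed to any classifier that accepts them; the asymmetry between `c_pos` and `c_unl` is what distinguishes this framing from plain binary classification on the same labels.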

  • Published in

    ACM Other conferences
    SoICT '15: Proceedings of the 6th International Symposium on Information and Communication Technology
    December 2015
    372 pages
    ISBN: 9781450338431
    DOI: 10.1145/2833258

    Copyright © 2015 ACM

    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Qualifiers

    • research-article
    • Research
    • Refereed limited

    Acceptance Rates

    SoICT '15 Paper Acceptance Rate: 49 of 106 submissions, 46%
    Overall Acceptance Rate: 147 of 318 submissions, 46%
