Towards more realistic machine learning techniques for prediction of disease-associated genes

Authors:
Duc-Hau Le
School of Computer Science and Engineering, Water Resources University, 175 Tay Son, Dong Da, Hanoi, Vietnam, (84) 912324564

Manh-Hien Nguyen
School of Computer Science and Engineering, Water Resources University, 175 Tay Son, Dong Da, Hanoi, Vietnam, (84) 912324564

SoICT '15: Proceedings of the 6th International Symposium on Information and Communication Technology, December 2015, Pages 116–120, https://doi.org/10.1145/2833258.2833269

Published: 03 December 2015

ABSTRACT

Many machine learning techniques have been applied to identify disease-associated genes. Early approaches usually treated the problem as binary classification, in which the training set comprises positive and negative samples. Positive samples are constructed from known disease genes, whereas negative samples are the remaining genes, which are not known to be associated with diseases. This is a limitation of binary classification-based solutions: the negative training set should consist of actual non-disease genes, but constructing such a set is nearly impossible in biomedical research. To reduce this uncertainty, more realistic classification-based methods have been proposed. For instance, a unary classification technique based on the one-class SVM learns from positive samples only. In addition, because the remaining set may contain unknown disease genes, semi-supervised methods such as binary semi-supervised classification and positive and unlabeled (PU) learning have been proposed. In particular, PU learning methods, which learn from both known disease genes and the remaining genes, were shown to outperform the others. In these studies, data sources are usually represented in vectorial form for binary classifiers but as kernel matrices for unary and PU learning classifiers. Kernel-based data fusion is suitable only for data of different types, and a comparison based on different data representations seems unfair. Therefore, in this study, we compared different classification techniques for disease gene prediction using a vectorial representation of samples. The simulation results showed that the unary classification technique that combines density and class probability estimation strategies achieved the best performance, whereas the one-class SVM-based method performed worst. Interestingly, the performance of the best binary classification technique was comparable with that of the biased SVM-based PU learning and binary semi-supervised classification methods, and all of these were better than the multi-level SVM-based method.
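
The three problem formulations contrasted in the abstract differ mainly in how the genes that are not known disease genes are treated. The following is a minimal sketch of that difference, not the authors' implementation: the gene identifiers, the function names `build_training_sets` and `biased_penalties`, and the penalty values are all hypothetical. The penalty scheme assumes the common reading of a biased SVM, in which errors on known positives cost more than errors on unlabeled genes.

```python
from typing import Dict, Sequence, Set


def build_training_sets(
    all_genes: Sequence[str], known_disease_genes: Set[str]
) -> Dict[str, Dict[str, int]]:
    """Label the same gene list under three problem formulations."""
    positives = [g for g in all_genes if g in known_disease_genes]
    remaining = [g for g in all_genes if g not in known_disease_genes]

    # Binary: the remaining genes are treated as if they were true negatives.
    binary = {g: 1 for g in positives}
    binary.update({g: -1 for g in remaining})

    # Unary (one-class): only the known disease genes are used for training.
    unary = {g: 1 for g in positives}

    # PU / semi-supervised: remaining genes are kept but marked unlabeled (0).
    pu = {g: 1 for g in positives}
    pu.update({g: 0 for g in remaining})

    return {"binary": binary, "unary": unary, "pu": pu}


def biased_penalties(
    pu_labels: Dict[str, int], c_pos: float = 10.0, c_unl: float = 1.0
) -> Dict[str, float]:
    """Per-gene misclassification penalties (assumed values, c_pos > c_unl).

    Unlabeled genes may hide unknown disease genes, so an error on them
    is penalized less than an error on a known disease gene.
    """
    return {g: (c_pos if y == 1 else c_unl) for g, y in pu_labels.items()}


# Hypothetical gene identifiers, for illustration only.
genes = ["G1", "G2", "G3", "G4", "G5"]
known = {"G1", "G4"}

sets = build_training_sets(genes, known)
penalties = biased_penalties(sets["pu"])
```

In this sketch the binary and PU sets contain the same genes; only the label given to the remaining genes (and, for the biased variant, the cost attached to errors on them) changes, which is the source of the uncertainty the abstract describes.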

Published in
SoICT '15: Proceedings of the 6th International Symposium on Information and Communication Technology
December 2015
372 pages
ISBN: 9781450338431
DOI: 10.1145/2833258
General Chairs: Huynh Quyet Thang (HUST, Vietnam), Le Anh Phuong (HUCE, Vietnam)
Program Chairs: Luc De Raedt (KULeuven, Belgium), Yves Deville (UCLouvain, Belgium), Marc Bui (EPHE, France), Truong Thi Dieu Linh (HUST, Vietnam)
Publications Chairs: Nguyen Thi Oanh (HUST, Vietnam), Dinh Viet Sang (HUST, Vietnam), Nguyen Ba Ngoc (HUST, Vietnam)
Copyright © 2015 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
Disease gene prediction
PU learning
binary classification
semi-supervised learning
unary classification
Qualifiers
- research-article
- Research
- Refereed limited
Acceptance Rates
SoICT '15 Paper Acceptance Rate: 49 of 106 submissions, 46%
Overall Acceptance Rate: 147 of 318 submissions, 46%