Abstract
In this chapter, we show how the predictive clustering tree framework can be used to predict the functions of genes. The gene function prediction task is an example of a hierarchical multi-label classification (HMC) task: genes may have multiple functions and these functions are organized in a hierarchy. The hierarchy of functions can be such that each function has at most one parent (tree structure) or such that functions may have multiple parents (DAG structure).
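The two hierarchy shapes mentioned above can be illustrated with a small sketch. It is not the chapter's implementation. It assumes the usual HMC convention that a gene annotated with a function is also annotated with all of that function's ancestors. The function identifiers, the parent maps, and the helper name `close_under_ancestors` are hypothetical and chosen only for illustration.

```python
from typing import Dict, List, Set


def close_under_ancestors(labels: Set[str], parents: Dict[str, List[str]]) -> Set[str]:
    """Return the given labels plus every ancestor reachable via the parent map.

    Assumption (not stated in the abstract): annotations respect the hierarchy,
    so a gene labelled with a function also carries all ancestor functions.
    """
    closed: Set[str] = set()
    stack = list(labels)
    while stack:
        term = stack.pop()
        if term in closed:
            continue  # already visited; matters in a DAG, where paths can merge
        closed.add(term)
        stack.extend(parents.get(term, []))
    return closed


# Hypothetical tree-shaped hierarchy: each function has at most one parent.
tree_parents = {"1": [], "2": [], "1.1": ["1"], "1.2": ["1"], "1.1.3": ["1.1"]}

# Hypothetical DAG-shaped hierarchy: function "c" has two parents.
dag_parents = {"a": [], "b": [], "c": ["a", "b"], "d": ["c"]}

# A gene may have several functions at once (multi-label).
tree_labels = close_under_ancestors({"1.1.3", "2"}, tree_parents)
dag_labels = close_under_ancestors({"d"}, dag_parents)

print(sorted(tree_labels))
print(sorted(dag_labels))
```

The same traversal serves both cases; the only difference is that a parent list in the DAG may hold more than one entry, which is why visited functions are tracked.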
© 2010 Springer Science+Business Media, LLC
About this chapter
Cite this chapter
Vens, C., Schietgat, L., Struyf, J., Blockeel, H., Kocev, D., Džeroski, S. (2010). Predicting Gene Function using Predictive Clustering Trees. In: Džeroski, S., Goethals, B., Panov, P. (eds) Inductive Databases and Constraint-Based Data Mining. Springer, New York, NY. https://doi.org/10.1007/978-1-4419-7738-0_15
DOI: https://doi.org/10.1007/978-1-4419-7738-0_15
Publisher Name: Springer, New York, NY
Print ISBN: 978-1-4419-7737-3
Online ISBN: 978-1-4419-7738-0
eBook Packages: Computer Science (R0)