ABSTRACT
Robust integration of information from multiple sources is an active and rapidly advancing area of research. However, the information in heterogeneous sources is often incomplete, so making maximum use of all the available information is a challenging problem. Most recent research on data integration has focused on cases where the information is available across all sources, and these methods cannot effectively integrate sources in the presence of partial information. We develop an ensemble method that boosts the decisions made by different models on individual sources and obtains robust results for the task of class prediction. We propose a heterogeneous boosting framework that uses all the available information even if some of the sources do not provide any information about some objects. We demonstrate the effectiveness of the proposed framework on the problem of gene function prediction and compare it to state-of-the-art methods using several real-world biological datasets. We also show that the proposed method outperforms the imputation schemes that are widely used when integrating data with partial information.
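The abstract does not spell out the boosting algorithm, so the following is only a minimal sketch of the general idea of combining per-source decisions without imputing missing sources. It is not the authors' method. The function names, the use of `None` to mark a source with no information about an object, the score range of [-1, 1], and the fixed per-source weights are all assumptions made for illustration.

```python
from typing import Optional, Sequence


def combine_available(scores: Sequence[Optional[float]],
                      weights: Sequence[float]) -> Optional[float]:
    """Weighted average of per-source scores, skipping missing sources.

    scores[i] is a hypothetical score in [-1, 1] from the model trained on
    source i, or None if source i has no information about this object.
    weights[i] is a hypothetical non-negative confidence in source i.
    """
    total = 0.0
    norm = 0.0
    for score, weight in zip(scores, weights):
        if score is None:
            # The source is silent for this object: it contributes nothing
            # and is left out of the normalisation, rather than being imputed.
            continue
        total += weight * score
        norm += weight
    if norm == 0.0:
        return None  # no source covers this object
    return total / norm


def predict(scores: Sequence[Optional[float]],
            weights: Sequence[float],
            threshold: float = 0.0) -> Optional[int]:
    """Return +1 or -1, or None when no source covers the object."""
    combined = combine_available(scores, weights)
    if combined is None:
        return None
    return 1 if combined > threshold else -1


if __name__ == "__main__":
    # Three sources; the second has no data for this object.
    print(predict([0.8, None, -0.2], [1.0, 2.0, 0.5]))
```

Normalising by the weights of the sources that are actually present keeps a heavily weighted but missing source from pulling the combined score toward zero, which is what filling the gap with a neutral value would do.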
Index Terms
- Robust prediction from multiple heterogeneous data sources with partial information