Abstract
Genomic strings are not of fixed length, but provide one-dimensional spatial data that do not divide, for conquering by machine learning, into manageable fixed-size chunks obeying Dietterich's independent and identically distributed assumption. We nonetheless need to divide genomic strings for conquering by machine learning, in this case for genomic prediction. Orthologs are genomic strings derived from a common ancestor and having the same biological function. Ortholog detection is biologically interesting since it informs us about protein divergence through evolution and, in the present context, also has important agricultural applications. The present paper indicates a means to obtain an associated fixed-size attribute vector for genomic string data, and thereby to divide and conquer the machine learning problem of ortholog detection, seen here as an analogy problem. The attributes are based both on the typical string similarity measures of bioinformatics and on a large number of differential metrics, many new to bioinformatics. Many of the differential metrics are based on evolutionary considerations, both theoretical and empirically observed, in some cases observed by the authors. C5.0 with AdaBoosting activated was employed, and the preliminary results reported herein for complete cDNA strings are very encouraging for eventually and usefully employing the techniques described for ortholog detection on the more readily available EST (incomplete) genomic data.
Machine learning [Mit97, RN95] involves algorithmic techniques for fitting programs to data and for outputting the programs so fit, for subsequent use in predicting future data. A program so fit to data is said to be learned.
Amino acid sequences fold into 3-D structures, but that, for us, will be taken into account in future work. See Section 6 below.
IL-2 is interleukin 2, an immune system protein.
Exons contain the coding portions of genes.
Applying attribute values for both chicken-mouse and chicken-human comparisons improves performance over just employing comparisons between chicken and one of these mammals.
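The idea of pooling attributes from two species comparisons can be sketched as follows. This is a minimal illustration only: the three attributes used here (k-mer Jaccard similarity, length ratio, and GC-content difference) and the function names `pair_attributes` and `combined_attributes` are hypothetical stand-ins, not the attributes or code of the paper. The point is that each pair of variable-length strings maps to a fixed-size vector, and the chicken-mouse and chicken-human vectors are concatenated into one example for the learner.

```python
def kmer_set(seq, k=3):
    """Return the set of all length-k substrings of seq (empty if seq is shorter than k)."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def gc_fraction(seq):
    """Fraction of G and C characters in seq; 0.0 for an empty string."""
    return (seq.count("G") + seq.count("C")) / len(seq) if seq else 0.0


def pair_attributes(a, b, k=3):
    """Map one pair of variable-length strings to a fixed-size attribute list.

    The three attributes are illustrative assumptions, not the paper's metrics.
    """
    ka, kb = kmer_set(a, k), kmer_set(b, k)
    union = ka | kb
    jaccard = len(ka & kb) / len(union) if union else 0.0
    length_ratio = min(len(a), len(b)) / max(len(a), len(b)) if a and b else 0.0
    gc_difference = abs(gc_fraction(a) - gc_fraction(b))
    return [jaccard, length_ratio, gc_difference]


def combined_attributes(chicken, mouse, human):
    """Concatenate chicken-mouse and chicken-human attributes into one vector."""
    return pair_attributes(chicken, mouse) + pair_attributes(chicken, human)


if __name__ == "__main__":
    # Toy strings, invented for illustration only.
    vector = combined_attributes("ATGGCGTACGTTAG", "ATGGCGTACGATAG", "ATGGCCTACGTTAGA")
    print(len(vector), vector)
```

Whatever the lengths of the three input strings, the resulting vector always has six entries, which is what allows a decision-tree learner to treat each comparison as one fixed-size example.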
Importantly, the voting weights are larger for the more accurate trees in the sequence of trees.
In the present project we are working only with exons or portions thereof.
Recall from Section 4 above that the ensemble of trees obtained from AdaBoosting makes its decisions by a judiciously weighted majority vote among the decisions of its constituent trees: even more usefully subtle decision making than that of any single tree.
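A weighted majority vote of this kind can be sketched as below. The weighting rule is an assumption: it uses the AdaBoost.M1 weight ln((1 - e) / e) of Freund and Schapire, where e is a tree's weighted training error, since the exact scheme used inside C5.0 is not given in this text. The names `voting_weight` and `weighted_vote` and the example labels and error rates are hypothetical.

```python
import math
from collections import defaultdict


def voting_weight(error_rate):
    """AdaBoost.M1-style weight ln((1 - e) / e); grows as the error e shrinks.

    Assumed weighting rule, not necessarily the one C5.0 uses.
    """
    if not 0.0 < error_rate < 0.5:
        raise ValueError("error rate must lie strictly between 0 and 0.5")
    return math.log((1.0 - error_rate) / error_rate)


def weighted_vote(predictions, error_rates):
    """Return the label with the largest total voting weight.

    predictions[i] is the label chosen by tree i, and error_rates[i]
    is that tree's (weighted) training error.
    """
    totals = defaultdict(float)
    for label, error_rate in zip(predictions, error_rates):
        totals[label] += voting_weight(error_rate)
    return max(totals, key=totals.get)


if __name__ == "__main__":
    # One accurate tree against two weak trees (invented numbers).
    labels = ["ortholog", "non-ortholog", "non-ortholog"]
    errors = [0.05, 0.40, 0.45]
    print(weighted_vote(labels, errors))
```

In this toy case the single accurate tree carries more weight than the two weak trees combined, so the ensemble decision differs from what an unweighted majority would give.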
http://www.cmpharm.ucsf.edu/~nomi/nnpredict.html and http://www.cmpharm.ucsf.edu/~nomi/nnpredict-instrucs.html
References
Stephen F. Altschul, Warren Gish, Webb Miller, Eugene W. Myers, and David J. Lipman. Basic local alignment search tool. J. Mol. Biol., 215:403–410, 1990.
D. Angluin, W. Gasarch, and C. Smith. Training sequences. Theoretical Computer Science, 66(3):255–272, 1989.
M.D. Adams, A.R. Kerlavage, R.D. Fleischmann, R.A. Fuldner, C.J. Bult, N.H. Lee, E.F. Kirkness, K.G. Weinstock, J.D. Gocayne, O. White, et al. Initial assessment of human gene diversity and expression patterns based upon 83 million nucleotides of cDNA sequence. Nature, 377:3–174, 1995.
S. Arikawa, S. Miyano, A. Shinohara, S. Kuhara, Y. Mukouchi, and T. Shinohara. A machine discovery from amino-acid-sequences by decision trees over regular patterns. New Generation Computing, 11:361–375, 1993.
Stephen F. Altschul, Thomas L. Madden, Alejandro A. Schäffer, Jinghui Zhang, Zheng Zhang, Webb Miller, and David J. Lipman. Gapped BLAST and PSI-BLAST: A new generation of protein database search programs. Nucleic Acids Research, 25(17):3389–3402, 1997.
R. Ashby. Design for a Brain: The Origin of Adaptive Behavior. Wiley, NY, second edition, 1960.
P. Baldi and S. Brunak. Bioinformatics: The Machine Learning Approach. MIT Press, Cambridge, MA, third edition, 1998.
E. Boros and Z. Füredi. Triangles covering the centre of an n-set. Geometriae Dedicata, 17:69–77, 1984.
Kai Bartlmae, Steffen Gutjahr, and Gholamreza Nakhaeizadeh. Incorporating prior knowledge about financial markets through neural multitask learning. In Proceedings of the Fifth International Conference on Neural Networks in the Capital Markets, 1997.
C. Burge and S. Karlin. Prediction of complete gene structures in human genomic DNA. J. Mol. Biol., 268:78–94, 1997.
Andreas D. Baxevanis and B.F. Francis Ouellette, editors. Bioinformatics: A Practical Guide to the Analysis of Genes and Proteins. John Wiley & Sons, Inc., 1998.
Richard A. Caruana. Multitask connectionist learning. In Proceedings of the 1993 Connectionist Models Summer School, pages 372–379, 1993.
R. Caruana. Algorithms and applications for multitask learning. In Proceedings of the Thirteenth International Conference on Machine Learning (ICML-96), pages 87–95. Morgan Kaufmann, San Francisco, CA, 1996.
J. Case, S. Jain, M. Ott, A. Sharma, and F. Stephan. Robust learning aided by context. Journal of Computer and System Sciences (Special Issue for COLT’98), 60:234–257, 2000.
Andrew Y. Cheng and Ming Ouyang. On algorithms for simplicial depth. In 13th Canadian Conference on Computational Geometry, pages 53–56. University of Waterloo, August 13–15, 2001.
Thomas G. Dietterich, Hermann Hild, and Ghulum Bakiri. A comparison of ID3 and backpropagation for English text-to-speech mapping. Machine Learning, 18(1):51–80, 1995.
T. Dietterich. The divide-and-conquer manifesto. In Proceedings of the 11th International Workshop on Algorithmic Learning Theory (ALT’00), Lecture Notes in Artificial Intelligence, pages 13–16. Springer-Verlag, Berlin, 2000.
T. Evans. A program for the solution of a class of geometric-analogy intelligence-test questions. In M. Minsky, editor, Semantic Information Processing, pages 271–353. MIT Press, 1968.
Y. Freund, Y. Mansour, and R. Schapire. Why averaging classifiers can protect against overfitting. In Proceedings of the Eighth International Workshop on Artificial Intelligence and Statistics, 2001.
Y. Freund and R. Schapire. Experiments with a new boosting algorithm. In Proceedings of the Thirteenth International Conference on Machine Learning (ICML-96), pages 148–156. Morgan Kaufmann, San Francisco, CA, 1996.
Y. Freund and R. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55:119–139, 1997.
Y. Freund and R. Schapire. A short introduction to boosting. Journal of Japanese Society for Artificial Intelligence, 14(5):771–780, 1999. In Japanese, translated by Naoki Abe; English version at http://www.research.att.com/~schapire/cgi-bin/uncompress-papers/FreundSc99.ps.
Y. Freund, R. Schapire, P. Bartlett, and W. Lee. Boosting the margin: A new explanation for the effectiveness of voting methods. The Annals of Statistics, 26(5):1651–1686, 1998.
X. Guan, R.J. Mural, J.R. Einstein, R.C. Mann, and E.C. Uberbacher. GRAIL: An integrated artificial intelligence system for gene recognition and interpretation. In Eighth IEEE Conference on AI Applications, pages 9–3, Monterey, CA, March 2–6, 1992. IEEE Computer Society Press.
O. Gotoh. An improved algorithm for matching biological sequences. J. Mol. Biol., 162:705–708, 1982.
Samuel Karlin and Stephen F. Altschul. Methods for assessing the statistical significance of molecular sequence features by using general scoring schemes. Proc. Natl. Acad. Sci. USA, 87:2264–2268, 1990.
D.G. Kneller, F.E. Cohen, and R. Langridge. Improvements in protein secondary structure prediction by an enhanced neural network. Journal of Molecular Biology, 214:171–182, 1990.
M. Kummer and F. Stephan. Inclusion problems in parallel learning and games. In Proceedings of the Workshop on Computational Learning Theory, pages 287–298. ACM Press, NY, July 1994. Journal version to appear, Journal of Computer and System Sciences (Special Issue for COLT 94), 52(3):403–420, 1996.
E. Kinber, C. Smith, M. Velauthapillai, and R. Wiehagen. On learning multiple concepts in parallel. In Proceedings of the Workshop on Computational Learning Theory, pages 175–181. ACM, NY, 1993.
Wen-Hsiung Li. Molecular Evolution. Sinauer Associates, Inc., 1997.
R.Y. Liu. On a notion of data depth based on random simplices. The Annals of Statistics, pages 405–414, 1990.
R.Y. Liu and K. Singh. A quality index based on data depth and multivariate rank tests. Journal of American Statistical Association, 88:252–260, 1993.
Wojciech Makalowski and Mark S. Boguski. Evolutionary parameters of the transcribed mammalian genome: An analysis of 2,820 orthologous rodent and human sequences. Proc. Natl. Acad. Sci. USA, 95:9407–9412, 1998.
T. Mitchell, R. Caruana, D. Freitag, J. McDermott, and D. Zabowski. Experience with a learning, personal assistant. Communications of the ACM, 37:80–91, 1994.
T. Mitchell. Machine Learning. McGraw Hill, 1997.
S. Matwin and M. Kubat. The role of context in concept learning. In M. Kubat and G. Widmer, editors, Proceedings of the ICML-96 Pre-Conference Workshop on Learning in Context-Sensitive Domains, Bari, Italy, pages 1–5, 1996.
D. Michie, D. Spiegelhalter, and C. Taylor, editors. Machine Learning, Neural and Statistical Classification. Ellis Horwood, NY, 1994.
Saul B. Needleman and Christian D. Wunsch. A general method applicable to the search for similarities in the amino acid sequence of two proteins. J. Mol. Biol., 48:443–453, 1970.
M.J. Pazzani, C.A. Brunk, and G. Silverstein. A knowledge-intensive approach to learning relational concepts. In L. Birnbaum and G. Collins, editors, Proceedings of the 8th International Workshop on Machine Learning, pages 432–436. Morgan Kaufmann, 1991.
William R. Pearson. Comparison of methods for searching protein sequence databases. Protein Science, 4:1145–1160, 1995.
L. Pratt, J. Mostow, and C. Kamm. Direct transfer of learned information among neural networks. In Proceedings of the 9th National Conference on Artificial Intelligence (AAAI-91), 1991.
J.R. Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann Publishers, San Mateo, CA, 1993.
J.R. Quinlan, 1997. Private communication.
R. Quinlan. Miniboosting decision trees. Journal of AI Research, 1998.
S. Russell and P. Norvig. Artificial Intelligence: A Modern Approach. Prentice-Hall, NJ, 1995.
Gerald M. Rubin, Mark D. Yandell, Jennifer R. Wortman, George L. Gabor Miklos, Catherine R. Nelson, Iswar K. Hariharan, Mark E. Fortini, Peter W. Li, Rolf Apweiler, Wolfgang Fleischmann, J. Michael Cherry, Steven Henikoff, Marain P. Skupski, Sima Misra, Michael Ashburner, Ewan Birney, Mark S. Boguski, Thomas Brody, Peter Brokstein, Susan E. Celniker, Stephen A. Chervitz, David Coates, Anibal Cravchik, Andrei Gabrielian, Richard F. Falle, William M. Gelbart, Reed A. George, Lawrence S.B. Goldstein, Fangcheng Gong, Ping Guan, Nomi L. Harris, Bruce A. Hay, Roger A. Hoskins, Jiayin Li, Zhenya Li, Richard O. Hynes, S.J.M. Jones, Peter M. Kuehl, Bruno Lemaitre, J. Troy Littleton, Debrah K. Morrison, Chris Mungall, Patrick H. O'Farrell, Oxana K. Pickeral, Chris Shue, Leslie B. Vosshall, Jiong Zhang, Qi Zhao, Xiangqun H. Zheng, Fei Zhong, Wenyan Zhong, Richard Gibbs, J. Craig Venter, Mark D. Adams, and Suzanna Lewis. Comparative genomics of the eukaryotes. Science, 287:2204–2215, 2000.
Paul M. Sharp, Elizabeth Cowe, Desmond G. Higgins, Denis C. Shields, Kenneth H. Wolfe, and Frank Wright. Codon usage patterns in Escherichia coli, Bacillus subtilis, Saccharomyces cerevisiae, Schizosaccharomyces pombe, Drosophila melanogaster and Homo sapiens: a review of the considerable within-species diversity. Nucleic Acids Research, 16(17):8207–8211, 1988.
Steven Salzberg, Arthur L. Delcher, Kenneth H. Fasman, and John Henderson. A decision tree system for finding genes in DNA. Journal of Computational Biology, 5(4):667–680, 1998.
David J. States and Warren Gish. Combined use of sequence similarity and codon bias for coding region identification. Journal of Computational Biology, 1(1):39–50, 1994.
R. Staden and A.D. McLachlan. Codon preference and its use in identifying protein coding regions in long DNA sequences. Nucleic Acids Research, 10(1):141–156, 1982.
Terrence J. Sejnowski and Charles Rosenberg. NETtalk: A parallel network that learns to read aloud. Technical Report JHU-EECS-86-01, Johns Hopkins University, 1986.
R. Sternberg. The Triarchic Mind. Viking, NY, 1988.
S. Thrun and J. Sullivan. Discovering structure in multiple learning tasks: The TC algorithm. In Proceedings of the Thirteenth International Conference on Machine Learning (ICML-96), pages 489–497. Morgan Kaufmann, San Francisco, CA, 1996.
V. Tirunagaru, L. Sofer, and J. Burnside. An expressed sequence tag database of activated chicken T cells: Sequence analysis of 5000 cDNA clones. Genomics, 2000. In press.
V. Vapnik. The Nature of Statistical Learning Theory. Springer Verlag, New York, 1995.
V. Vapnik. Statistical Learning Theory. John Wiley and Sons, New York, 1998.
© 2001 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Ouyang, M., Case, J., Burnside, J. (2001). Divide and Conquer Machine Learning for a Genomics Analogy Problem. In: Jantke, K.P., Shinohara, A. (eds) Discovery Science. DS 2001. Lecture Notes in Computer Science(), vol 2226. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-45650-3_26
DOI: https://doi.org/10.1007/3-540-45650-3_26
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-42956-2
Online ISBN: 978-3-540-45650-6
eBook Packages: Springer Book Archive