Divide and Conquer Machine Learning for a Genomics Analogy Problem

Ouyang, Ming; Case, John; Burnside, Joan

doi:10.1007/3-540-45650-3_26

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 2226)

Included in the following conference series:

International Conference on Discovery Science

Abstract

Genomic strings are not of fixed length, but provide one-dimensional spatial data that do not divide, for conquering by machine learning, into manageable fixed-size chunks obeying Dietterich's independent and identically distributed assumption. We nonetheless need to divide genomic strings for conquering by machine learning, in this case for genomic prediction. Orthologs are genomic strings derived from a common ancestor and having the same biological function. Ortholog detection is biologically interesting since it informs us about protein divergence through evolution and, in the present context, also has important agricultural applications. The present paper indicates means to obtain an associated (fixed-size) attribute vector for genomic string data and to divide and conquer the machine learning problem of ortholog detection, herein seen as an analogy problem. The attributes are based both on the typical string similarity measures of bioinformatics and on a large number of differential metrics, many new to bioinformatics. Many of the differential metrics are based on evolutionary considerations, both theoretical and empirically observed, in some cases observed by the authors. C5.0 with AdaBoosting activated was employed, and the preliminary results reported herein for complete cDNA strings are very encouraging for eventually and usefully employing the techniques described for ortholog detection on the more readily available EST (incomplete) genomic data.
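As a rough illustration of the idea of mapping a pair of variable-length strings to a fixed-size attribute vector, the sketch below computes three attributes for two nucleotide strings. The specific attributes (normalized edit-distance similarity, relative length difference, GC-content difference) and the function names are assumptions chosen for illustration; they are not the attributes used by the authors.

```python
from typing import List


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed row by row with dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def gc_content(s: str) -> float:
    """Fraction of G and C characters; 0.0 for an empty string."""
    return sum(1 for c in s if c in "GC") / len(s) if s else 0.0


def attribute_vector(a: str, b: str) -> List[float]:
    """Hypothetical fixed-size (length 3) attribute vector for a string pair."""
    longest = max(len(a), len(b), 1)
    return [
        1.0 - edit_distance(a, b) / longest,  # similarity-style attribute
        abs(len(a) - len(b)) / longest,       # differential: relative length
        abs(gc_content(a) - gc_content(b)),   # differential: GC content
    ]
```

Whatever the lengths of the two input strings, the result has the same number of entries, which is the property a decision-tree learner needs from its input.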

Machine learning [Mit97, RN95] involves algorithmic techniques for fitting programs to data and for outputting the programs fit for subsequent use in predicting future data. A program so fit to data is said to be learned.

Amino acid sequences fold into 3-D structures, but that, for us, will be taken into account in future work. See Section 6 below.

http://www.usda.gov/news/pubs/fbook98/ch1a.htm

IL-2 is interleukin 2, an immune system protein.

Exons contain the coding portions of genes.

Applying attribute values for both chicken-mouse and chicken-human comparisons improves performance over just employing comparisons between chicken and one of these mammals.

http://www.tigr.org/docs/tigr-scripts/egad scripts/role report.spl

Importantly, the voting weights are bigger for more accurate trees in the sequence of trees.
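One common way to obtain such weights is the AdaBoost.M1 rule, in which a tree with weighted training error e receives the voting weight ln((1 - e) / e). The sketch below assumes that rule; the boosting scheme inside C5.0 may differ in detail, and the function name is hypothetical.

```python
import math


def voting_weight(error_rate: float) -> float:
    """AdaBoost.M1-style voting weight for one tree (assumed rule).

    The weight grows as the tree's weighted error shrinks, so more
    accurate trees have more say in the final vote.
    """
    if not 0.0 < error_rate < 0.5:
        raise ValueError("error_rate must lie strictly between 0 and 0.5")
    return math.log((1.0 - error_rate) / error_rate)
```

Under this rule a tree with error 0.1 outvotes a tree with error 0.3, and a tree whose error approaches 0.5 contributes almost nothing.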

In the present project we are working only with exons or portions thereof.

Recall from Section 4 above that the ensemble of trees obtained from AdaBoosting makes its decisions by a judiciously weighted majority vote among the decisions of its constituent trees, which is even more usefully subtle decision making than that of any single tree.
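A weighted majority vote of this kind can be sketched as follows. The class labels and weights in the usage example are made up for illustration, and the function is a generic sketch rather than the C5.0 implementation.

```python
from collections import defaultdict
from typing import Dict, Hashable, Sequence


def weighted_vote(predictions: Sequence[Hashable],
                  weights: Sequence[float]) -> Hashable:
    """Return the label whose supporting trees carry the largest total weight."""
    if not predictions or len(predictions) != len(weights):
        raise ValueError("need one weight per prediction, and at least one tree")
    totals: Dict[Hashable, float] = defaultdict(float)
    for label, weight in zip(predictions, weights):
        totals[label] += weight
    return max(totals, key=totals.get)


# Hypothetical example: one accurate tree outvotes two weaker ones.
decision = weighted_vote(["ortholog", "not-ortholog", "not-ortholog"],
                         [2.0, 0.5, 0.5])
```

With equal weights the function reduces to a plain majority vote; unequal weights let a single reliable tree override several unreliable ones.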

http://www.tigr.org/docs/tigr-scripts/egad scripts/role report.spl

http://www.chickest.udel.edu

http://www.cmpharm.ucsf.edu/~nomi/nnpredict.html and http://www.cmpharm.ucsf.edu/~nomi/nnpredict-instrucs.html

References

Stephen F. Altschul, Warren Gish, Webb Miller, Eugene W. Myers, and David J. Lipman. Basic local alignment search tool. J. Mol. Biol., 215:403–410, 1990.
D. Angluin, W. Gasarch, and C. Smith. Training sequences. Theoretical Computer Science, 66(3):255–272, 1989.
M.D. Adams, A.R. Kerlavage, R.D. Fleischmann, R.A. Fuldner, C.J. Bult, N.H. Lee, E.F. Kirkness, K.G. Weinstock, J.D. Gocayne, O. White, et al. Initial assessment of human gene diversity and expression patterns based upon 83 million nucleotides of cDNA sequence. Nature, 377:3–174, 1995.
S. Arikawa, S. Miyano, A. Shinohara, S. Kuhara, Y. Mukouchi, and T. Shinohara. A machine discovery from amino-acid-sequences by decision trees over regular patterns. New Generation Computing, 11:361–375, 1993.
Stephen F. Altschul, Thomas L. Madden, Alejandro A. Schäffer, Jinghui Zhang, Zheng Zhang, Webb Miller, and David J. Lipman. Gapped BLAST and PSI-BLAST: A new generation of protein database search programs. Nucleic Acids Research, 25(17):3389–3402, 1997.
R. Ashby. Design for a Brain: The Origin of Adaptive Behavior. Wiley, NY, second edition, 1960.
P. Baldi and S. Brunak. Bioinformatics: The Machine Learning Approach. MIT Press, Cambridge, MA, third edition, 1998.
E. Boros and Z. Füredi. Triangles covering the centre of an n-set. Geometriae Dedicata, 17:69–77, 1984.
Kai Bartlmae, Steffen Gutjahr, and Gholamreza Nakhaeizadeh. Incorporating prior knowledge about financial markets through neural multitask learning. In Proceedings of the Fifth International Conference on Neural Networks in the Capital Markets, 1997.
C. Burge and S. Karlin. Prediction of complete gene structures in human genomic DNA. J. Mol. Biol., 268:78–94, 1997.
Andreas D. Baxevanis and B.F. Francis Ouellette, editors. Bioinformatics: A Practical Guide to the Analysis of Genes and Proteins. John Wiley & Sons, Inc., 1998.
Richard A. Caruana. Multitask connectionist learning. In Proceedings of the 1993 Connectionist Models Summer School, pages 372–379, 1993.
R. Caruana. Algorithms and applications for multitask learning. In Proceedings of the Thirteenth International Conference on Machine Learning (ICML-96), pages 87–95. Morgan Kaufmann, San Francisco, CA, 1996.
J. Case, S. Jain, M. Ott, A. Sharma, and F. Stephan. Robust learning aided by context. Journal of Computer and System Sciences (Special Issue for COLT '98), 60:234–257, 2000.
Andrew Y. Cheng and Ming Ouyang. On algorithms for simplicial depth. In 13th Canadian Conference on Computational Geometry, pages 53–56. University of Waterloo, August 13–15, 2001.
Thomas G. Dietterich, Hermann Hild, and Ghulum Bakiri. A comparison of ID3 and backpropagation for English text-to-speech mapping. Machine Learning, 18(1):51–80, 1995.
T. Dietterich. The divide-and-conquer manifesto. In Proceedings of the 11th International Workshop on Algorithmic Learning Theory (ALT '00), Lecture Notes in Artificial Intelligence, pages 13–16. Springer-Verlag, Berlin, 2000.
T. Evans. A program for the solution of a class of geometric-analogy intelligence-test questions. In M. Minsky, editor, Semantic Information Processing, pages 271–353. MIT Press, 1968.
Y. Freund, Y. Mansour, and R. Schapire. Why averaging classifiers can protect against overfitting. In Proceedings of the Eighth International Workshop on Artificial Intelligence and Statistics, 2001.
Y. Freund and R. Schapire. Experiments with a new boosting algorithm. In Proceedings of the Thirteenth International Conference on Machine Learning (ICML-96), pages 148–156. Morgan Kaufmann, San Francisco, CA, 1996.
Y. Freund and R. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55:119–139, 1997.
Y. Freund and R. Schapire. A short introduction to boosting. Journal of Japanese Society for Artificial Intelligence, 14(5):771–780, 1999. In Japanese and translated by Naoki Abe; English version at http://www.research.att.com/~schapire/cgi-bin/uncompress-papers/FreundSc99.ps.
Y. Freund, R. Schapire, P. Bartlett, and W. Lee. Boosting the margin: A new explanation for the effectiveness of voting methods. The Annals of Statistics, 26(5):1651–1686, 1998.
X. Guan, R.J. Mural, J.R. Einstein, R.C. Mann, and E.C. Uberbacher. GRAIL: An integrated artificial intelligence system for gene recognition and interpretation. In Eighth IEEE Conference on AI Applications, pages 9–3, Monterey, CA, March 2–6, 1992. IEEE Computer Society Press.
O. Gotoh. An improved algorithm for matching biological sequences. J. Mol. Biol., 162:705–708, 1982.
Samuel Karlin and Stephen F. Altschul. Methods for assessing the statistical significance of molecular sequence features by using general scoring schemes. Proc. Natl. Acad. Sci. USA, 87:2264–2268, 1990.
D.G. Kneller, F.E. Cohen, and R. Langridge. Improvements in protein secondary structure prediction by an enhanced neural network. Journal of Molecular Biology, 214:171–182, 1990.
M. Kummer and F. Stephan. Inclusion problems in parallel learning and games. In Proceedings of the Workshop on Computational Learning Theory, pages 287–298. ACM Press, NY, July 1994. Journal version to appear, Journal of Computer and System Sciences (Special Issue for COLT '94), 52(3):403–420, 1996.
E. Kinber, C. Smith, M. Velauthapillai, and R. Wiehagen. On learning multiple concepts in parallel. In Proceedings of the Workshop on Computational Learning Theory, pages 175–181. ACM, NY, 1993.
Wen-Hsiung Li. Molecular Evolution. Sinauer Associates, Inc., 1997.
R.Y. Liu. On a notion of data depth based on random simplices. The Annals of Statistics, pages 405–414, 1990.
R.Y. Liu and K. Singh. A quality index based on data depth and multivariate rank tests. Journal of American Statistical Association, 88:252–260, 1993.
Wojciech Makalowski and Mark S. Boguski. Evolutionary parameters of the transcribed mammalian genome: An analysis of 2,820 orthologous rodent and human sequences. Proc. Natl. Acad. Sci. USA, 95:9407–9412, 1998.
T. Mitchell, R. Caruana, D. Freitag, J. McDermott, and D. Zabowski. Experience with a learning, personal assistant. Communications of the ACM, 37:80–91, 1994.
T. Mitchell. Machine Learning. McGraw Hill, 1997.
S. Matwin and M. Kubat. The role of context in concept learning. In M. Kubat and G. Widmer, editors, Proceedings of the ICML-96 Pre-Conference Workshop on Learning in Context-Sensitive Domains, Bari, Italy, pages 1–5, 1996.
D. Michie, D. Spiegelhalter, and C. Taylor, editors. Machine Learning, Neural and Statistical Classification. Ellis Horwood, NY, 1994.
Saul B. Needleman and Christian D. Wunsch. A general method applicable to the search for similarities in the amino acid sequence of two proteins. J. Mol. Biol., 48:443–453, 1970.
M.J. Pazzani, C.A. Brunk, and G. Silverstein. A knowledge-intensive approach to learning relational concepts. In L. Birnbaum and G. Collins, editors, Proceedings of the 8th International Workshop on Machine Learning, pages 432–436. Morgan Kaufmann, 1991.
William R. Pearson. Comparison of methods for searching protein sequence databases. Protein Science, 4:1145–1160, 1995.
L. Pratt, J. Mostow, and C. Kamm. Direct transfer of learned information among neural networks. In Proceedings of the 9th National Conference on Artificial Intelligence (AAAI-91), 1991.
J.R. Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann Publishers, San Mateo, CA, 1993.
J.R. Quinlan, 1997. Private communication.
R. Quinlan. Miniboosting decision trees. Journal of AI Research, 1998.
S. Russell and P. Norvig. Artificial Intelligence: A Modern Approach. Prentice-Hall, NJ, 1995.
Gerald M. Rubin, Mark D. Yandell, Jennifer R. Wortman, George L. Gabor Miklos, Catherine R. Nelson, Iswar K. Hariharan, Mark E. Fortini, Peter W. Li, Rolf Apweiler, Wolfgang Fleischmann, J. Michael Cherry, Steven Henikoff, Marian P. Skupski, Sima Misra, Michael Ashburner, Ewan Birney, Mark S. Boguski, Thomas Brody, Peter Brokstein, Susan E. Celniker, Stephen A. Chervitz, David Coates, Anibal Cravchik, Andrei Gabrielian, Richard F. Galle, William M. Gelbart, Reed A. George, Lawrence S.B. Goldstein, Fangcheng Gong, Ping Guan, Nomi L. Harris, Bruce A. Hay, Roger A. Hoskins, Jiayin Li, Zhenya Li, Richard O. Hynes, S.J.M. Jones, Peter M. Kuehl, Bruno Lemaitre, J. Troy Littleton, Debrah K. Morrison, Chris Mungall, Patrick H. O'Farrell, Oxana K. Pickeral, Chris Shue, Leslie B. Vosshall, Jiong Zhang, Qi Zhao, Xiangqun H. Zheng, Fei Zhong, Wenyan Zhong, Richard Gibbs, J. Craig Venter, Mark D. Adams, and Suzanna Lewis. Comparative genomics of the eukaryotes. Science, 287:2204–2215, 2000.
Paul M. Sharp, Elizabeth Cowe, Desmond G. Higgins, Denis C. Shields, Kenneth H. Wolfe, and Frank Wright. Codon usage patterns in Escherichia coli, Bacillus subtilis, Saccharomyces cerevisiae, Schizosaccharomyces pombe, Drosophila melanogaster and Homo sapiens: a review of the considerable within-species diversity. Nucleic Acids Research, 16(17):8207–8211, 1988.
Steven Salzberg, Arthur L. Delcher, Kenneth H. Fasman, and John Henderson. A decision tree system for finding genes in DNA. Journal of Computational Biology, 5(4):667–680, 1998.
David J. States and Warren Gish. Combined use of sequence similarity and codon bias for coding region identification. Journal of Computational Biology, 1(1):39–50, 1994.
R. Staden and A.D. McLachlan. Codon preference and its use in identifying protein coding regions in long DNA sequences. Nucleic Acids Research, 10(1):141–156, 1982.
Terrence J. Sejnowski and Charles Rosenberg. NETtalk: A parallel network that learns to read aloud. Technical Report JHU-EECS-86-01, Johns Hopkins University, 1986.
R. Sternberg. The Triarchic Mind. Viking, NY, 1988.
S. Thrun and J. Sullivan. Discovering structure in multiple learning tasks: The TC algorithm. In Proceedings of the Thirteenth International Conference on Machine Learning (ICML-96), pages 489–497. Morgan Kaufmann, San Francisco, CA, 1996.
V. Tirunagaru, L. Sofer, and J. Burnside. An expressed sequence tag database of activated chicken T cells: Sequence analysis of 5000 cDNA clones. Genomics, 2000. In press.
V. Vapnik. The Nature of Statistical Learning Theory. Springer Verlag, New York, 1995.
V. Vapnik. Statistical Learning Theory. John Wiley and Sons, New York, 1998.

Author information

Authors and Affiliations

Environmental and Occupational Health Sciences Institute, UMDNJ Robert Wood Johnson Medical School and Rutgers, The State University of New Jersey, Piscataway, NJ 08854, USA
Ming Ouyang
Department of CIS, University of Delaware, Newark, DE 19716, USA
John Case
Department of Animal & Food Sciences, University of Delaware, Newark, DE 19716, USA
Joan Burnside

Editor information

Editors and Affiliations

DFKI GmbH Saarbrücken, 66123, Saarbrücken, Germany
Klaus P. Jantke
Department of Informatics, Kyushu University, 6-10-1 Hakozaki, Higashi-ku, 812-8581, Fukuoka, Japan
Ayumi Shinohara

Ouyang, M., Case, J., Burnside, J. (2001). Divide and Conquer Machine Learning for a Genomics Analogy Problem. In: Jantke, K.P., Shinohara, A. (eds) Discovery Science. DS 2001. Lecture Notes in Computer Science, vol 2226. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-45650-3_26

DOI: https://doi.org/10.1007/3-540-45650-3_26
Published: 20 December 2001
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-42956-2
Online ISBN: 978-3-540-45650-6
eBook Packages: Springer Book Archive