Divide and Conquer Machine Learning for a Genomics Analogy Problem

(Progress Report)

  • Conference paper
Discovery Science (DS 2001)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 2226)

Abstract

Genomic strings are not of fixed length, but provide one-dimensional spatial data that do not divide, for conquering by machine learning, into manageable fixed-size chunks obeying Dietterich's independent and identically distributed assumption. We nonetheless need to divide genomic strings for conquering by machine learning, in this case for genomic prediction. Orthologs are genomic strings derived from a common ancestor and having the same biological function. Ortholog detection is biologically interesting since it informs us about protein divergence through evolution and, in the present context, also has important agricultural applications. The present paper indicates means to obtain an associated (fixed-size) attribute vector for genomic string data and to divide and conquer the machine learning problem of ortholog detection, herein seen as an analogy problem. The attributes are based both on the typical string similarity measures of bioinformatics and on a large number of differential metrics, many new to bioinformatics. Many of the differential metrics are based on evolutionary considerations, both theoretical and empirically observed, in some cases observed by the authors. C5.0 with AdaBoosting activated was employed, and the preliminary results reported herein regarding complete cDNA strings are very encouraging for eventually and usefully employing the techniques described for ortholog detection on the more readily available EST (incomplete) genomic data.
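As a rough illustration of the idea of an associated fixed-size attribute vector, the sketch below maps a pair of variable-length strings to a vector of constant length. The three attributes used here (a generic similarity ratio, a length ratio, and a GC-content difference) and the names `gc_content` and `pair_attributes` are assumptions chosen for illustration; they are not the similarity measures or differential metrics of the paper.

```python
from difflib import SequenceMatcher
from typing import List


def gc_content(seq: str) -> float:
    """Fraction of G or C bases in a nucleotide string (0.0 for empty input)."""
    if not seq:
        return 0.0
    return sum(base in "GC" for base in seq.upper()) / len(seq)


def pair_attributes(a: str, b: str) -> List[float]:
    """Map two variable-length strings to a fixed-length attribute vector.

    The attributes are illustrative stand-ins, not the paper's metrics.
    """
    # Generic string similarity in [0, 1]; a stand-in for alignment scores.
    similarity = SequenceMatcher(None, a, b, autojunk=False).ratio()
    # Shorter length over longer length; 1.0 means equal lengths.
    length_ratio = min(len(a), len(b)) / max(len(a), len(b), 1)
    # A simple "differential" attribute: absolute difference in GC content.
    gc_difference = abs(gc_content(a) - gc_content(b))
    return [similarity, length_ratio, gc_difference]


if __name__ == "__main__":
    # Strings of different lengths still yield vectors of the same size.
    print(pair_attributes("ATGGCC", "ATGGCCATTGA"))
    print(pair_attributes("ATGC", "ATGC"))
```

Whatever the input lengths, the output always has three entries, which is the property a decision-tree learner needs in order to treat each string pair as one training case.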

Machine learning [Mit97, RN95] involves algorithmic techniques for fitting programs to data and for outputting the programs fit for subsequent use in predicting future data. A program so fit to data is said to be learned.

Amino acid sequences fold into 3-D structures, but that, for us, will be taken into account in future work. See Section 6 below.

http://www.usda.gov/news/pubs/fbook98/ch1a.htm

IL-2 is interleukin 2, an immune system protein.

Exons contain the coding portions of genes.

Applying attribute values for both chicken-mouse and chicken-human comparisons improves performance over just employing comparisons between chicken and one of these mammals.

http://www.tigr.org/docs/tigr-scripts/egad_scripts/role_report.spl

Importantly, the voting weights are bigger for more accurate trees in the sequence of trees.

In the present project we are working only with exons or portions thereof.

Recall from Section 4 above that the ensemble of trees obtained from AdaBoosting makes its decisions by a judiciously weighted majority vote among the decisions of its constituent trees, which gives even more usefully subtle decision making than that of any single tree.
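The weighted majority vote can be sketched as follows. This is a minimal illustration that assumes the textbook AdaBoost.M1 weighting, in which a tree with training error e votes with weight ln((1 - e) / e); the exact weighting used inside C5.0 is not stated here, and the function names, error rates, and class labels are hypothetical.

```python
import math
from typing import Dict, Sequence


def tree_weight(error_rate: float) -> float:
    """AdaBoost.M1-style voting weight: ln((1 - e) / e).

    The weight grows as the error rate e shrinks, so more accurate
    trees have a bigger say in the vote.
    """
    if not 0.0 < error_rate < 0.5:
        raise ValueError("error rate must lie strictly between 0 and 0.5")
    return math.log((1.0 - error_rate) / error_rate)


def weighted_vote(predictions: Sequence[str], weights: Sequence[float]) -> str:
    """Return the label with the largest total weight among the trees' votes."""
    totals: Dict[str, float] = {}
    for label, weight in zip(predictions, weights):
        totals[label] = totals.get(label, 0.0) + weight
    return max(totals, key=totals.get)


if __name__ == "__main__":
    # Hypothetical error rates for three trees in the boosted sequence.
    weights = [tree_weight(e) for e in (0.1, 0.4, 0.4)]
    votes = ["ortholog", "non-ortholog", "non-ortholog"]
    # One accurate tree outvotes two weak ones that disagree with it.
    print(weighted_vote(votes, weights))
```

In this example a simple unweighted majority would pick the label chosen by the two weaker trees; the weighting is what lets the single more accurate tree prevail.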

http://www.tigr.org/docs/tigr-scripts/egad_scripts/role_report.spl

http://www.chickest.udel.edu

http://www.cmpharm.ucsf.edu/~nomi/nnpredict.html and http://www.cmpharm.ucsf.edu/~nomi/nnpredict-instrucs.html

References

  1. Stephen F. Altschul, Warren Gish, Webb Miller, Eugene W. Myers, and David J. Lipman. Basic local alignment search tool. J. Mol. Biol., 215:403–410, 1990.

  2. D. Angluin, W. Gasarch, and C. Smith. Training sequences. Theoretical Computer Science, 66(3):255–272, 1989.

  3. M.D. Adams, A.R. Kerlavage, R.D. Fleischmann, R.A. Fuldner, C.J. Bult, N.H. Lee, E.F. Kirkness, K.G. Weinstock, J.D. Gocayne, O. White, et al. Initial assessment of human gene diversity and expression patterns based upon 83 million nucleotides of cDNA sequence. Nature, 377:3–174, 1995.

  4. S. Arikawa, S. Miyano, A. Shinohara, S. Kuhara, Y. Mukouchi, and T. Shinohara. A machine discovery from amino-acid-sequences by decision trees over regular patterns. New Generation Computing, 11:361–375, 1993.

  5. Stephen F. Altschul, Thomas L. Madden, Alejandro A. Schäffer, Jinghui Zhang, Zheng Zhang, Webb Miller, and David J. Lipman. Gapped BLAST and PSI-BLAST: A new generation of protein database search programs. Nucleic Acids Research, 25(17):3389–3402, 1997.

  6. R. Ashby. Design for a Brain: The Origin of Adaptive Behavior. Wiley, NY, second edition, 1960.

  7. P. Baldi and S. Brunak. Bioinformatics: The Machine Learning Approach. MIT Press, Cambridge, MA, third edition, 1998.

  8. E. Boros and Z. Füredi. Triangles covering the centre of an n-set. Geometriae Dedicata, 17:69–77, 1984.

  9. Kai Bartlmae, Steffen Gutjahr, and Gholamreza Nakhaeizadeh. Incorporating prior knowledge about financial markets through neural multitask learning. In Proceedings of the Fifth International Conference on Neural Networks in the Capital Markets, 1997.

  10. C. Burge and S. Karlin. Prediction of complete gene structures in human genomic DNA. J. Mol. Biol., 268:78–94, 1997.

  11. Andreas D. Baxevanis and B.F. Francis Ouellette, editors. Bioinformatics: A Practical Guide to the Analysis of Genes and Proteins. John Wiley & Sons, Inc., 1998.

  12. Richard A. Caruana. Multitask connectionist learning. In Proceedings of the 1993 Connectionist Models Summer School, pages 372–379, 1993.

  13. R. Caruana. Algorithms and applications for multitask learning. In Proceedings of the Thirteenth International Conference on Machine Learning (ICML-96), pages 87–95. Morgan Kaufmann, San Francisco, CA, 1996.

  14. J. Case, S. Jain, M. Ott, A. Sharma, and F. Stephan. Robust learning aided by context. Journal of Computer and System Sciences (Special Issue for COLT'98), 60:234–257, 2000.

  15. Andrew Y. Cheng and Ming Ouyang. On algorithms for simplicial depth. In 13th Canadian Conference on Computational Geometry, pages 53–56. University of Waterloo, August 13–15, 2001.

  16. Thomas G. Dietterich, Hermann Hild, and Ghulum Bakiri. A comparison of ID3 and backpropagation for English text-to-speech mapping. Machine Learning, 18(1):51–15, 1995.

  17. T. Dietterich. The divide-and-conquer manifesto. In Proceedings of the 11th International Workshop on Algorithmic Learning Theory (ALT'00), Lecture Notes in Artificial Intelligence, pages 13–16. Springer-Verlag, Berlin, 2000.

  18. T. Evans. A program for the solution of a class of geometric-analogy intelligence-test questions. In M. Minsky, editor, Semantic Information Processing, pages 271–353. MIT Press, 1968.

  19. Y. Freund, Y. Mansour, and R. Schapire. Why averaging classifiers can protect against overfitting. In Proceedings of the Eighth International Workshop on Artificial Intelligence and Statistics, 2001.

  20. Y. Freund and R. Schapire. Experiments with a new boosting algorithm. In Proceedings of the Thirteenth International Conference on Machine Learning (ICML-96), pages 148–156. Morgan Kaufmann, San Francisco, CA, 1996.

  21. Y. Freund and R. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55:119–139, 1997.

  22. Y. Freund and R. Schapire. A short introduction to boosting. Journal of Japanese Society for Artificial Intelligence, 14(5):771–780, 1999. In Japanese and translated by Naoki Abe; English version at http://www.research.att.com/~schapire/cgi-bin/uncompress-papers/FreundSc99.ps.

  23. Y. Freund, R. Schapire, P. Bartlett, and W. Lee. Boosting the margin: A new explanation for the effectiveness of voting methods. The Annals of Statistics, 26(5):1651–1686, 1998.

  24. X. Guan, R.J. Mural, J.R. Einstein, R.C. Mann, and E.C. Uberbacher. GRAIL: An integrated artificial intelligence system for gene recognition and interpretation. In Eighth IEEE Conference on AI Applications, pages 9–3, Monterey, CA, March 2–6, 1992. IEEE Computer Society Press.

  25. O. Gotoh. An improved algorithm for matching biological sequences. J. Mol. Biol., 162:705–708, 1982.

  26. Samuel Karlin and Stephen F. Altschul. Methods for assessing the statistical significance of molecular sequence features by using general scoring schemes. Proc. Natl. Acad. Sci. USA, 87:2264–2268, 1990.

  27. D.G. Kneller, F.E. Cohen, and R. Langridge. Improvements in protein secondary structure prediction by an enhanced neural network. Journal of Molecular Biology, 214:171–182, 1990.

  28. M. Kummer and F. Stephan. Inclusion problems in parallel learning and games. In Proceedings of the Workshop on Computational Learning Theory, pages 287–298. ACM Press, NY, July 1994. Journal version to appear, Journal of Computer and System Sciences (Special Issue for COLT 94), 52(3):403–420, 1996.

  29. E. Kinber, C. Smith, M. Velauthapillai, and R. Wiehagen. On learning multiple concepts in parallel. In Proceedings of the Workshop on Computational Learning Theory, pages 175–81. ACM, NY, 1993.

  30. Wen-Hsiung Li. Molecular Evolution. Sinauer Associates, Inc., 1997.

  31. R.Y. Liu. On a notion of data depth based on random simplices. The Annals of Statistics, pages 405–414, 1990.

  32. R.Y. Liu and K. Singh. A quality index based on data depth and multivariate rank tests. Journal of American Statistical Association, 88:252–260, 1993.

  33. Wojciech Makalowski and Mark S. Boguski. Evolutionary parameters of the transcribed mammalian genome: An analysis of 2,820 orthologous rodent and human sequences. Proc. Natl. Acad. Sci. USA, 95:9407–9412, 1998.

  34. T. Mitchell, R. Caruana, D. Freitag, J. McDermott, and D. Zabowski. Experience with a learning, personal assistant. Communications of the ACM, 37:80–91, 1994.

  35. T. Mitchell. Machine Learning. McGraw Hill, 1997.

  36. S. Matwin and M. Kubat. The role of context in concept learning. In M. Kubat and G. Widmer, editors, Proceedings of the ICML-96 Pre-Conference Workshop on Learning in Context-Sensitive Domains, Bari, Italy, pages 1–5, 1996.

  37. D. Michie, D. Spiegelhalter, and C. Taylor, editors. Machine Learning, Neural and Statistical Classification. Ellis Horwood, NY, 1994.

  38. Saul B. Needleman and Christian D. Wunsch. A general method applicable to the search for similarities in the amino acid sequence of two proteins. J. Mol. Biol., 48:443–453, 1970.

  39. M.J. Pazzani, C.A. Brunk, and G. Silverstein. A knowledge-intensive approach to learning relational concepts. In L. Birnbaum and G. Collins, editors, Proceedings of the 8th International Workshop on Machine Learning, pages 432–436. Morgan Kaufmann, 1991.

  40. William R. Pearson. Comparison of methods for searching protein sequence databases. Protein Science, 4:1145–1160, 1995.

  41. L. Pratt, J. Mostow, and C. Kamm. Direct transfer of learned information among neural networks. In Proceedings of the 9th National Conference on Artificial Intelligence (AAAI-91), 1991.

  42. J.R. Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann Publishers, San Mateo, CA, 1993.

  43. J.R. Quinlan, 1997. Private communication.

  44. R. Quinlan. Miniboosting decision trees. Journal of AI Research, 1998.

  45. S. Russell and P. Norvig. Artificial Intelligence: A Modern Approach. Prentice-Hall, NJ, 1995.

  46. Gerald M. Rubin, Mark D. Yandell, Jennifer R. Wortman, George L. Gabor Miklos, Catherine R. Nelson, Iswar K. Hariharan, Mark E. Fortini, Peter W. Li, Rolf Apweiler, Wolfgang Fleischmann, J. Michael Cherry, Steven Henikoff, Marain P. Skupski, Sima Misra, Michael Ashburner, Ewan Birney, Mark S. Boguski, Thomas Brody, Peter Brokstein, Susan E. Celniker, Stephen A. Chervitz, David Coates, Anibal Cravchik, Andrei Gabrielian, Richard F. Falle, William M. Gelbart, Reed A. George, Lawrence S.B. Goldstein, Fangcheng Gong, Ping Guan, Nomi L. Harris, Bruce A. Hay, Roger A. Hoskins, Jiayin Li, Zhenya Li, Richard O. Hynes, S.J.M. Jones, Peter M. Kuehl, Bruno Lemaitre, J. Troy Littleton, Debrah K. Morrison, Chris Mungall, Patrick H. O'Farrell, Oxana K. Pickeral, Chris Shue, Leslie B. Vosshall, Jiong Zhang, Qi Zhao, Xiangqun H. Zheng, Fei Zhong, Wenyan Zhong, Richard Gibbs, J. Craig Venter, Mark D. Adams, and Suzanna Lewis. Comparative genomics of the eukaryotes. Science, 287:2204–2215, 2000.

  47. Paul M. Sharp, Elizabeth Cowe, Desmond G. Higgins, Denis C. Shields, Kenneth H. Wolfe, and Frank Wright. Codon usage patterns in Escherichia coli, Bacillus subtilis, Saccharomyces cerevisiae, Schizosaccharomyces pombe, Drosophila melanogaster and Homo sapiens: a review of the considerable within-species diversity. Nucleic Acids Research, 16(17):8207–8211, 1988.

  48. Steven Salzberg, Arthur L. Delcher, Kenneth H. Fasman, and John Henderson. A decision tree system for finding genes in DNA. Journal of Computational Biology, 5(4):667–680, 1998.

  49. David J. States and Warren Gish. Combined use of sequence similarity and codon bias for coding region identification. Journal of Computational Biology, 1(1):39–50, 1994.

  50. R. Staden and A.D. McLachlan. Codon preference and its use in identifying protein coding regions in long DNA sequences. Nucleic Acids Research, 10(1):141–156, 1982.

  51. Terrence J. Sejnowski and Charles Rosenberg. NETtalk: A parallel network that learns to read aloud. Technical Report JHU-EECS-86-01, Johns Hopkins University, 1986.

  52. R. Sternberg. The Triarchic Mind. Viking, NY, 1988.

  53. S. Thrun and J. Sullivan. Discovering structure in multiple learning tasks: The TC algorithm. In Proceedings of the Thirteenth International Conference on Machine Learning (ICML-96), pages 489–497. Morgan Kaufmann, San Francisco, CA, 1996.

  54. V. Tirunagaru, L. Sofer, and J. Burnside. An expressed sequence tag database of activated chicken T cells: Sequence analysis of 5000 cDNA clones. Genomics, 2000. In press.

  55. V. Vapnik. The Nature of Statistical Learning Theory. Springer Verlag, New York, 1995.

  56. V. Vapnik. Statistical Learning Theory. John Wiley and Sons, New York, 1998.

Copyright information

© 2001 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Ouyang, M., Case, J., Burnside, J. (2001). Divide and Conquer Machine Learning for a Genomics Analogy Problem. In: Jantke, K.P., Shinohara, A. (eds) Discovery Science. DS 2001. Lecture Notes in Computer Science, vol 2226. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-45650-3_26

  • DOI: https://doi.org/10.1007/3-540-45650-3_26

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-42956-2

  • Online ISBN: 978-3-540-45650-6

  • eBook Packages: Springer Book Archive
