Effect of training data size and noise level on support vector machines virtual screening of genotoxic compounds from large compound libraries

Journal of Computer-Aided Molecular Design

Abstract

Various in vitro and in-silico methods have been used for drug genotoxicity testing, but they show limited genotoxicity (GT+) and non-genotoxicity (GT−) identification rates. New methods and combinatorial approaches have been explored to enhance collective identification capability. The rates of in-silico methods may be further improved by significantly more diverse training data, enriched by the large number of recently reported GT+ and GT− compounds, but a major concern is the increased noise level arising from the high false-positive rates of in vitro data. In this work, we evaluated the effect of training data size and noise level on the performance of the support vector machine (SVM) method, which is known to tolerate high noise levels in training data. Two SVMs with different diversity/noise levels were developed and tested. H-SVM, trained on higher-diversity, higher-noise data (GT+ in any in vivo or in vitro test), outperforms L-SVM, trained on lower-noise, lower-diversity data (GT+ in in vivo or Ames tests only). H-SVM, trained on 4,763 GT+ compounds reported before 2008 and 8,232 GT− compounds excluding clinical trial drugs, correctly identified 81.6% of the 38 GT+ compounds reported since 2008, predicted 83.1% of the 2,008 clinical trial drugs as GT−, and predicted 23.96% of the 168K MDDR compounds and 27.23% of the 17.86M PubChem compounds as GT+. These figures are comparable to the 43.1–51.9% GT+ and 75–93% GT− rates of existing in-silico methods, the 58.8% GT+ and 79% GT− rates of the Ames method, and the estimated percentages of 23% in vivo and 31–33% in vitro GT+ compounds in the “universe of chemicals”. There is a substantial level of agreement between the GT+ and GT− MDDR compounds predicted by H-SVM and L-SVM and the predictions from TOPKAT. SVM showed good potential for identifying GT+ compounds from large compound libraries based on higher-diversity, higher-noise training data.
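The identification rates quoted in the abstract are simple class-wise percentages. As a minimal sketch, not the authors' code, the following Python shows how such a rate is computed and which integer counts are consistent with a reported rounded percentage. The function names `identification_rate` and `implied_counts` are hypothetical helpers introduced here for illustration. The abstract reports only percentages and set sizes, so any counts are inferred rather than stated.

```python
def identification_rate(correct: int, total: int) -> float:
    """Percentage of compounds in one class that receive the expected label."""
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0 <= correct <= total:
        raise ValueError("correct must lie between 0 and total")
    return 100.0 * correct / total


def implied_counts(rate_percent: float, total: int, decimals: int = 1) -> list[int]:
    """Integer counts k whose rate, rounded to `decimals` places, equals rate_percent.

    This only back-computes what a rounded percentage allows; it does not
    recover the true count when several values round to the same figure.
    """
    target = round(rate_percent, decimals)
    return [
        k
        for k in range(total + 1)
        if round(100.0 * k / total, decimals) == target
    ]


if __name__ == "__main__":
    # 81.6% of the 38 GT+ compounds reported since 2008 (figures from the abstract).
    print(implied_counts(81.6, 38))
    # 83.1% of the 2,008 clinical trial drugs predicted as GT-.
    print(implied_counts(83.1, 2008))
    print(f"{identification_rate(31, 38):.1f}%")
```

For the small GT+ test set only one count fits the reported figure, whereas for the 2,008 clinical trial drugs more than one count rounds to 83.1%, which is why the sketch returns a list rather than a single value.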

Author information

Corresponding author

Correspondence to Yu Zong Chen.

Additional information

Our SVM genotoxicity virtual screening models can be accessed at http://bidd.nus.edu.sg/gtox/gtox.html.

Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (DOC 56 kb)

About this article

Cite this article

Kumar, P., Ma, X., Liu, X. et al. Effect of training data size and noise level on support vector machines virtual screening of genotoxic compounds from large compound libraries. J Comput Aided Mol Des 25, 455–467 (2011). https://doi.org/10.1007/s10822-011-9431-3

  • DOI: https://doi.org/10.1007/s10822-011-9431-3
