Drug discovery using very large numbers of patents. General strategy with extensive use of match and edit operations

Journal of Computer-Aided Molecular Design

Abstract

A patent data base of 6.7 million compounds generated by a very high performance computer (Blue Gene) requires new techniques for exploitation when extensive use of chemical similarity is involved. Such exploitation includes the taxonomic classification of chemical themes, and data mining to assess mutual information between themes and companies. Importantly, we also launch candidates that evolve by “natural selection” as failure of partial match against the patent data base and their ability to bind to the protein target appropriately, by simulation on Blue Gene. An unusual feature of our method is that algorithms and workflows rely on dynamic interaction between match-and-edit instructions, which in practice are regular expressions. Similarity testing by these uses SMILES strings and, less frequently, graph or connectivity representations. Examining how this performs in high throughput, we note that chemical similarity and novelty are human concepts that largely have meaning by utility in specific contexts. For some purposes, mutual information involving chemical themes might be a better concept.
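As a rough illustration of what a match-and-edit step on SMILES text can look like, the following Python sketch treats a probe as a regular expression, tests it against a few SMILES strings, and applies one edit (a substitution) to produce a mutated candidate. The compound list, the probe pattern and the function names `matches`, `is_novel` and `edit` are illustrative assumptions and not the authors' system. Text-level matching of this kind is only meaningful if the SMILES strings are written in a consistent (e.g. canonical) form.

```python
import re

# Toy stand-in for patent compounds (illustrative, not from the data base).
PATENT_SMILES = [
    "CC(=O)Oc1ccccc1C(=O)O",   # aspirin
    "CC(=O)Nc1ccc(O)cc1",      # paracetamol
    "CCO",                     # ethanol
]

# A probe is a regular expression over SMILES text. This hypothetical probe
# asks for an acetyl group on an O or N that is bonded to an aromatic ring.
PROBE = re.compile(r"CC\(=O\)[ON]c1cc")

def matches(probe, smiles_list):
    """Return the SMILES strings that the probe partially matches."""
    return [s for s in smiles_list if probe.search(s)]

def is_novel(probe, smiles_list):
    """A probe counts as novel here if it matches nothing in the list."""
    return not matches(probe, smiles_list)

def edit(smiles, pattern, replacement):
    """One match-and-edit step: substitute the first match only."""
    return re.sub(pattern, replacement, smiles, count=1)

if __name__ == "__main__":
    print(matches(PROBE, PATENT_SMILES))
    # Mutate the ester oxygen of aspirin into an amide nitrogen.
    print(edit("CC(=O)Oc1ccccc1C(=O)O", r"\(=O\)O", "(=O)N"))
```

In this sketch a failed match is what marks a candidate as novel, mirroring the selection criterion described in the abstract; binding to the protein target would be assessed separately.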

Notes

  1. All US patents are read, and chemistries other than pharmaceutical are thus extracted. These are retained for reasons of detecting prior art in compositions of matter, relevance to chemical taxonomy, repurposing for pharmaceutical applications, and studies relating to toxicity.

  2. Such a grammar is a set of continuously applied replacement rules that allows creation from a start object of any permissible grammatical object but no other objects in the language. The grammar includes that of the reserved features for managing the match and edit of content. It is only possible to disprove a sound generative grammar by doing the computation and halting at a badly formed object.

  3. Strictly speaking, this is an assertion. In the present chemical context, we have not found a counterexample that is not a foolish choice or “bug”. It is complex because the grammar of an object and its reserved match-edit features may be correct, but the legality of content depends on perception of reasonable chemistry. Not least because some objects are input, we check automatically at intervals that the object grammar is legal, and that X is a valid chemistry, at the very least with permissible valencies.

  4. In other words, for this and what follows, the outcome is that if several objects are matched by one object, and several of the latter kind of objects are matched by one object, and so on, it implies a taxonomy, but it is not a requirement that the reserved match-edit features have the same format and grammar at each level.

  5. Connectivity strings are not regex but the fast and compressed form of these for exact binary matching methods, because as b increases the number of paths increases explosively. They are re-definable in the system and several implementations and variations are being explored. Currently, rows 2–4 and columns 5–9 of the periodic table are numbered 0–14, and iodine is added as 15. These are then expressed in binary, four bits per atom. The four-carbon fragment C–C–C–C is thus 0001000100010001, held in two 8-bit bytes or characters in memory.

  6. However, this is not very efficient because if a probe matches the patent data base, so will a waxed form of it, i.e. one broader in scope. Waxing should occur when there are no matches for a set number of mutations (editing iterations). Waning should occur both (a) in the case of matches against the patent data base and (b) at intervals when there are persistently no matches, in order to generate specific compounds. A critical parameter is thus a default or user-specified number of mutations that must elapse before a probe is deemed “persistently novel” and worthy of being waned to a specific candidate compound. Nonetheless, the optimal choice of parameter can vary from study to study, and it is best to start to wane when the novelty entropy (see above Footnote) reaches a critical value.

  7. In one mode, this information and that from small complete molecules can be used to select the chemical themes in a first pass, and not overlap their chemistries when they are seen in larger compounds. However, building blocks can be made out of building blocks, and we prefer an empirical approach based on association information described below.

  8. Obviously, this idea requires further analysis. While they typically appear to be regions of overlap between recurrent biochemical and synthetic themes, this may reflect the likelihood that companies associated with them tend to use the same synthetic strategy, while different companies have a bigger chance of using different strategies. Prominent in strong associations are ring systems, possibly suggesting more obvious synthetic strategies, as well as more restricted Canonical SMILES solutions.

  9. But this was confirmatory rather than discovery because the connection showed up in Internet research in Sect. 4.1 (e.g. Ref. [42]), so we do not describe our methods of building directories here.
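The binary connectivity encoding described in Note 5 can be sketched in a few lines of Python. The sketch assumes a row-major numbering that starts at boron, so that carbon receives code 1 (0001) as in the note's example; the rest of the table, the zero padding of a final half-filled byte, and the name `encode_path` are assumptions made for illustration only.

```python
# ASSUMPTION: row-major numbering starting at boron. This reproduces the
# carbon code 0001 from Note 5; the remaining assignments are a guess.
ELEMENT_CODES = {sym: i for i, sym in enumerate(
    ["B", "C", "N", "O", "F",
     "Al", "Si", "P", "S", "Cl",
     "Ga", "Ge", "As", "Se", "Br",
     "I"])}

def encode_path(atoms):
    """Encode a path of atoms at 4 bits per atom.

    Returns the bit string and the same bits packed into bytes,
    first atom in the high nibble of the first byte.
    """
    bits = "".join(format(ELEMENT_CODES[a], "04b") for a in atoms)
    if len(bits) % 8:
        # Hypothetical choice: pad an odd-length path with zero bits.
        # A real system would also need to store the atom count,
        # because 0000 is itself a valid element code here.
        bits += "0" * (8 - len(bits) % 8)
    packed = bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))
    return bits, packed

if __name__ == "__main__":
    bits, packed = encode_path(["C", "C", "C", "C"])
    print(bits)          # 0001000100010001
    print(len(packed))   # 2
```

Two paths can then be compared for an exact match by comparing their packed bytes directly, which is the kind of fast binary test the note contrasts with regex matching.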

References

  1. Adams RS (2006) Information sources in patents. Walter de Gruyter: Amsterdam, The Netherlands

  2. Lynch MF, Barnard JM, Welford SM (1981) Computer Storage and retrieval of generic chemical structures in patents, 1. Introduction and general strategy. J Chem Inf Comp Sci 21(3):148–150

  3. Downs GM, Barnard JM (1998) Chemical patents and structural information: The Sheffield research in context. J Documentation 54(1):106–120

  4. Oldach S, Stabinsk N (2009) The value of patent analytics, 2008. Intellectual property today. http://www.iptoday.com/articles/2008-6-oldach.asp. Accessed 20 Mar 2009

  5. Feldman R, Sanger J (2006) The text mining handbook: advanced approaches in analyzing unstructured data. Cambridge University Press, Cambridge

  6. Berks AH (2001) Current state of the art of Markush topological search systems. World Patent Inf 23(1):5–13

  7. Li J, Robson B (2000) Bioinformatics and computational chemistry in molecular design. Recent advances and their application. In Peptide and Protein Drug Analysis, Marcel Dekker NY, 285–307

  8. Paolini GV, Shapland HBR, van Hoorn WP, Mason JS, Hopkins AL (2006) Global mapping pharmaceutical space. Nat Biotechnol 24(7):805–815

  9. Chen YP, Chen F (2008) Identifying targets for drug discovery using bioinformatics. Expert Opin Ther Targets 12(4):383–389

  10. Digital Chemistry (2009) Digital chemistry. http://www.digitalchemistry.co.uk/prod_torus_patent.htm. Accessed 20 Jul 2009

  11. Reel Two, Reel Two web site (2007) http://www.reeltwo.com/. Accessed 20 Jul 2009

  12. Tripos Inc (2008) http://www.tripos.com/data/support/mol2.pdf. Accessed 5 Apr 2009

  13. Symyx, Symyx Web Page (2009) http://www.symyx.com. Accessed 10 Nov 2009

  14. Grant JA, Haigh JA, Pickup BT, Nicholls A, Sayle RA (2006) Lingos, finite state machines, and fast similarity searching. J Chem Inf Model 46(5):1912–1918

  15. Haque IS, Pande VS, Walters WP (2010) SIML: A fast SIMD algorithm for calculating LINGO chemical similarities on GPUs. J Chem Inf Model 50:560–564

  16. Rhodes J, Boyer S, Kreulen J, Chen Y, Ordonez P (2007) Mining patents using molecular similarity search. Pacific Symposium on Biocomputing, Maui, Hawaii, 3–7 January 2007. Ed. Altman et al. World Scientific Publishing, p 304–315. http://www.almaden.ibm.com/asr/projects/biw/publications/Rhodes.pdf

  17. Chen Y, Spangler S, Kreulen J, Boyer SK (2009) SIMPLE: A strategic information mining platform for IP excellence. In: IEEE international conference on data mining workshops, Miami, Florida, 6 Dec 2009. p 270–275. http://domino.research.ibm.com/library/cyberdig.nsf/papers/95D73078344701C9852576350055DBF3/$File/rj10450.pdf

  18. Weininger D (1988) SMILES, a chemical language and information system. 1. Introduction to methodology and encoding rules. J Chem Inf Comp Sci 28:31–36

  19. The Open Group, Regular Expressions (2009) The Single UNIX ® Specification, Version 2, 1997. Opengroup.org. http://www.opengroup.org/onlinepubs/007908799/xbd/re.html. Accessed 1 Aug 2009

  20. Wall L, The Perl Development Team (2006) Perl.org. http://perldoc.perl.org/perlre.html. Accessed 9/1/2009

  21. Fisanick W (1990) The chemical abstracts service generic chemical (Markush) structure storage and retrieval capability. 1. Basic concepts. J Chem Inf Comp Sci 30(2):145–154

  22. Barnard JM (1991) A comparison of different approaches to Markush structure handling. J Chem Inf Comp Sci 31(1):64–68

  23. Barnard JM (1993) Substructure searching methods: old and new. J Chem Inf Comp Sci 33(4):532–538

  24. Barnard JM, Downs GM (1997) Chemical fragment generation and clustering software. J Chem Inf Comp Sci 37(1):141–142

  25. Downs GM, Barnard JM (1997) Techniques for generating descriptive fingerprints in combinatorial libraries. J Chem Inf Comp Sci 37(1):59–61

  26. Barnard JM, Downs GM (1992) Clustering of chemical structures on the basis of two-dimensional similarity measure. J Chem Inf Comp Sci 32(6):644–649

  27. Brown RD, Martin YC (1996) Use of structure-activity data to compare structure-based clustering methods and descriptors for use in compound selection. J Chem Inf Comp Sci 36:572–584

  28. Robson B, Finn PW (1984) Rational design of conformationally flexible drugs. ATLA Journal. Alternatives to Laboratory Animals 11: 67–78

  29. Ivanciuc O (2003) Canonical numbering and constitutional symmetry. In: Handbook of Chemoinformatics, Ed. J. Gasteiger, Wiley-VCH, pp 139–160

  30. Daylight Chemical Systems, Inc (2009) http://www.daylight.com/. Accessed 10 Apr 2009

  31. Dethlefsen W, Lynch MF, Gillet VJ, Downs GM, Holliday JD, Barnard JM (1991) Computer storage and retrieval of generic chemical structures in patents. 12. Principles of search operations involving parameter lists: matching-relations, user-defined match levels, and transition from the reduced graph search to the refined search. J Chem Inf Comp Sci 31(2):253–260

  32. Robson B (1974) Analysis of the code relating sequence to conformation in globular proteins: theory and application of expected information. Biochem J 141:853–867

  33. Robson B (2008) Clinical and pharmacogenomic data mining: 4. The FANO program and command set as an example of tools for biomedical discovery and evidence based medicine. J Proteome Res 7(9):3922–3947

  34. Wikipedia (2010) http://en.wikipedia.org/wiki/IUPAC_nomenclature. Accessed 8/30/2010

  35. Wikipedia (2010) Wikipedia. http://en.wikipedia.org/wiki/Blue_Gene. Accessed 8/3/09

  36. Kramer A, Horn HW, Rice J (2003) Fast 3D molecular superposition and similarity search in databases of flexible molecules. J Comp Aided Mol Des 17(1):13–38

  37. IBM Corporation, Data Discovery and Query Builder’s User’s Guide (2006) IBM Corporation. http://publib.boulder.ibm.com/infocenter/systems/topic/ddqb/v2r1ddqbusersguide.pdf. Accessed 7 Apr 2009

  38. University of California San Francisco, http://zinc.docking.org/. Accessed 8 Aug 2009

  39. RCSB Protein data Bank (2008) http://www.wwpdb.org/docs.html. Accessed 5 Apr 2009

  40. Warner J (2004) Licorice root may keep mental skills sharp: compound derived from licorice root may fight effects of aging on brain. 2004, March. WebMD News. http://www.webmd.com/alzheimers/news/20040329/licorice-root-may-keep-mental-skills-sharp. Accessed 5 Apr 2009

  41. Livingstone DE, Walker BR (2003) Is 11beta-hydroxysteroid dehydrogenase type 1 a therapeutic target? Effects of carbenoxolone in lean and obese Zucker rats. J Pharmacol Exp Ther 305(1):167–172

  42. Wikipedia (2009) http://en.wikipedia.org/wiki/Zipf’s_law. Accessed 6 Aug 2009

  43. CAS, a division of the American Chemical Society. Support Page (2009) http://www.cas.org/support/scifi/index.html. Accessed 1 Jan 2010

  44. CAS, a division of the American Chemical Society, Products page (2009) http://www.cas.org/products/sfacad/index.html. Accessed 1 Jan 2010

  45. Schmidt MW, Baldridge KK, Boatz JA, Elbert ST, Gordon MS, Jensen JH, Koseki S, Matsunaga N, Su S, Windus TL, Dupuis M, Montgomery JA (1993) General atomic and molecular electronic structure system. J Comp Chem 14:1347–1363

  46. Peters A, Lundberg M, Sosa CP, Lang T (2007) High throughput computing validation for drug discovery using the DOCK program on a massively parallel system. 1st annual MSCBB, Northwestern University, Evanston, IL, September 2007; available as Peters A, Lundberg M, Lang T, and Sosa, CP, 2008, RedPaper 4410 from IBM Corporation Poughkeepsie, NY

  47. Balius TE, Mukherjee S (2008) Stony Brook University web site. http://www.ams.sunysb.edu/~tbalius/NamdandDockonNYBlue.pdf. Accessed 8 Aug 2009

  48. Shivakumar D (2008) (updated 2009). University of California San Francisco, http://dock.compbio.ucsf.edu/DOCK_6/tutorials/amber_score/amber_score.htm. Accessed 12 Aug 2009

  49. McWeeny R (1979) Coulson’s Valence, 3rd edn. Oxford University Press, Oxford, UK see Ch. 6

  50. Robson B, Curioni A, Mordasini T (2002) Studies in the assessment of folding quality for protein modeling and structure prediction. J Proteome Res (Am Chem Soc) 1(2):115–133

  51. Robson B, Vaithilingham A (2008) “Protein Folding Revisited” pp 161–202 in Progress in Molecular Biology and Translational Science, Vol 84: Molecular Biology of Protein Folding, Elsevier Press/Academic Press

  52. Robson B, Douglas GM, Platt E (1982) A new algorithm for rapid calculation of conformational energies. Biochem Soc Trans 10:388–389

  53. Robson B, Platt E (1986) Refined models for computer calculations in protein engineering. Calculation and testing of atomic potential functions compatible with more efficient calculations. 188: 259–281

  54. Collura VP, Greaney PJ, Robson B (1994) A method for rapidly assessing and refining simple solvent treatments in molecular modeling. Example studies on the antigen-combining loop H2 from FAB fragment McPC603. Protein Eng 7:221–233

Author information

Corresponding author

Correspondence to Barry Robson.

Additional information

This report also includes work done in part by authors Barry Robson, Amanda Peters and Stephen K. Boyer at IBM Corporation.

Appendix: Binding studies

Binding studies will be more familiar, but are well known to be nontrivial. Note first that in the catalytic region of chain A of the experimental 2BEL tetramer structure with carbenoxolone and NADP cofactor ligands (and rather similarly for the other monomers), the Ser 170 side chain oxygen atom makes a surprisingly tight 2.5 Å approach to the original double-bonded C=O oxygen atom O11 on C11 of the steroid-like framework. The sum of the van der Waals radii of the serine and ligand oxygen atoms is 1.52 + 1.52 = 3.04 Å. Following DOCK and AMBER, this O…O distance is 3.0 Å. Adding to the electronegative tension, there is also a close approach by the phenolic oxygen of Tyr 183 at 5.0 Å, and by oxygen O7N in the NADP cofactor at 4.9 Å. The other close approaches do not compensate significantly: they are most importantly the carbon C4N of NADP at 4.3 Å and the Ala 172 side chain carbon at 4.5 Å, but good binding comes from interactions involving the whole binding site region. It is substitutions at or around C11 that are compensated by the most surprising degree of binding site accommodation, i.e. significant conformational changes in backbone as well as side chains in the binding site. Replacing the O…O contact by O…S, using the thioketone (cboS1) as ligand, initially generates a considerable van der Waals and electrostatic repulsion as the thioketone S is strongly electronegative [49], but binding is still accommodated. These observations led to the following studies.

In order (a) to see if any multiple binding modes and binding site conformational changes may have been missed, and (b) to try to predict Blue Gene performance as a screening tool for potential candidates, a variety of extensive analyses and comparative studies using the full tetramer with NADP cofactors were performed using KRUNCH [50]. This can use and compare AMBER and other force fields and re-parameterize them, as well as perform both molecular mechanics and dynamics calculations. As an alternative to Blue Gene’s processing speed and brute force calculation, this has a large kit of techniques for spanning energy barriers and searching conformational space, developed and adapted over many years by Robson and coworkers. See Ref. [51] for a general review; the techniques of Refs. [52–54] were also used. Results relative to carbenoxolone are shown in column 3 of Table 2. The regression slope of columns 2 and 3 is close to unity at 0.97, with a Pearson’s correlation R of 0.88 and an intercept close to zero. That is after adjusting the slope for predictive purposes: prior to the latter, on average the RPFF energies were amplified 4.6% over the DOCK-AMBER results. The major difference here appears to be in electrostatics, because a simple 6% increase in effective dielectric constant in the RPFF brings the results into alignment (regression slope 0.95). Even recalling that this is predicting another calculation, not experiment, the above is a reasonable result within the state of the art. However, in more recent studies, probes started from the patent and ZINC data bases were left to evolve for several weeks. Some ligands bound specifically to the 2BEL binding site by the RPFF plus heuristic methods, but did not bind to the site using DOCK and AMBER. An early example was (2,3)-dithio-6-hydroxy-8-carboxy-(1,7,9)-azanaphthalene, as indicated by ‘**’ in the second column of Table 2. The common feature of these “prediction failures” is that they are small ligands of one or two rings, and display multiple binding modes.
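For readers who wish to reproduce this kind of comparison between two sets of computed binding energies, a minimal Python sketch of the least-squares slope, intercept and Pearson correlation follows. It assumes that one column is regressed on the other (here the second argument on the first); the function name `fit_line` is hypothetical, and the numbers in the usage section are synthetic placeholders, not values from Table 2.

```python
from math import sqrt

def fit_line(x, y):
    """Least-squares fit y = slope * x + intercept, plus Pearson's R."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r = sxy / sqrt(sxx * syy)
    return slope, intercept, r

if __name__ == "__main__":
    # Synthetic placeholder energies (arbitrary units), NOT from Table 2.
    reference = [-1.0, -2.0, -3.0, -4.0]
    # A second set uniformly amplified by 4.6% relative to the first.
    amplified = [1.046 * e for e in reference]
    slope, intercept, r = fit_line(reference, amplified)
    print(slope, intercept, r)
```

A uniform amplification of one set over the other appears as a slope different from unity with the correlation unchanged, which is why a slope adjustment alone can bring two otherwise well-correlated calculations into alignment.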

About this article

Cite this article

Robson, B., Li, J., Dettinger, R. et al. Drug discovery using very large numbers of patents. General strategy with extensive use of match and edit operations. J Comput Aided Mol Des 25, 427–441 (2011). https://doi.org/10.1007/s10822-011-9429-x

  • DOI: https://doi.org/10.1007/s10822-011-9429-x
