Drug discovery using very large numbers of patents. General strategy with extensive use of match and edit operations

Robson, Barry; Li, Jin; Dettinger, Richard; Peters, Amanda; Boyer, Stephen K.

doi:10.1007/s10822-011-9429-x


Published: 03 May 2011

Journal of Computer-Aided Molecular Design, volume 25, pages 427–441 (2011)



Abstract

A patent data base of 6.7 million compounds generated by a very high performance computer (Blue Gene) requires new techniques for exploitation when extensive use of chemical similarity is involved. Such exploitation includes the taxonomic classification of chemical themes, and data mining to assess mutual information between themes and companies. Importantly, we also launch candidates that evolve by “natural selection” as failure of partial match against the patent data base and their ability to bind to the protein target appropriately, by simulation on Blue Gene. An unusual feature of our method is that algorithms and workflows rely on dynamic interaction between match-and-edit instructions, which in practice are regular expressions. Similarity testing by these uses SMILES strings and, less frequently, graph or connectivity representations. Examining how this performs in high throughput, we note that chemical similarity and novelty are human concepts that largely have meaning by utility in specific contexts. For some purposes, mutual information involving chemical themes might be a better concept.
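The match-and-edit idea described in the abstract can be illustrated with a minimal Python sketch. The pattern, the function names and the example SMILES strings below are illustrative assumptions and are not taken from the authors' rule set. Note also that purely textual matching on SMILES is only a proxy for substructure matching unless the strings are written in a consistent (for example canonical) form.

```python
import re
from typing import List

# Hypothetical match rule: a carbonyl group written textually as "C(=O)".
CARBONYL = re.compile(r"C\(=O\)")

def edit_carbonyl_to_thiocarbonyl(smiles: str) -> str:
    """Edit step: rewrite the first textual C(=O) as C(=S)."""
    return CARBONYL.sub("C(=S)", smiles, count=1)

def matching_entries(probe: "re.Pattern", database: List[str]) -> List[str]:
    """Match step: return the database SMILES strings that the probe matches."""
    return [s for s in database if probe.search(s)]

if __name__ == "__main__":
    toy_database = ["CC(=O)C", "CCO"]  # acetone, ethanol (toy data only)
    print(matching_entries(CARBONYL, toy_database))
    print(edit_carbonyl_to_thiocarbonyl("CC(=O)C"))
```

In this toy setting a candidate would count as novel when `matching_entries` returns an empty list for its probe; a real workflow would need a far richer probe grammar than a single literal pattern.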


Notes

All US patents are read, and chemistries other than pharmaceutical are thus extracted. These are retained for reasons of detecting prior art in compositions of matter, relevance to chemical taxonomy, repurposing for pharmaceutical applications, and studies relating to toxicity.
Such a grammar is a set of continuously applied replacement rules that allows creation from a start object of any permissible grammatical object but no other objects in the language. The grammar includes that of the reserved features for managing the match and edit of content. It is only possible to disprove a sound generative grammar by doing the computation and halting at a badly formed object.
Strictly speaking, this is an assertion. In the present chemical context, we have not found a counterexample that is not a foolish choice or “bug”. It is complex because the grammar of an object and its reserved match-edit features may be correct, but the legality of content depends on perception of reasonable chemistry. Not least because some objects are input, we check automatically at intervals that the object grammar is legal, and that X is a valid chemistry at the very least with permissible valencies.
In other words, for this and what follows, the outcome is that if several objects are matched by one object, and several of the latter kind of objects are matched by one object, and so on, it implies a taxonomy, but it is not a requirement that the reserved match-edit features have the same format and grammar at each level.
Connectivity strings are not regex but the fast and compressed form of these for exact binary matching methods, because as b increases the number of paths increases explosively. They are re-definable in the system and several implementations and variations are being explored. Currently, rows 2–4 and columns 5–9 of the periodic table are numbered 0–14, and iodine is added as 15. These are then expressed in binary. The four-carbon fragment C–C–C–C is thus 0001000100010001, held in two 8-bit bytes or characters in memory.
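The 4-bit element coding in the note above can be sketched as follows. The note does not spell out the element order, so the table below is an assumption: the fifteen elements B to F, Al to Cl and Ga to Br taken row by row as 0–14, with iodine as 15. This ordering gives carbon the code 0001, which agrees with the bit string quoted in the note. The function names are hypothetical.

```python
from typing import List

# Assumed ordering (not stated explicitly in the note): three rows of five
# elements numbered 0-14, then iodine as 15.
ELEMENTS = ["B", "C", "N", "O", "F",
            "Al", "Si", "P", "S", "Cl",
            "Ga", "Ge", "As", "Se", "Br",
            "I"]
CODE = {symbol: index for index, symbol in enumerate(ELEMENTS)}

def path_to_bits(atoms: List[str]) -> str:
    """Concatenate the 4-bit code of each atom along a path."""
    return "".join(format(CODE[a], "04b") for a in atoms)

def path_to_bytes(atoms: List[str]) -> bytes:
    """Pack two atoms per byte; an even number of atoms is required here."""
    if len(atoms) % 2:
        raise ValueError("this sketch only packs paths with an even atom count")
    bits = path_to_bits(atoms)
    return int(bits, 2).to_bytes(len(bits) // 8, "big")

if __name__ == "__main__":
    print(path_to_bits(["C", "C", "C", "C"]))   # 0001000100010001
    print(path_to_bytes(["C", "C", "C", "C"]))  # two bytes, 0x11 0x11
```

Exact matching of two paths then reduces to comparing byte strings, which is what makes the compressed form fast.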
However, this is not very efficient because if a probe is matching the patent data base, so will a waxed form of it, i.e. one broader in scope. Waxing should occur when there are no matches for a set number of mutations (editing iterations). Waning should occur both (a) in the case of matches against the patent data base and (b) at intervals when there are persistently no matches, in order to generate specific compounds. A critical parameter is thus a default or user-specified number of mutations that must elapse before a probe is deemed “persistently novel” and worthy of being waned to a specific candidate compound. Nonetheless, the optimal choice of parameter can vary from study to study, and it is best to start to wane when the novelty entropy (see the footnote above) reaches a critical value.
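The waxing and waning rule in the note above can be read as a small decision procedure. The sketch below is one possible reading under stated assumptions: the two thresholds `wax_after` and `novel_after`, the action labels and the function name are all hypothetical, and a fixed mutation count is used in place of the novelty-entropy criterion that the note prefers.

```python
from typing import Tuple

def next_action(misses: int, matched: bool,
                wax_after: int, novel_after: int) -> Tuple[str, int]:
    """Return (action, updated miss count) for one editing iteration.

    misses      -- consecutive mutations so far without a database match
    matched     -- whether the current probe matched the patent data base
    wax_after   -- assumed threshold: broaden the probe every this many misses
    novel_after -- assumed threshold: misses before "persistently novel"
    """
    if matched:
        # Case (a): the probe overlaps the patent data, so narrow its scope.
        return "wane", 0
    misses += 1
    if misses >= novel_after:
        # Case (b): persistently novel, so wane to a specific candidate.
        return "wane_to_candidate", 0
    if misses % wax_after == 0:
        # No matches for a set number of mutations: broaden the probe.
        return "wax", misses
    return "mutate", misses

if __name__ == "__main__":
    count = 0
    for _ in range(5):
        action, count = next_action(count, False, wax_after=2, novel_after=5)
        print(action)
```

The sketch only makes sense when `wax_after` is smaller than `novel_after`; replacing the `misses >= novel_after` test with an entropy threshold would follow the note's recommendation more closely.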
In one mode, this information and that from small complete molecules can be used to select the chemical themes in a first pass, and not overlap their chemistries when they are seen in larger compounds. However, building blocks can be made out of building blocks, and we prefer an empirical approach based on association information described below.
Obviously, this idea requires further analysis. While they typically appear to be regions of overlap between recurrent biochemical and synthetic themes, this may reflect the likelihood that companies associated with them tend to use the same synthetic strategy, while different companies have a bigger chance of using different strategies. Prominent in strong associations are ring systems, possibly suggesting more obvious synthetic strategies, as well as more restricted Canonical SMILES solutions.
But this was confirmatory rather than discovery because the connection showed up in Internet research in Sect. 4.1 (e.g. Ref. [42]), so we do not describe our methods of building directories here.

References

Adams RS (2006) Information sources in patents. Walter de Gruyter: Amsterdam, The Netherlands
Lynch MF, Barnard JM, Welford SM (1981) Computer Storage and retrieval of generic chemical structures in patents, 1. Introduction and general strategy. J Chem Inf Comp Sci 21(3):148–150
Downs GM, Barnard JM (1998) Chemical patents and structural information: The Sheffield research in context. J Documentation 54(1):106–120
Oldach S, Stabinsk N (2009) The value of patent analytics, 2008. Intellectual property today. http://www.iptoday.com/articles/2008-6-oldach.asp. Accessed 20 Mar 2009
Feldman R, Sanger J (2006) The text mining handbook: advanced approaches in analyzing unstructured data. Cambridge University Press, Cambridge
Berks AH (2001) Current state of the art of Markush topological search systems. World Patent Inf 23(1):5–13
Li J, Robson B (2000) Bioinformatics and computational chemistry in molecular design. Recent advances and their application. In Peptide and Protein Drug Analysis, Marcel Dekker NY, 285–307
Paolini GV, Shapland HBR, van Hoorn WP, Mason JS, Hopkins AL (2006) Global mapping pharmaceutical space. Nat Biotechnol 24(7):805–815
Chen YP, Chen F (2008) Identifying targets for drug discovery using bioinformatics. Expert Opin Ther Targets 12(4):383–389
Digital Chemistry (2009) Digital chemistry. http://www.digitalchemistry.co.uk/prod_torus_patent.htm. Accessed 20 Jul 2009
Reel Two, Reel Two web site (2007) http://www.reeltwo.com/. Accessed 20 Jul 2009
Tripos Inc (2008) http://www.tripos.com/data/support/mol2.pdf. Accessed 5 Apr 2009
Symyx, Symyx Web Page (2009) http://www.symyx.com. Accessed 10 Nov 2009
Grant JA, Haigh JA, Pickup BT, Nicholls A, Sayle RA (2006) Lingos, finite state machines, and fast similarity searching. J Chem Inf Model 46(5):1912–1918
Haque IS, Pande VS, Walters WP (2010) SIML: A fast SIMD algorithm for calculating LINGO chemical similarities on GPUs. J Chem Inf Model 50:560–564
Rhodes J, Boyer S, Kreulen J, Chen Y, Ordonez P (2007) Mining patents using molecular similarity search. Pacific Symposium on Biocomputing, Maui, Hawaii, 3–7 January 2007. Ed. Altman et al. World Scientific Publishing, pp 304–315. http://www.almaden.ibm.com/asr/projects/biw/publications/Rhodes.pdf
Chen Y, Spangler S, Kreulen J, Boyer SK (2009) SIMPLE: A strategic information mining platform for IP excellence. In: IEEE international conference on data mining workshops, Miami, Florida, 6 Dec 2009. p 270–275. http://domino.research.ibm.com/library/cyberdig.nsf/papers/95D73078344701C9852576350055DBF3/$File/rj10450.pdf
Weininger D (1988) SMILES, a chemical language and information system. 1. Introduction to methodology and encoding rules. J Chem Inf Comp Sci 28:31–36
The Open Group, Regular Expressions (2009) The Single UNIX ® Specification, Version 2, 1997. Opengroup.org. http://www.opengroup.org/onlinepubs/007908799/xbd/re.html. Accessed 1 Aug 2009
Wall L, The Perl Development Team (2006) Perl.org. http://perldoc.perl.org/perlre.html. Accessed 9/1/2009
Fisanick W (1990) The chemical abstracts service generic chemical (Markush) structure storage and retrieval capability. 1. Basic concepts. J Chem Inf Comp Sci 30(2):145–154
Barnard JM (1991) A comparison of different approaches to Markush structure handling. J Chem Inf Comp Sci 31(1):64–68
Barnard JM (1993) Substructure searching methods: old and new. J Chem Inf Comp Sci 33(4):532–538
Barnard JM, Downs GM (1997) Chemical fragment generation and clustering software. J Chem Inf Comp Sci 37(1):141–142
Downs GM, Barnard JM (1997) Techniques for generating descriptive fingerprints in combinatorial libraries. J Chem Inf Comp Sci 37(1):59–61
Barnard JM, Downs GM (1992) Clustering of chemical structures on the basis of two-dimensional similarity measure. J Chem Inf Comp Sci 32(6):644–649
Brown RD, Martin YC (1996) Use of structure-activity data to compare structure-based clustering methods and descriptors for use in compound selection. J Chem Inf Comp Sci 36:572–584
Robson B, Finn PW (1984) Rational design of conformationally flexible drugs. ATLA Journal. Alternatives to Laboratory Animals 11: 67–78
Ivanciuc O (2003) Canonical numbering and constitutional symmetry. In: Handbook of Chemoinformatics, Ed. J. Gasteiger, Wiley-VCH, pp 139–160
Daylight Chemical Systems, Inc (2009) http://www.daylight.com/. Accessed 10 Apr 2009
Dethlefsen W, Lynch MF, Gillet VJ, Downs GM, Holliday JD, Barnard JM (1991) Computer storage and retrieval of generic chemical structures in patents. 12. Principles of search operations involving parameter lists: matching-relations, user-defined match levels, and transition from the reduced graph search to the refined search. J Chem Inf Comp Sci 31(2):253–260
Robson B (1974) Analysis of the code relating sequence to conformation in globular proteins: theory and application of expected information. Biochem J 141:853–867
Robson B (2008) Clinical and pharmacogenomic data mining: 4. The FANO program and command set as an example of tools for biomedical discovery and evidence based medicine. J Proteome Res 7(9):3922–3947
Wikipedia (2010) http://en.wikipedia.org/wiki/IUPAC_nomenclature. Accessed 8/30/2010
Wikipedia (2010) http://en.wikipedia.org/wiki/Blue_Gene. Accessed 8/3/09
Kramer A, Horn HW, Rice J (2003) Fast 3D molecular superposition and similarity search in databases of flexible molecules. J Comp Aided Mol Des 17(1):13–38
IBM Corporation, Data Discovery and Query Builder’s User’s Guide (2006) IBM Corporation. http://publib.boulder.ibm.com/infocenter/systems/topic/ddqb/v2r1ddqbusersguide.pdf. Accessed 7 Apr 2009
University of California San Francisco, http://zinc.docking.org/. Accessed 8 Aug 2009
RCSB Protein data Bank (2008) http://www.wwpdb.org/docs.html. Accessed 5 Apr 2009
Warner J (2004) Licorice root may keep mental skills sharp: compound derived from licorice root may fight effects of aging on brain. 2004, March. WebMD News. http://www.webmd.com/alzheimers/news/20040329/licorice-root-may-keep-mental-skills-sharp. Accessed 5 Apr 2009
Livingstone DE, Walker BR (2003) Is 11beta-hydroxysteroid dehydrogenase type 1 a therapeutic target? Effects of carbenoxolone in lean and obese Zucker rats. J Pharmacol Exp Ther 305(1):167–172
Wikipedia (2009) http://en.wikipedia.org/wiki/Zipf's_law. Accessed 6 Aug 2009
CAS, a division of the American Chemical Society. Support Page (2009) http://www.cas.org/support/scifi/index.html. Accessed 1 Jan 2010
CAS, a division of the American Chemical Society, Products page (2009) http://www.cas.org/products/sfacad/index.html. Accessed 1 Jan 2010
Schmidt MW, Baldridge KK, Boatz JA, Elbert ST, Gordon MS, Jensen JH, Koseki S, Matsunaga N, Su S, Windus TL, Dupuis M, Montgomery JA (1993) General atomic and molecular electronic structure system. J Comp Chem 14:1347–1363
Peters A, Lundberg M, Sosa CP, Lang T (2007) High throughput computing validation for drug discovery using the DOCK program on a massively parallel system. 1st annual MSCBB, Northwestern University, Evanston, IL, September 2007; available as Peters A, Lundberg M, Lang T, and Sosa, CP, 2008, RedPaper 4410 from IBM Corporation Poughkeepsie, NY
Balius TE, Mukherjee S (2008) Stony Brook University web site. http://www.ams.sunysb.edu/~tbalius/NamdandDockonNYBlue.pdf. Accessed 8 Aug 2009
Shivakumar D (2008) (updated 2009). University of California San Francisco, http://dock.compbio.ucsf.edu/DOCK_6/tutorials/amber_score/amber_score.htm. Accessed 12 Aug 2009
McWeeny R (1979) Coulson’s Valence, 3rd edn. Oxford University Press, Oxford, UK see Ch. 6
Robson B, Curioni A, Mordasini T (2002) Studies in the assessment of folding quality for protein modeling and structure prediction. J Proteome Res (Am Chem Soc) 1(2):115–133
Robson B, Vaithilingham A (2008) “Protein Folding Revisited” pp 161–202 in Progress in Molecular Biology and Translational Science, Vol 84: Molecular Biology of Protein Folding, Elsevier Press/Academic Press
Robson B, Douglas GM, Platt E (1982) A new algorithm for rapid calculation of conformational energies. Biochem Soc Trans 10:388–389
Robson B, Platt E (1986) Refined models for computer calculations in protein engineering. Calculation and testing of atomic potential functions compatible with more efficient calculations. 188: 259–281
Collura VP, Greaney PJ, Robson B (1994) A method for rapidly assessing and refining simple solvent treatments in molecular modeling. Example studies on the antigen-combining loop H2 from FAB fragment McPC603. Protein Eng 7:221–233


Author information

Authors and Affiliations

St Matthews University School of Medicine, Grand Cayman, Cayman Islands, The University of Wisconsin-Stout, Menomonie, Wisconsin, USA
Barry Robson
Global Compound Sciences, AstraZeneca R&D, Alderley Park, Macclesfield, Cheshire, UK
Jin Li
Prentice, Rochester, USA
Richard Dettinger
Department of Physics, Harvard University, Cambridge, MA, USA
Amanda Peters
Collabra Inc., San Jose, CA, USA
Stephen K. Boyer


Corresponding author

Correspondence to Barry Robson.

Additional information

This report also includes work done in part by authors Barry Robson, Amanda Peters and Stephen K. Boyer at IBM Corporation.

Appendix: Binding studies

Binding studies will be more familiar, but are well known to be nontrivial. Note first that in the catalytic region of chain A of the experimental 2BEL tetramer structure with carbenoxolone and NADP cofactor ligands (and rather similarly for the other monomers), the Ser 170 side chain oxygen atom makes a surprisingly tight 2.5 Å approach to the original double bonded C=O oxygen O11 atom on C11 of the steroid-like framework. The sum of the van der Waals radii of the serine and ligand oxygen atoms is 1.52 + 1.52 = 3.04 Å. Following DOCK and AMBER, this O…O distance is 3.0 Å. Adding to the electronegative tension, there is also a close approach by the phenolic oxygen of Tyr 183 at 5.0 Å, and by oxygen O7N in the NADP cofactor at 4.9 Å. The other close approaches do not compensate significantly: they are most importantly the carbon C4N of NADP at 4.3 Å and the Ala 172 side chain carbon at 4.5 Å, but a good binding comes from interactions involving the whole binding site region. It is substitutions at or around C11 that are compensated by the most surprising degree of binding site accommodation, i.e. significant conformational changes in backbone as well as side chains in the binding site. Replacing the O…O contact by O…S by using thioketone (cboS1) as ligand initially generates a considerable van der Waals and electrostatic repulsion as the thioketone S is strongly electronegative [49], but binding is still accommodated. These observations led to the following studies.

In order (a) to see if any multiple binding modes and binding site conformational changes may have been missed, and (b) to try to predict Blue Gene performance as a screening tool for potential candidates, a variety of extensive analyses and comparative studies using the full tetramer with NADP cofactors were performed using KRUNCH [50]. This can use and compare AMBER and other force fields and re-parameterize them, as well as perform both molecular mechanics and dynamics calculations. As an alternative to Blue Gene’s processing speed and brute force calculation, this has a large kit of techniques for spanning energy barriers and searching conformational space, developed and adapted over many years by Robson and coworkers. See Ref. [51] for a general review; the techniques of Refs. [52–54] were also used. Results relative to carbenoxolone are shown in column 3 of Table 2. The regression slope of columns 2 and 3 is close to unity at 0.97 with a Pearson’s correlation R of 0.88, with intercept close to zero. That is after adjusting the slope for predictive purposes: prior to the latter, on average the RPFF energies were amplified 4.6% over the DOCK-AMBER results. The major difference here appears to be in electrostatics, because a simple 6% increase in effective dielectric constant in the RPFF brings the results into alignment (regression slope 0.95). Even recalling that this is predicting another calculation, not experiment, the above is a reasonable result within the state of the art.

However, in more recent studies, probes started from the patent and ZINC data bases were left to evolve for several weeks. Some ligands bound specifically to the 2BEL binding site by the RPFF plus heuristic methods, but did not bind to the site using DOCK and AMBER. An early example was (2,3)-dithio-6-hydroxy-8-carboxy-(1,7,9)-azanaphthalene as indicated by ‘**’ in the second column of Table 2. The common feature of these “prediction failures” is that they are small ligands of one or two rings, and display multiple binding modes.
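The slope, intercept and Pearson correlation quoted for columns 2 and 3 of Table 2 are ordinary least-squares quantities. A minimal sketch of how such a comparison could be computed is given below. It assumes that one column of relative binding energies (for example DOCK-AMBER) is passed as `x` and the other (for example RPFF) as `y`; that orientation and the function name are assumptions, and the Table 2 values themselves are not reproduced here.

```python
from math import sqrt
from typing import Sequence, Tuple

def linear_fit(x: Sequence[float], y: Sequence[float]) -> Tuple[float, float, float]:
    """Least-squares slope and intercept of y on x, plus Pearson's R."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have equal length of at least 2")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r = sxy / sqrt(sxx * syy)
    return slope, intercept, r
```

A slope near 1 with an intercept near 0 indicates that the two calculations agree on the scale of the relative energies, while R measures how tightly the individual ligands follow that line.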


About this article

Cite this article

Robson, B., Li, J., Dettinger, R. et al. Drug discovery using very large numbers of patents. General strategy with extensive use of match and edit operations. J Comput Aided Mol Des 25, 427–441 (2011). https://doi.org/10.1007/s10822-011-9429-x


Received: 07 December 2010
Accepted: 12 April 2011
Published: 03 May 2011
Issue Date: May 2011
DOI: https://doi.org/10.1007/s10822-011-9429-x
