Assigning Unique Keys to Chemical Compounds for Data Integration: Some Interesting Counter Examples

Neglur, Greeshma; Grossman, Robert L.; Liu, Bing

doi:10.1007/11530084_13

Part of the book series: Lecture Notes in Computer Science (LNBI, volume 3615)

Included in the following conference series:

International Workshop on Data Integration in the Life Sciences

Abstract

Integrating data involving chemical structures is simplified when unique identifiers (UIDs) can be associated with chemical structures. For example, these identifiers can be used as database keys. One common approach is to use the Unique SMILES notation introduced in [2]. The Unique SMILES algorithm views a chemical structure as a graph with atoms as nodes and bonds as edges, and uses a depth-first traversal of the graph to generate the SMILES string. The algorithm establishes a node ordering by using certain symmetry properties of the graph. In this paper, we present certain molecular graphs for which the algorithm fails to generate UIDs. Indeed, we show that different graphs in the same symmetry class employed by the Unique SMILES algorithm have different Unique SMILES IDs. We tested the algorithm on the National Cancer Institute (NCI) database [7] and found several molecular structures for which the algorithm also failed. We have also written a Python script that generates molecular graphs for which the algorithm fails.
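
To illustrate why a depth-first traversal needs a canonical node ordering before it can yield a unique string, here is a minimal Python sketch. It is not the Unique SMILES algorithm of [2]: the function name `dfs_smiles` and the dictionary layout are assumptions made for this illustration, and ring closures, bond orders, and hydrogens are left out.

```python
def dfs_smiles(atoms, bonds, start):
    """Write a SMILES-like string for an acyclic molecule by depth-first traversal.

    atoms: dict mapping atom id -> element symbol (hypothetical layout)
    bonds: dict mapping atom id -> list of neighbor ids (single bonds only)
    """
    def visit(atom, parent):
        children = [n for n in bonds[atom] if n != parent]
        text = atoms[atom]
        # Every child except the last is written as a parenthesised branch.
        for child in children[:-1]:
            text += "(" + visit(child, atom) + ")"
        # The last child continues the main chain.
        if children:
            text += visit(children[-1], atom)
        return text

    return visit(start, None)


# Heavy-atom skeleton of ethanol: C0 - C1 - O2
atoms = {0: "C", 1: "C", 2: "O"}
bonds = {0: [1], 1: [0, 2], 2: [1]}

print(dfs_smiles(atoms, bonds, start=0))  # CCO
print(dfs_smiles(atoms, bonds, start=2))  # OCC
print(dfs_smiles(atoms, bonds, start=1))  # C(C)O
```

All three strings describe the same molecule, so the string can only serve as a database key if the starting atom and the order in which branches are visited are fixed by the structure itself.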

References

[1] Weininger, D.: SMILES, a Chemical Language and Information System 1: Introduction to Methodology and Encoding Rules, Medicinal Chemistry Project, Pomona College (1988)
[2] Weininger, D., Weininger, A., Weininger, J.L.: SMILES 2: Algorithm for Generation of Unique SMILES Notation, Daylight Chemical Information Systems, Irvine, California 92714 (1989). Note that although the Unique SMILES implementation has since been changed by Daylight Chemical Information Systems, this appears to be the most recent publication describing the algorithm.
[3] Weininger, D.: SMILES 3: Depicting Graphical Depiction of Chemical Structures, Daylight Chemical Information Systems, New Orleans, Louisiana
[4] A SMILES to graph translation can be found at: http://www.daylight.com/daycgi/depict
[5] A SMILES to Unique SMILES translation can be found at: http://cactus.nci.nih.gov/services/translate/
[6] More counter examples can be found at the web site: http://ncdm171.lac.uic.edu/neglur/USMILES/USMILES.html
[7] NCI database, retrieved from http://129.43.27.140/ncidb2/ (March 2, 2005)
[8] Sample adjacency list used: {1:[['C',1,6,'0',3],[[1,2]]], 2:[['C',2,6,'0',2],[[1,1],[1,3]]], 3:[['C',4,6,'0',0],[[1,2],[2,4],[1,11]]], 4:[['C',3,6,'0',1],[[2,3],[1,5]]], 5:[['C',4,6,'0',0],[[1,4],[1,6],[2,8]]], 6:[['C',2,6,'0',2],[[1,5],[1,7]]], 7:[['C',1,6,'0',3],[[1,6]]], 8:[['C',3,6,'0',1],[[2,5],[1,9]]], 9:[['C',4,6,'0',0],[[1,8],[1,10],[2,11]]], 10:[['C',1,6,'0',3],[[1,9]]], 11:[['C',3,6,'0',1],[[1,3],[2,9]]]}
[9] CANON Algorithm (extract from Reference [2]): (1) Set the atomic vector to initial invariants. Go to step 3. (2) Set vector to product of primes corresponding to neighbors' ranks. (3) Sort vector, maintaining stability over previous ranks. (4) Rank atomic vector. (5) If not invariant partitioning, go to step 2. (6) On first pass, save partitioning as symmetry classes. (7) If highest rank is smaller than number of nodes, break ties, go to step 2. (8) ... else done.
[10] See http://bioweb.dataspaceweb.org/chemicalKeys (retrieved on March 2, 2005)
[11] Beyer, T., Proskurowski, A.: Symmetries in graph coding. In: Proceedings of Northwest 1976 ACM–CIPS Pacific Regional Symposium, pp. 198–203 (1976)
[12] Böhm, C., Santolini, A.: A quasi-decision algorithm for the p-equivalence of two matrices. ICC Bull. 8(1), 57–69 (1964)
[13] IUPAC, Nomenclature of Organic Chemistry. Pergamon Press, Oxford (1979)
[14] Klin, M.H., Lebedev, O.V., Pivina, T.S., Zefirov, N.S.: Nonisomorphic cycles of maximum length in a series of chemical graphs and the problem of application of IUPAC nomenclature rules. MATCH 27, 133–151 (1992)
[15] See http://www.iupac.org/projects/2000/2000-025-1-800.html (retrieved on March 2, 2005)
[16] Randic, M., Brissey, G.M., Wilkins, C.L.: Computer perception of topological symmetry via canonical numbering of atoms. Journal of Chemical Information and Computer Sciences 21(1), 52–59 (1981)
[17] McKay, B.: Practical Graph Isomorphism. Congr. Numer. 30, 45–87 (1981)
[18] Morgan, H.L.: The Generation of a Unique Machine Description for Chemical Structures – A Technique Developed at Chemical Abstracts Service. J. Chem. Doc. 5, 107–113 (1965)
[19] Braun, J., Gugisch, R., Kerber, A., Laue, R., Meringer, M., Rücker, C.: MOLGEN-CID, A Canonizer for Molecules and Graphs Accessible through the Internet. Journal of Chemical Information and Computer Sciences 44, 542–548 (2004)
[20] Grossman, R., Hamelberg, D., Kasturi, P., Liu, B.: Experimental Studies of the Universal Chemical Key (UCK) Algorithm on the NCI Database of Chemical Compounds. In: Proceedings of the 2003 IEEE Computer Society Bioinformatics Conference (CSB 2003), pp. 244–250. IEEE Computer Society, Los Alamitos (2003)
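
The sample adjacency list and the CANON steps quoted in the reference list above can be combined into a small worked example. The following Python sketch covers only steps 1 to 6 (refinement until the partitioning is invariant, giving the symmetry classes) and omits the tie-breaking of step 7. It is not the authors' script. The reading of the atom fields as [symbol, bond-order sum, atomic number, charge, hydrogen count] and of each bond as [bond order, neighbor id] is an assumption inferred from the data, the choice of initial invariant is likewise an assumption, and the names `first_primes`, `dense_rank`, and `symmetry_classes` are hypothetical.

```python
from math import prod

# Sample adjacency list from the reference list, with straight quotes.
# Assumed layout (not stated in the source):
#   atom id -> [[symbol, bond-order sum, atomic number, charge, H count],
#               [[bond order, neighbor id], ...]]
MOLECULE = {
    1: [['C', 1, 6, '0', 3], [[1, 2]]],
    2: [['C', 2, 6, '0', 2], [[1, 1], [1, 3]]],
    3: [['C', 4, 6, '0', 0], [[1, 2], [2, 4], [1, 11]]],
    4: [['C', 3, 6, '0', 1], [[2, 3], [1, 5]]],
    5: [['C', 4, 6, '0', 0], [[1, 4], [1, 6], [2, 8]]],
    6: [['C', 2, 6, '0', 2], [[1, 5], [1, 7]]],
    7: [['C', 1, 6, '0', 3], [[1, 6]]],
    8: [['C', 3, 6, '0', 1], [[2, 5], [1, 9]]],
    9: [['C', 4, 6, '0', 0], [[1, 8], [1, 10], [2, 11]]],
    10: [['C', 1, 6, '0', 3], [[1, 9]]],
    11: [['C', 3, 6, '0', 1], [[1, 3], [2, 9]]],
}


def first_primes(count):
    """Return the first `count` primes by trial division."""
    primes, candidate = [], 2
    while len(primes) < count:
        if all(candidate % p for p in primes):
            primes.append(candidate)
        candidate += 1
    return primes


def dense_rank(keys):
    """Map each atom to the 1-based rank of its key; equal keys share a rank."""
    order = {key: i + 1 for i, key in enumerate(sorted(set(keys.values())))}
    return {atom: order[key] for atom, key in keys.items()}


def symmetry_classes(molecule):
    primes = first_primes(len(molecule))
    # Step 1: initial invariants (assumed: degree plus the stored atom fields).
    ranks = dense_rank({atom: (len(bonds), *props[1:])
                        for atom, (props, bonds) in molecule.items()})
    while True:
        # Step 2: product of primes corresponding to the neighbors' ranks.
        products = {atom: prod(primes[ranks[nbr] - 1] for _order, nbr in bonds)
                    for atom, (_props, bonds) in molecule.items()}
        # Steps 3-4: sort with the previous rank as primary key, then re-rank.
        refined = dense_rank({atom: (ranks[atom], products[atom])
                              for atom in molecule})
        # Step 5: stop once the partitioning no longer changes.
        if refined == ranks:
            return ranks  # Step 6: atoms sharing a rank form a symmetry class.
        ranks = refined


classes = {}
for atom, atom_rank in symmetry_classes(MOLECULE).items():
    classes.setdefault(atom_rank, []).append(atom)
print(sorted(classes.values()))
# [[1, 7], [2, 6], [3, 5], [4], [8, 11], [9], [10]]
```

Under these assumptions the eleven atoms fall into seven classes, so the highest rank is smaller than the number of nodes and step 7 would have to break ties before a unique ordering exists.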

Author information

Authors and Affiliations

Laboratory for Advanced Computing, University of Illinois at Chicago, Chicago, IL, 60607, USA
Greeshma Neglur & Robert L. Grossman
Department of Computer Science, University of Illinois at Chicago, Chicago, IL, 60607, USA
Bing Liu

Editor information

Editors and Affiliations

Department of Computer Science, University of California, Davis
Bertram Ludäscher
University of Maryland, College Park, 20742, MD, USA
Louiqa Raschid

About this paper

Cite this paper

Neglur, G., Grossman, R.L., Liu, B. (2005). Assigning Unique Keys to Chemical Compounds for Data Integration: Some Interesting Counter Examples. In: Ludäscher, B., Raschid, L. (eds) Data Integration in the Life Sciences. DILS 2005. Lecture Notes in Computer Science, vol 3615. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11530084_13

DOI: https://doi.org/10.1007/11530084_13
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-27967-9
Online ISBN: 978-3-540-31879-8
eBook Packages: Computer Science, Computer Science (R0)
