Abstract
Integrating data involving chemical structures is simplified when unique identifiers (UIDs) can be associated with chemical structures. For example, these identifiers can be used as database keys. One common approach is to use the Unique SMILES notation introduced in [2]. Unique SMILES views a chemical structure as a graph, with atoms as nodes and bonds as edges, and uses a depth-first traversal of the graph to generate the SMILES string. The algorithm establishes a node ordering by using certain symmetry properties of the graph. In this paper, we present molecular graphs for which the algorithm fails to generate UIDs. Indeed, we show that different graphs in the same symmetry class employed by the Unique SMILES algorithm have different Unique SMILES IDs. We tested the algorithm on the National Cancer Institute (NCI) database [7] and found several molecular structures for which it also failed. We have also written a Python script that generates molecular graphs for which the algorithm fails.
References
Weininger, D.: SMILES, a Chemical Language and Information System 1: Introduction to Methodology and Encoding Rules, Medicinal Chemistry Project, Pomona College (1988)
Weininger, D., Weininger, A., Weininger, J.L.: SMILES 2: Algorithm for Generation of Unique SMILES Notation, Daylight Chemical Information Systems, Irvine, California 92714 (1989). Note that although the Unique SMILES implementation has since been changed by Daylight Chemical Information Systems, this appears to be the most recent publication describing the algorithm.
Weininger, D.: SMILES 3: Depict. Graphical Depiction of Chemical Structures, Daylight Chemical Information Systems, New Orleans, Louisiana
A SMILES to graph translation can be found at: http://www.daylight.com/daycgi/depict
A SMILES to Unique SMILES translation can be found at: http://cactus.nci.nih.gov/services/translate/
More counter examples can be found at the web site: http://ncdm171.lac.uic.edu/neglur/USMILES/USMILES.html
NCI database, retrieved from http://129.43.27.140/ncidb2/ on March 2, 2005
Sample adjacency list used: {1:[['C',1,6,'0',3],[[1,2]]], 2:[['C',2,6,'0',2],[[1,1],[1,3]]], 3:[['C',4,6,'0',0],[[1,2],[2,4],[1,11]]], 4:[['C',3,6,'0',1],[[2,3],[1,5]]], 5:[['C',4,6,'0',0],[[1,4],[1,6],[2,8]]], 6:[['C',2,6,'0',2],[[1,5],[1,7]]], 7:[['C',1,6,'0',3],[[1,6]]], 8:[['C',3,6,'0',1],[[2,5],[1,9]]], 9:[['C',4,6,'0',0],[[1,8],[1,10],[2,11]]], 10:[['C',1,6,'0',3],[[1,9]]], 11:[['C',3,6,'0',1],[[1,3],[2,9]]]}
CANON algorithm (extract from reference [2]): (1) Set the atomic vector to initial invariants. Go to step 3. (2) Set vector to product of primes corresponding to neighbors’ ranks. (3) Sort vector, maintaining stability over previous ranks. (4) Rank atomic vector. (5) If not invariant partitioning, go to step 2. (6) On first pass, save partitioning as symmetry classes. (7) If highest rank is smaller than number of nodes, break ties, go to step 2. (8)... else done
See http://bioweb.dataspaceweb.org/chemicalKeys (retrieved on March 2, 2005)
Beyer, T., Proskurowski, A.: Symmetries in graph coding. In: Proceedings of Northwest 1976 ACM–CIPS Pacific Regional Symposium, pp. 198–203 (1976)
HM, C.B., Santolini, A.: A quasi-decision algorithm for the p-equivalence of two matrices. ICC Bull. 8(1), 57–69 (1964)
IUPAC, Nomenclature of Organic Chemistry. Pergamon Press, Oxford (1979)
Klin, M.H., Lebedev, O.V., Pivina, T.S., Zefirov, N.S.: Nonisomorphic cycles of maximum length in a series of chemical graphs and the problem of application of IUPAC nomenclature rules. MATCH 27, 133–151 (1992)
See http://www.iupac.org/projects/2000/2000-025-1-800.html (retrieved on March 2, 2005)
Randic, M., Brissey, G.M., Wilkins, C.L.: Computer perception of topological symmetry via canonical numbering of atoms. Journal of Chemical Information and Computer Sciences 21(1), 52–59 (1981)
McKay, B.: Practical Graph Isomorphism. Congr. Numer. 30, 45–87 (1981)
Morgan, H.L.: The Generation of a Unique Machine Description for Chemical Structures – A Technique Developed at Chemical Abstracts Service. J. Chem. Doc. 5, 107–113 (1965)
Braun, J., Gugisch, R., Kerber, A., Laue, R., Meringer, M., Rücker, C.: MOLGEN-CID, A Canonizer for Molecules and Graphs Accessible through the Internet. Journal of Chemical Information and Computer Sciences 44, 542–548 (2004)
Grossman, R., Hamelberg, D., Kasturi, P., Liu, B.: Experimental Studies of the Universal Chemical Key (UCK) Algorithm on the NCI Database of Chemical Compounds. In: Proceedings of the 2003 IEEE Computer Society Bioinformatics Conference (CSB 2003), pp. 244–250. IEEE Computer Society, Los Alamitos (2003)
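The refinement part of the CANON extract quoted in the notes above (steps 1 to 5) can be illustrated on the sample adjacency list from the same notes. The following is a minimal Python sketch, not the authors' script and not Daylight's implementation. Several things in it are assumptions: the names `MOL`, `PRIMES`, `dense_rank` and `symmetry_classes` are hypothetical; the atom record is read as [symbol, bond-order sum to heavy atoms, atomic number, charge, attached hydrogens], which is inferred from the data rather than stated in the source; and the initial invariant is a simplified tuple rather than the full invariant of reference [2]. Tie breaking (steps 6 to 8) is omitted, so the sketch only computes symmetry classes and does not reproduce the failures reported in the paper.

```python
import math

# Sample adjacency list from the notes, written with straight quotes.
# atom index -> [atom record, [[bond order, neighbor index], ...]]
MOL = {
    1: [['C', 1, 6, '0', 3], [[1, 2]]],
    2: [['C', 2, 6, '0', 2], [[1, 1], [1, 3]]],
    3: [['C', 4, 6, '0', 0], [[1, 2], [2, 4], [1, 11]]],
    4: [['C', 3, 6, '0', 1], [[2, 3], [1, 5]]],
    5: [['C', 4, 6, '0', 0], [[1, 4], [1, 6], [2, 8]]],
    6: [['C', 2, 6, '0', 2], [[1, 5], [1, 7]]],
    7: [['C', 1, 6, '0', 3], [[1, 6]]],
    8: [['C', 3, 6, '0', 1], [[2, 5], [1, 9]]],
    9: [['C', 4, 6, '0', 0], [[1, 8], [1, 10], [2, 11]]],
    10: [['C', 1, 6, '0', 3], [[1, 9]]],
    11: [['C', 3, 6, '0', 1], [[1, 3], [2, 9]]],
}

# Enough primes for this 11-atom example (rank r maps to the r-th prime).
PRIMES = [2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31]


def dense_rank(keys):
    """Map each atom to the 1-based rank of its sort key (ties share a rank)."""
    order = {k: i + 1 for i, k in enumerate(sorted(set(keys.values())))}
    return {atom: order[k] for atom, k in keys.items()}


def symmetry_classes(mol):
    """Refine atom ranks until the partition stops changing (steps 1-5)."""
    # Step 1 (simplified, assumed invariant): number of connections,
    # second record field, atomic number, attached hydrogens.
    ranks = dense_rank({
        atom: (len(nbrs), rec[1], rec[2], rec[4])
        for atom, (rec, nbrs) in mol.items()
    })
    while True:
        # Step 2: product of the primes corresponding to neighbors' ranks.
        products = {
            atom: math.prod(PRIMES[ranks[n] - 1] for _, n in nbrs)
            for atom, (_, nbrs) in mol.items()
        }
        # Steps 3-4: sort on (previous rank, product) so earlier
        # distinctions are kept, then re-rank.
        new_ranks = dense_rank({a: (ranks[a], products[a]) for a in mol})
        # Step 5: stop once the partitioning is invariant.
        if new_ranks == ranks:
            return ranks
        ranks = new_ranks


ranks = symmetry_classes(MOL)
classes = {}
for atom, r in sorted(ranks.items()):
    classes.setdefault(r, []).append(atom)
print(classes)
```

Under these assumptions the eleven atoms fall into seven classes: the mirror-related pairs (1, 7), (2, 6), (3, 5) and (8, 11) share a rank, while atoms 4, 9 and 10 are each alone. Atoms that share a rank here are exactly the ones a tie-breaking step would have to order.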
Copyright information
© 2005 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Neglur, G., Grossman, R.L., Liu, B. (2005). Assigning Unique Keys to Chemical Compounds for Data Integration: Some Interesting Counter Examples. In: Ludäscher, B., Raschid, L. (eds) Data Integration in the Life Sciences. DILS 2005. Lecture Notes in Computer Science, vol 3615. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11530084_13
DOI: https://doi.org/10.1007/11530084_13
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-27967-9
Online ISBN: 978-3-540-31879-8
eBook Packages: Computer Science, Computer Science (R0)