Abstract
We present strand and codeword design schemes for a DNA database capable of approximate similarity search over a multidimensional dataset of content-rich media. Our strand designs address cross-talk in associative DNA databases, and we demonstrate a novel method for learning DNA sequence encodings from data, applying it to a dataset of tens of thousands of images. We test our design in the wetlab using one hundred target images and ten query images, and show that our database is capable of performing similarity-based enrichment: on average, visually similar images account for 30% of the sequencing reads for each query, despite making up only 10% of the database.
Notes
1. Given a set of n pairs of binary labels \(y \in \{0,1\}\) and retrieval probabilities p, the cross-entropy loss is:
$$ l(y,p) = - \frac{1}{n}\sum _{i=1}^{n} \left[ y_i \cdot \log (p_i) + (1-y_i) \cdot \log (1-p_i) \right]. $$
2. Functions of the type:
$$ f(x) = \frac{1}{1+\exp (ax - b)}. $$
3. Given two vectors \(\mathbf {u}\) and \(\mathbf {v}\), the cosine distance is:
$$ d(\mathbf {u},\mathbf {v}) = 1 - \frac{\mathbf {u} \cdot \mathbf {v}}{||\mathbf {u}||\ ||\mathbf {v}||}. $$
4. Given an N-dimensional vector \(\mathbf {u}\), the softmax function is defined element-wise as follows:
$$ \mathrm {softmax}{(\mathbf {u})}_i = \frac{e^{u_i}}{\sum _{j=1}^{N} e^{u_j}}. $$
5. The ReLU function is defined as:
$$ \mathrm {ReLU}(x) = \max (x, 0). $$
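The five definitions in these notes can be checked numerically with a short sketch. This is an illustration in plain Python, not the authors' implementation: the function names are hypothetical, the parameters a and b of note 2 are left to the caller because the notes do not give values, and the cross-entropy assumes every probability lies strictly between 0 and 1.

```python
import math


def cross_entropy(labels, probs):
    """Note 1: mean binary cross-entropy over n (label, probability) pairs.

    Assumes each probability is strictly inside (0, 1); log(0) would fail.
    """
    n = len(labels)
    total = 0.0
    for y, p in zip(labels, probs):
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return -total / n


def sigmoid_like(x, a, b):
    """Note 2: f(x) = 1 / (1 + exp(a*x - b)); a and b are caller-supplied."""
    return 1.0 / (1.0 + math.exp(a * x - b))


def cosine_distance(u, v):
    """Note 3: one minus the cosine of the angle between u and v."""
    dot = sum(ui * vi for ui, vi in zip(u, v))
    norm_u = math.sqrt(sum(ui * ui for ui in u))
    norm_v = math.sqrt(sum(vi * vi for vi in v))
    return 1.0 - dot / (norm_u * norm_v)


def softmax(u):
    """Note 4: element-wise softmax.

    Subtracting max(u) first avoids overflow and leaves the result unchanged.
    """
    m = max(u)
    exps = [math.exp(ui - m) for ui in u]
    s = sum(exps)
    return [e / s for e in exps]


def relu(x):
    """Note 5: ReLU(x) = max(x, 0)."""
    return max(x, 0.0)


# Example calls with made-up inputs (not data from the paper).
print(cross_entropy([1, 0], [0.9, 0.2]))
print(sigmoid_like(0.5, a=2.0, b=1.0))
print(cosine_distance([1.0, 0.0], [0.0, 1.0]))
print(softmax([1.0, 2.0, 3.0]))
print(relu(-2.0))
```

With a positive a, the function in note 2 decreases as x grows, so larger inputs map to values closer to 0.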
Acknowledgments
We would like to thank the anonymous reviewers, whose input was very helpful in improving the manuscript. We also thank the Molecular Information Systems Lab and Seelig Lab members for their input, especially Max Willsey, who helped frame an early version. We thank Dr. Anne Fischer for suggesting a better way to present some of the data. This work was supported in part by Microsoft and by a grant from DARPA under the Molecular Informatics Program.
Copyright information
© 2018 Springer Nature Switzerland AG
About this paper
Cite this paper
Stewart, K. et al. (2018). A Content-Addressable DNA Database with Learned Sequence Encodings. In: Doty, D., Dietz, H. (eds) DNA Computing and Molecular Programming. DNA 2018. Lecture Notes in Computer Science, vol 11145. Springer, Cham. https://doi.org/10.1007/978-3-030-00030-1_4
DOI: https://doi.org/10.1007/978-3-030-00030-1_4
Published:
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-00029-5
Online ISBN: 978-3-030-00030-1
eBook Packages: Computer Science, Computer Science (R0)