A Content-Addressable DNA Database with Learned Sequence Encodings

Stewart, Kendall; Chen, Yuan-Jyue; Ward, David; Liu, Xiaomeng; Seelig, Georg; Strauss, Karin; Ceze, Luis

doi:10.1007/978-3-030-00030-1_4

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 11145)

Included in the following conference series:

International Conference on DNA Computing and Molecular Programming

Abstract

We present strand and codeword design schemes for a DNA database capable of approximate similarity search over a multidimensional dataset of content-rich media. Our strand designs address cross-talk in associative DNA databases, and we demonstrate a novel method for learning DNA sequence encodings from data, applying it to a dataset of tens of thousands of images. We test our design in the wetlab using one hundred target images and ten query images, and show that our database is capable of performing similarity-based enrichment: on average, visually similar images account for 30% of the sequencing reads for each query, despite making up only 10% of the database.
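The enrichment figures quoted above can be checked with simple arithmetic. The sketch below assumes a "fold enrichment" measure, defined here only for illustration as the share of sequencing reads divided by the share of the database; the function name and this definition are not taken from the paper.

```python
def fold_enrichment(read_fraction: float, database_fraction: float) -> float:
    """Ratio of a group's share of reads to its share of the database.

    Hypothetical helper for illustration; the paper may quantify
    enrichment differently.
    """
    if database_fraction <= 0:
        raise ValueError("database_fraction must be positive")
    return read_fraction / database_fraction


# Figures from the abstract: similar images receive 30% of the reads on
# average while making up 10% of the database.
ratio = fold_enrichment(0.30, 0.10)
print(f"fold enrichment is about {ratio:.1f}x")
```

Under this definition, the reported averages correspond to roughly a threefold over-representation of visually similar images among the reads.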

Notes

1. Given a set of n pairs of binary labels $y \in \{0,1\}$ and retrieval probabilities p, the cross-entropy loss is:
$$l(y,p) = - \frac{1}{n}\sum _{i=1}^{n} \left[ y_i \cdot \log (p_i) + (1-y_i) \cdot \log (1-p_i) \right].$$
2. Functions of the type:
$$ f(x) = \frac{1}{1+\exp (ax - b)}. $$
3. Given two vectors $\mathbf {u}$ and $\mathbf {v}$, the cosine distance is:
$$ d(\mathbf {u},\mathbf {v}) = 1 - \frac{\mathbf {u} \cdot \mathbf {v}}{||\mathbf {u}||\ ||\mathbf {v}||}. $$
4. Given an N-dimensional vector $\mathbf {u}$, the softmax function is defined element-wise as follows:
$$ \mathrm {softmax}{(\mathbf {u})}_i = \frac{e^{u_i}}{\sum _{j=1}^{N} e^{u_j}}. $$
5. The ReLU function is defined as:
$$ \mathrm {ReLU}(x) = \max (x, 0). $$
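
The five definitions in the notes above can be written out as a minimal Python sketch using only the standard library. The function names and the example inputs in the usage lines are assumptions made for illustration; this excerpt does not say how the authors implemented or parameterized these functions.

```python
import math
from typing import List, Sequence


def cross_entropy(labels: Sequence[int], probs: Sequence[float]) -> float:
    """Note 1: mean binary cross-entropy over n (label, probability) pairs.

    Assumes every probability lies strictly between 0 and 1.
    """
    total = 0.0
    for y, p in zip(labels, probs):
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return -total / len(labels)


def sigmoid_type(x: float, a: float, b: float) -> float:
    """Note 2: f(x) = 1 / (1 + exp(a*x - b))."""
    return 1.0 / (1.0 + math.exp(a * x - b))


def cosine_distance(u: Sequence[float], v: Sequence[float]) -> float:
    """Note 3: one minus the cosine of the angle between u and v."""
    dot = sum(ui * vi for ui, vi in zip(u, v))
    norm_u = math.sqrt(sum(ui * ui for ui in u))
    norm_v = math.sqrt(sum(vi * vi for vi in v))
    return 1.0 - dot / (norm_u * norm_v)


def softmax(u: Sequence[float]) -> List[float]:
    """Note 4: element-wise softmax of an N-dimensional vector."""
    # Subtracting the maximum does not change the result; it only
    # keeps exp() from overflowing on large inputs.
    m = max(u)
    exps = [math.exp(ui - m) for ui in u]
    s = sum(exps)
    return [e / s for e in exps]


def relu(x: float) -> float:
    """Note 5: ReLU(x) = max(x, 0)."""
    return max(x, 0.0)


# Example inputs below are illustrative only.
print(cross_entropy([1, 0], [0.9, 0.2]))
print(sigmoid_type(0.5, a=10.0, b=5.0))
print(cosine_distance([1.0, 0.0], [0.0, 1.0]))
print(softmax([1.0, 2.0, 3.0]))
print(relu(-2.0))
```

With a > 0, the function in note 2 decreases as x grows, so a small input maps to a value near 1 and a large input to a value near 0.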

Acknowledgments

We would like to thank the anonymous reviewers for their input, which was very helpful in improving the manuscript. We also thank the Molecular Information Systems Lab and Seelig Lab members for their input, especially Max Willsey, who helped frame an early version. We thank Dr. Anne Fischer for suggesting a better way to present some of the data. This work was supported in part by Microsoft and by a grant from DARPA under the Molecular Informatics Program.

Author information

Authors and Affiliations

Paul G. Allen School of Computer Science & Engineering, University of Washington, Seattle, USA
Kendall Stewart, David Ward, Xiaomeng Liu, Georg Seelig, Karin Strauss & Luis Ceze
Microsoft Research, Redmond, WA, USA
Yuan-Jyue Chen & Karin Strauss

Corresponding author

Correspondence to Kendall Stewart.

Editor information

Editors and Affiliations

University of California, Davis, CA, USA
David Doty
Technical University Munich, Garching, Germany
Hendrik Dietz

About this paper

Cite this paper

Stewart, K. et al. (2018). A Content-Addressable DNA Database with Learned Sequence Encodings. In: Doty, D., Dietz, H. (eds) DNA Computing and Molecular Programming. DNA 2018. Lecture Notes in Computer Science, vol 11145. Springer, Cham. https://doi.org/10.1007/978-3-030-00030-1_4

DOI: https://doi.org/10.1007/978-3-030-00030-1_4
Published: 07 September 2018
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-00029-5
Online ISBN: 978-3-030-00030-1
eBook Packages: Computer Science, Computer Science (R0)