A Content-Addressable DNA Database with Learned Sequence Encodings

  • Conference paper
DNA Computing and Molecular Programming (DNA 2018)

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 11145)

Abstract

We present strand and codeword design schemes for a DNA database capable of approximate similarity search over a multidimensional dataset of content-rich media. Our strand designs address cross-talk in associative DNA databases, and we demonstrate a novel method for learning DNA sequence encodings from data, applying it to a dataset of tens of thousands of images. We test our design in the wetlab using one hundred target images and ten query images, and show that our database is capable of performing similarity-based enrichment: on average, visually similar images account for 30% of the sequencing reads for each query, despite making up only 10% of the database.

Notes

  1. Given a set of n pairs of binary labels \(y_i \in \{0,1\}\) and retrieval probabilities \(p_i\), the cross-entropy loss is:

    $$ l(y,p) = -\frac{1}{n}\sum_{i=1}^{n} \left[ y_i \cdot \log(p_i) + (1-y_i) \cdot \log(1-p_i) \right]. $$

  2. Functions of the type:

    $$ f(x) = \frac{1}{1+\exp(ax - b)}. $$

  3. Given two vectors \(\mathbf{u}\) and \(\mathbf{v}\), the cosine distance is:

    $$ d(\mathbf{u},\mathbf{v}) = 1 - \frac{\mathbf{u} \cdot \mathbf{v}}{\|\mathbf{u}\|\ \|\mathbf{v}\|}. $$

  4. Given an N-dimensional vector \(\mathbf{u}\), the softmax function is defined element-wise as follows:

    $$ \mathrm{softmax}(\mathbf{u})_i = \frac{e^{u_i}}{\sum_{j=1}^{N} e^{u_j}}. $$

  5. The ReLU function is defined as:

    $$ \mathrm{ReLU}(x) = \max(x, 0). $$
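
The five definitions in these notes can be sketched directly in code. The following is a minimal illustration using only the Python standard library, not the authors' implementation: the function names (`cross_entropy`, `sigmoid_type`, `cosine_distance`, `softmax`, `relu`) are hypothetical, and the clamping constant `eps` and the max-subtraction in `softmax` are numerical-stability assumptions that do not appear in the notes.

```python
import math
from typing import List, Sequence


def cross_entropy(labels: Sequence[int], probs: Sequence[float],
                  eps: float = 1e-12) -> float:
    """Note 1: mean binary cross-entropy over n (label, probability) pairs."""
    if len(labels) != len(probs) or len(labels) == 0:
        raise ValueError("labels and probs must be non-empty and equal length")
    total = 0.0
    for y, p in zip(labels, probs):
        # Assumption: clamp p away from 0 and 1 so log() stays finite.
        p = min(max(p, eps), 1.0 - eps)
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return -total / len(labels)


def sigmoid_type(x: float, a: float, b: float) -> float:
    """Note 2: f(x) = 1 / (1 + exp(a*x - b)), evaluated without overflow."""
    z = a * x - b
    if z >= 0:
        e = math.exp(-z)          # exp(-z) <= 1, so this cannot overflow
        return e / (1.0 + e)      # algebraically equal to 1 / (1 + exp(z))
    return 1.0 / (1.0 + math.exp(z))


def cosine_distance(u: Sequence[float], v: Sequence[float]) -> float:
    """Note 3: one minus the cosine of the angle between u and v."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_u * norm_v)


def softmax(u: Sequence[float]) -> List[float]:
    """Note 4: element-wise softmax of an N-dimensional vector."""
    m = max(u)  # subtracting the max leaves the result unchanged
    exps = [math.exp(x - m) for x in u]
    s = sum(exps)
    return [e / s for e in exps]


def relu(x: float) -> float:
    """Note 5: ReLU(x) = max(x, 0)."""
    return max(x, 0.0)


if __name__ == "__main__":
    print(cross_entropy([1, 0], [0.9, 0.2]))
    print(sigmoid_type(0.0, a=1.0, b=0.0))
    print(cosine_distance([1.0, 0.0], [0.0, 1.0]))
    print(softmax([1.0, 2.0, 3.0]))
    print(relu(-3.0), relu(2.5))
```

With a > 0, `sigmoid_type` decreases as x grows, so it maps larger inputs to values closer to 0 and smaller inputs to values closer to 1.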

Acknowledgments

We would like to thank the anonymous reviewers, whose input was very helpful in improving the manuscript. We also thank the members of the Molecular Information Systems Lab and the Seelig Lab for their input, especially Max Willsey, who helped frame an early version. We thank Dr. Anne Fischer for suggesting a better way to present some of the data. This work was supported in part by Microsoft and by a grant from DARPA under the Molecular Informatics Program.

Author information

Corresponding author

Correspondence to Kendall Stewart.

Copyright information

© 2018 Springer Nature Switzerland AG

About this paper

Cite this paper

Stewart, K. et al. (2018). A Content-Addressable DNA Database with Learned Sequence Encodings. In: Doty, D., Dietz, H. (eds) DNA Computing and Molecular Programming. DNA 2018. Lecture Notes in Computer Science, vol 11145. Springer, Cham. https://doi.org/10.1007/978-3-030-00030-1_4

  • DOI: https://doi.org/10.1007/978-3-030-00030-1_4

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-00029-5

  • Online ISBN: 978-3-030-00030-1

  • eBook Packages: Computer Science (R0)
