Elsevier

Biosystems

Volume 106, Issue 1, October 2011, Pages 51-56
A DNA assembly model of sentence generation

https://doi.org/10.1016/j.biosystems.2011.06.007

Abstract

Recent results in corpus-based linguistics demonstrate that context-appropriate sentences can be generated by a stochastic constraint-satisfaction process. Exploiting the similarity between constraint satisfaction and DNA self-assembly, we explore a DNA assembly model of sentence generation. The words and phrases of a language corpus are encoded as DNA molecules to build a language model of the corpus. Given a seed word, new sentences are constructed by a parallel DNA assembly process based on the probability distribution of the word and phrase molecules. Here, we present our DNA codeword design and report a successful small-scale demonstration of its feasibility in wet DNA experiments.

Introduction

DNA molecules have been used for various purposes, much of it exploiting the self-assembly property of DNA (Adleman, 1994, Mao et al., 2000, Maune et al., 2010, Paun et al., 1998, Reif and LaBean, 2007, Seeman, 2003, Winfree et al., 1998, Yan et al., 2003, Zheng et al., 2009). The nucleotides A, T, G, and C recognize their counterparts by Watson–Crick complementarity (A–T and G–C), and this molecular recognition property can be used to assemble small DNA structures into larger ones. Many information processes can be formulated as a DNA self-assembly process once the molecular building blocks are defined and the target structures are appropriately represented. In this paper we explore the potential of using DNA self-assembly for sentence generation based on a language corpus. Recent studies in usage-based linguistics (Tomasello, 2003, Tummers et al., 2005) and corpus-based language processing (Biber et al., 1998, Foster, 2007, Heylen et al., 2008, McEnery et al., 2006) show that language models can be built from a language corpus and that sentences can be generated by a stochastic constraint-satisfaction process, i.e., by satisfying the conditions or constraints encoded by micro-rules or features extracted from the corpus (Rosenfeld et al., 2001).
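To make the recognition rule concrete, the following Python sketch (our illustration, not code from the paper) treats Watson–Crick complementarity as a string operation: a strand forms a perfect duplex with a partner exactly when the partner is its reverse complement.

    # Watson-Crick complementarity as a string operation (illustrative sketch).
    COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}

    def reverse_complement(strand):
        """Return the reverse complement of a 5'->3' strand."""
        return "".join(COMPLEMENT[b] for b in reversed(strand))

    def hybridizes(a, b):
        """Two strands form a perfect duplex iff each is the other's reverse complement."""
        return a == reverse_complement(b)

    print(reverse_complement("ATGCCGTA"))      # TACGGCAT
    print(hybridizes("ATGCCGTA", "TACGGCAT"))  # True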

A core human linguistic capability is to assemble words into proper sequences, decompose sentences into small fragments and words, and reconstruct sentences from phrases and words. Once the words are encoded as DNA molecules, we may exploit the molecular recognition properties of DNA to assemble the words into phrases and sentences. By designing the molecules to reflect the constraints expressed in the language corpus, language generation can be simulated naturally by the DNA assembly process. Recently, simulation studies were conducted to evaluate this potential (Lee et al., 2009, Zhang and Park, 2008). The authors collected drama dialogues to build a language corpus and used a specialized probabilistic graphical model, called a hypernetwork, as the language model. Once the hypernetwork model is constructed, new sentences are generated from it by assembling the linguistic fragments that match the given seed words. Since the encoding and decoding of sentences are modeled after DNA hypernetworks (Zhang and Kim, 2006), the whole process can be implemented in wet DNA experiments.
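As a rough software analogue of this encoding idea (our sketch; the codewords below are hypothetical, not the paper's actual sequences from Table 2), each word can be mapped to a short DNA codeword, and two word strands "assemble" into a phrase when a linker strand is complementary to their junction:

    # Sketch: words as hypothetical DNA codewords; two words assemble into a
    # phrase when a linker strand complements the junction between them.
    COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}

    def reverse_complement(strand):
        return "".join(COMPLEMENT[b] for b in reversed(strand))

    CODEWORDS = {  # hypothetical 8-mer codewords for three words
        "i":    "ATCGGCTA",
        "love": "GGATCCAT",
        "you":  "TTGACGGA",
    }

    def junction(word_a, word_b, half=4):
        """Sticky-end region spanning the tail of word_a and the head of word_b."""
        return CODEWORDS[word_a][-half:] + CODEWORDS[word_b][:half]

    def can_link(word_a, word_b, linker):
        """A linker glues two word strands if it complements their junction."""
        return linker == reverse_complement(junction(word_a, word_b))

    linker = reverse_complement(junction("i", "love"))
    print(can_link("i", "love", linker))  # True: the phrase "i love" assembles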

In this paper we present a method for encoding words and phrases as DNA sequences, together with a DNA-based model for assembling the words into phrases to generate sentences. The procedures are verified by in vitro DNA experiments on a vocabulary of 8 words, and the results are confirmed by sequencing the DNA products of the assembly processes. We further demonstrate the potential of the DNA assembly procedures for language processing by performing in silico simulations on a large corpus of sentences collected from TV dramas.

Section snippets

The DNA assembly model of language

The language model is constructed using the DNA hypernetwork structure (Zhang, 2008). The language hypernetwork consists of a large number of hyperedges corresponding to phrases, where each hyperedge is the set of words constituting its phrase. Each word can be encoded as a DNA sequence, and each phrase or hyperedge as a consecutive sequence of such word molecules. Given a corpus of sentences, a language hypernetwork is built by sampling many hyperedges from each sentence. In generating a new
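A minimal in-silico sketch of this construction (ours; the helper names and toy corpus are hypothetical) samples fixed-order word windows as hyperedges, then grows a sentence from a seed word by repeatedly drawing an overlapping hyperedge in proportion to its corpus frequency:

    import random
    from collections import Counter

    def sample_hyperedges(sentences, order=3):
        """Build the hypernetwork: count every length-`order` word window."""
        edges = Counter()
        for sentence in sentences:
            words = sentence.split()
            for i in range(len(words) - order + 1):
                edges[tuple(words[i:i + order])] += 1
        return edges

    def generate(edges, seed, max_len=10):
        """Grow a sentence from a seed word by stochastically chaining hyperedges."""
        sentence = [seed]
        while len(sentence) < max_len:
            # hyperedges whose first word matches the current sentence tail
            cands = [(e, c) for e, c in edges.items() if e[0] == sentence[-1]]
            if not cands:
                break
            r = random.uniform(0, sum(c for _, c in cands))
            for e, c in cands:
                r -= c
                if r <= 0:
                    sentence.extend(e[1:])
                    break
        return " ".join(sentence)

    corpus = ["i love you", "i miss you so much", "you love me"]
    print(generate(sample_hyperedges(corpus, order=2), seed="i"))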

Experiments and results

To verify the feasibility of our DNA assembly model, we performed a restricted wet experiment. The nucleotide sequences for sentence generation are given in Table 2. All sequences were generated by the sequence design method described in Section 2.3, and the frequency distribution of their free energies was checked, as shown in Fig. 6.
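The design method itself is not reproduced in this snippet; as a crude stand-in (our sketch, not the paper's procedure), the code below filters random codewords by GC content and by Hamming distance to previously chosen words and their reverse complements, which captures the goal of avoiding unintended cross-hybridization without computing free energies:

    import random

    COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}

    def reverse_complement(s):
        return "".join(COMPLEMENT[b] for b in reversed(s))

    def gc_content(s):
        return (s.count("G") + s.count("C")) / len(s)

    def hamming(a, b):
        return sum(x != y for x, y in zip(a, b))

    def design_codewords(n, length=20, min_dist=8, seed=0):
        """Greedily pick codewords with balanced GC content and mutual dissimilarity."""
        rng = random.Random(seed)
        words = []
        while len(words) < n:
            cand = "".join(rng.choice("ATGC") for _ in range(length))
            if not 0.4 <= gc_content(cand) <= 0.6:
                continue  # keep melting behavior roughly uniform
            if all(hamming(cand, w) >= min_dist and
                   hamming(cand, reverse_complement(w)) >= min_dist
                   for w in words):
                words.append(cand)
        return words

    for w in design_codewords(8):  # 8 codewords, one per word in the vocabulary
        print(w, round(gc_content(w), 2))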

All sequences were purchased from Bioneer (Daejeon, Korea), and each sequence was brought to a stock concentration of 100 μM in distilled water and stored at −20 °C. The hybridization was

Concluding remarks

We have designed a DNA encoding method for words and phrases and presented a DNA assembly model for sentence generation based on the hypernetwork structure. We performed in vitro DNA experiments and demonstrated that context-appropriate sentences are generated from the DNA language model. Since the DNA language model can be constructed from a large corpus of sentences, the DNA assembly model can be used to simulate usage-based linguistics and corpus-based language processing. In

Acknowledgements

This work was supported in part by the Ministry of Education, Science and Technology through NRF (KRF-2008-314-D00377, 2010-0017734, 0421-20110032, 2010K001137, 2010-0020821, 2011-0000331, 2011-0001643), the Ministry of Knowledge Economy through KEIT (IITA-2009-A1100-0901-1639), the BK21-IT Program, and the Korea Student Aid Foundation (KOSAF) (No. S2-2009-000-01116-1).

References (23)

  • R. Rosenfeld et al.

    Whole-sentence exponential language models: a vehicle for linguistic-statistical integration

    Computer Speech and Language

    (2001)
  • L.M. Adleman

    Molecular computation of solutions to combinatorial problems

    Science

    (1994)
  • D. Biber et al.

    Corpus Linguistics

    (1998)
  • M. Foster

    Issues for corpus-based multimodal generation

    Citeseer

    (2007)
  • K. Heylen et al.

    Methodological issues in corpus-based cognitive linguistics

    Cognitive Sociolinguistics: Language Variation, Cultural Models, Social Systems

    (2008)
  • J.-H. Lee et al.

    Hypernetwork memory-based model for infant's language learning

    Journal of KIISE: Computing Practices and Letters

    (2009)
  • C. Mao et al.

    Logical computation using algorithmic self-assembly of DNA triple-crossover molecules

    Nature

    (2000)
  • N.R. Markham et al.

    UNAFold: software for nucleic acid folding and hybridization

    Methods in Molecular Biology

    (2008)
  • H.T. Maune et al.

    Self-assembly of carbon nanotubes into two-dimensional geometries using DNA origami templates

    Nature Nanotechnology

    (2010)
  • T. McEnery et al.

    Corpus-based Language Studies: An Advanced Resource Book

    (2006)
  • G. Paun et al.

    DNA Computing: New Computing Paradigms

    (1998)