A DNA assembly model of sentence generation
Introduction
DNA molecules have been used for purposes ranging from computation to nanoscale construction. Most of these uses exploit the self-assembly property of DNA (Adleman, 1994, Mao et al., 2000, Maune et al., 2010, Paun et al., 1998, Reif and LaBean, 2007, Seeman, 2003, Winfree et al., 1998, Yan et al., 2003, Zheng et al., 2009). The nucleotides A, T, G, and C recognize their counterparts by Watson–Crick complementarity (A–T and G–C), and this molecular recognition property can be used to assemble small DNA structures into larger ones. Many information processes can be formulated as a DNA self-assembly process once the molecular building blocks are defined and their target structures are appropriately represented. In this paper we explore the potential of using DNA self-assembly for sentence generation based on a language corpus. Recent studies in usage-based linguistics (Tomasello, 2003, Tummers et al., 2005) and corpus-based language processing (Biber et al., 1998, Foster, 2007, Heylen et al., 2008, McEnery et al., 2006) show that language models can be built from a language corpus, and that sentences can be generated by a stochastic constraint satisfaction process, i.e. by satisfying the conditions or constraints encoded by micro-rules or features extracted from the corpus (Rosenfeld et al., 2001).
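The Watson–Crick pairing rule mentioned above can be illustrated with a minimal Python sketch. The function names and the six-base example strand are hypothetical and chosen only for illustration; they are not sequences or procedures from the paper, and real hybridization also depends on thermodynamics that this sketch ignores.

```python
# Watson-Crick pairing rules: A pairs with T, G pairs with C.
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def reverse_complement(strand: str) -> str:
    """Return the strand that pairs with `strand`, both written 5'->3'."""
    return "".join(COMPLEMENT[base] for base in reversed(strand))


def can_hybridize(strand_a: str, strand_b: str) -> bool:
    """True if the two 5'->3' strands are perfect Watson-Crick partners."""
    return strand_b == reverse_complement(strand_a)


# Hypothetical word code (an assumption, not a sequence from the paper).
word_code = "ATGCCG"
probe = reverse_complement(word_code)
print(probe, can_hybridize(word_code, probe))
```

Only perfect complementarity is tested here; partial matches and mismatches are treated as non-binding, which is a simplification.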
A core human linguistic capability is the ability to assemble words into proper sequences, decompose sentences into small fragments and words, and reconstruct sentences from phrases and words. Once the words are encoded as DNA molecules, we may exploit the molecular recognition properties of DNA to assemble the words into phrases and sentences. By designing the molecules to reflect the constraints expressed in the language corpus, language generation can be naturally simulated by the DNA assembly process. Recently, simulation studies were conducted to evaluate this potential (Lee et al., 2009, Zhang and Park, 2008). They collected drama dialogues to build a language corpus and used a specialized probabilistic graphical model called a hypernetwork to build a language model. Once the hypernetwork model is constructed, new sentences are generated from the model by assembling the linguistic fragments that match the given seed words. Since the encoding and decoding of sentences are modeled after the DNA hypernetworks (Zhang and Kim, 2006), the whole process can be implemented in DNA wet experiments.
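The idea of generating a sentence by assembling fragments that match a seed word can be sketched in silico as follows. This is a simplified illustration under stated assumptions, not the procedure of the cited studies: fragments are assumed to be tuples of words, a fragment "matches" when its first word equals the last word of the growing sentence, and the toy fragments are hypothetical rather than drawn from the drama corpus.

```python
import random


def generate(hyperedges, seed_word, max_len=10, rng=None):
    """Grow a sentence rightwards from seed_word by chaining overlapping fragments."""
    rng = rng or random.Random()
    sentence = [seed_word]
    while len(sentence) < max_len:
        # A fragment matches when its first word equals the current last word.
        candidates = [e for e in hyperedges if len(e) > 1 and e[0] == sentence[-1]]
        if not candidates:
            break  # no fragment can extend the sentence any further
        fragment = rng.choice(candidates)  # stochastic choice among matches
        sentence.extend(fragment[1:])
    return sentence[:max_len]


# Hypothetical toy fragments (assumption: not taken from the paper's corpus).
fragments = [("i", "like"), ("like", "you"), ("you", "too")]
print(" ".join(generate(fragments, "i")))
```

The one-word overlap plays the role that complementary sticky ends would play in a DNA implementation; longer overlaps would impose stronger context constraints.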
In this paper we present a method for designing DNA sequences that encode words and phrases. We also present a DNA-based model for assembling the words into phrases to generate sentences. The procedures are verified by in vitro DNA experiments on a vocabulary of 8 words. The results are confirmed by sequencing the DNA products of the assembly processes. We also demonstrate the potential of the DNA assembly procedures for language processing by performing in silico simulations on a large corpus of sentences collected from TV dramas.
Section snippets
The DNA assembly model of language
The language model is constructed using the DNA hypernetwork structure (Zhang, 2008). The language hypernetwork consists of a large number of hyperedges which correspond to phrases, where each hyperedge consists of words constituting the phrase. Each word can be encoded as a DNA sequence and each phrase or hyperedge as a consecutive sequence of DNA molecules. Given a corpus of sentences, a language hypernetwork is built by sampling many hyperedges from each sentence. In generating a new
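Building a hypernetwork by sampling hyperedges from sentences, and encoding each hyperedge as a concatenation of word codes, can be sketched as follows. The sketch assumes that a hyperedge is a window of consecutive words of fixed size (`order`) and that each word has a fixed DNA code in a lookup table; the function names, parameters, and the four-base codes are hypothetical and are not the sequences listed in the paper.

```python
import random


def sample_hyperedges(sentence, order=2, n_samples=5, rng=None):
    """Sample n_samples windows of `order` consecutive words from a sentence."""
    rng = rng or random.Random()
    words = sentence.split()
    if len(words) < order:
        return []  # sentence too short to yield a hyperedge of this order
    starts = [rng.randrange(len(words) - order + 1) for _ in range(n_samples)]
    return [tuple(words[s:s + order]) for s in starts]


def encode_hyperedge(hyperedge, codebook):
    """Concatenate the per-word DNA codes into one strand for the phrase."""
    return "".join(codebook[word] for word in hyperedge)


# Hypothetical codebook (assumption: illustrative codes, not from the paper).
codebook = {"i": "ACGT", "like": "TTGA", "you": "GGCA", "too": "CATC"}
edges = sample_hyperedges("i like you too", order=2, n_samples=10,
                          rng=random.Random(0))
strands = [encode_hyperedge(edge, codebook) for edge in edges]
```

Sampling with replacement means frequent word pairs in the corpus appear as many copies of the same strand, which is one simple way to let usage frequency influence later assembly.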
Experiments and results
To verify the feasibility of our DNA assembly model, we performed a restricted wet experiment. The nucleotide sequences for sentence generation are given in Table 2. All sequences were generated by the sequence design method described in Section 2.3. Their free energy frequencies were checked as shown in Fig. 6.
All sequences were purchased from Bioneer (Daejeon, Korea), and each sequence was brought to a stock concentration of 100 µM in distilled water and stored at −20 °C. The hybridization was
Concluding remarks
We have designed a DNA encoding method for words and phrases, and presented a DNA assembly model for sentence generation which is based on the hypernetwork structure. We performed in vitro DNA experiments and demonstrated that context-appropriate sentences are generated from the DNA language model. Since the DNA language model can be constructed from a large corpus of sentences, the DNA assembly model can be used to simulate usage-based linguistics and corpus-based language processing. In
Acknowledgements
This work was supported in part by the Ministry of Education, Science, and Technology through NRF (KRF-2008-314-D00377, 2010-0017734, 0421-20110032, 2010K001137, 2010-0020821, 2011-0000331, 2011-0001643), the Ministry of Knowledge and Economy through KEIT (IITA-2009-A1100-0901-1639), the BK21-IT Program, and the Korea Student Aid Foundation (KOSAF) (No. S2-2009-000-01116-1).
References (23)
- Rosenfeld et al. Whole-sentence exponential language models: a vehicle for linguistic-statistical integration. Computer Speech and Language (2001)
- Adleman. Molecular computation of solutions to combinatorial problems. Science (1994)
- Biber et al. Corpus Linguistics (1998)
- Foster. Issues for corpus-based multimodal generation. Citeseer (2007)
- Heylen et al. Methodological issues in corpus-based cognitive linguistics. Cognitive Sociolinguistics: Language Variation, Cultural Models, Social Systems (2008)
- Lee et al. Hypernetwork memory-based model for infant's language learning. Journal of KIISE: Computing Practices and Letters (2009)
- Mao et al. Logical computation using algorithmic self-assembly of DNA triple-crossover molecules. Nature (2000)
- Markham et al. UNAFold: software for nucleic acid folding and hybridization. Methods in Molecular Biology (2008)
- Maune et al. Self-assembly of carbon nanotubes into two-dimensional geometries using DNA origami templates. Nature Nanotechnology (2010)
- McEnery et al. Corpus-based Language Studies: An Advanced Resource Book (2006)
- Paun et al. DNA Computing: New Computing Paradigms (1998)