GenePath: a system for inference of genetic networks and proposal of genetic experiments
Introduction
Genes are basic units of genetic material. They are coded in double-stranded deoxyribonucleic acids (DNAs) and reside within chromosomes of each living cell. Generally, each gene encodes a specific protein. In the process of protein synthesis a gene is first transcribed to a single-stranded ribonucleic acid (RNA) molecule called messenger RNA (mRNA), which is then translated to a protein. Genes can excite or inhibit each other at various levels. The molecular mechanisms of excitation or inhibition can involve direct regulation of gene expression where the translation product (protein) of one gene directly regulates the transcription of another gene through protein–DNA interactions. They can also involve protein–RNA interactions that regulate mRNA translation, or protein–protein interactions that regulate protein activity, stability or localization. In this manuscript, we refer to excitatory or inhibitory relationships between genes regardless of the molecular mechanism.
Biologists often use genetic networks to express and study relations between genes and the biological processes they regulate. Genetic networks are models that, in their often very simplified way, describe some biological phenomenon from the viewpoint of interactions between genes. They provide a high-level view and disregard most details on how exactly one gene regulates the activity of another. They include the genes under study and the biological processes (in most cases only one) that are regulated by these genes. A genetic network is a graph whose nodes correspond to genes and biological processes, and whose arcs correspond to influences of genes on other genes and on biological processes. In this paper, we consider the most commonly used and reported type of genetic network, the qualitative genetic network. In qualitative genetic networks, the interaction between elements of the network is simplified to either inhibition (negative influence) or excitation (positive influence) of activity, while the magnitude of the influence and the particular regulation function are not shown.
Consider, for example, the genetic network shown in Fig. 1. It includes four genes (regA, pufA, pkaR and pkaC) and a single biological process (agg, aggregation of the soil amoeba Dictyostelium discoideum). Arcs marked with ⊣ denote inhibition, and those with → denote excitation. Hypothetically, keeping all other activities constant, the activity of pkaR should rise when the activity of regA is elevated, and fall when the activity of regA is reduced. Conversely, and according to the network, the activity of pkaC should decrease when the activity of pkaR (or pufA) increases. Notice that the qualitative network does not state to what extent, or exactly how, the activity of one gene must change to produce a change in the activity of other genes. It depicts only the type of relation. Perhaps even more importantly, genetic networks encode genetic pathways, i.e. chains (paths) of gene regulation. For instance, from Fig. 1 it is clear that gene pkaC is regulated by gene regA, but the regulation is not direct, as it goes through another gene (pkaR). On the other hand, the genes pufA and pkaR directly influence the activity of pkaC.
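The reading of a qualitative network described above can be sketched in a few lines of Python. This is only an illustration, not GenePath's implementation: the dictionary `NETWORK` and the helpers `path_sign` and `direct_regulators` are hypothetical names, only the arcs whose sign is stated in the text are included, and multiplying signs along a path is an assumed convention that holds only when all other activities are kept constant.

```python
# Signed directed graph: +1 stands for excitation, -1 for inhibition.
# Assumption: only arcs whose sign is stated in the text are listed here.
NETWORK = {
    ("regA", "pkaR"): +1,  # pkaR activity rises with regA activity
    ("pkaR", "pkaC"): -1,  # pkaC activity falls when pkaR activity rises
    ("pufA", "pkaC"): -1,  # pkaC activity falls when pufA activity rises
}


def path_sign(path, network=NETWORK):
    """Net qualitative influence along a chain of genes (product of arc signs)."""
    sign = 1
    for source, target in zip(path, path[1:]):
        sign *= network[(source, target)]
    return sign


def direct_regulators(target, network=NETWORK):
    """Genes with an arc pointing straight at `target`."""
    return sorted(source for (source, dest) in network if dest == target)


print(path_sign(["regA", "pkaR", "pkaC"]))  # -1: regA indirectly inhibits pkaC
print(direct_regulators("pkaC"))            # ['pkaR', 'pufA']
```

The sketch carries no magnitudes, which mirrors the point that a qualitative network depicts only the type of relation.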
The notion of “directness” is based on the closed-world assumption: genetic networks consider only a (small) subset of the cell’s genes and ignore all other genes. For instance, the network from Fig. 1 depicts only relations between regA, pufA, pkaR, pkaC and aggregation. Therefore, the arrow between regA and pufA only states that none of the other considered genes mediates between regA and pufA. There may be other genes in the cell—ones that are not included in the study and the network—on the path from regA to pufA.
A geneticist’s main tool for investigating biological phenomena and building the corresponding genetic networks is the induction of mutations. Two types of mutations are used most often. In gene inactivation, geneticists remove a gene from the DNA (this is also called a “knock-out”) or use some other means to prevent it from being expressed. The opposite mutation, which is often harder to induce in the laboratory, is activation of a gene (also referred to as overexpression), which results in a gene being significantly more active than under normal conditions. Although an organism can have more than one gene mutated, geneticists mostly use only single and double mutants, since inducing a greater number of mutations can be technically difficult. Geneticists then infer genetic networks by comparing the resulting biological processes in different strains of mutated organisms, or by comparing mutant organisms to organisms without any artificially induced mutations (so-called wild-type organisms).
For instance, upon starvation, the soil amoeba D. discoideum aggregates and forms a multi-cellular organism. But a mutant with a knocked-out gene pkaC does not aggregate; this provides evidence that pkaC is essential for the aggregation process and that its effect on aggregation is positive (excitatory). Conversely, a mutant with a knocked-out pufA aggregates excessively, which also indicates the involvement of this gene in the aggregation process, but this time with a negative (inhibitory) effect. Determining the order of genes in the network and discovering whether genes act in parallel requires at least three experiments, at least one of which must involve more than a single mutated gene. The logic and inference patterns behind this are explained later in the paper.
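The single-mutant reasoning in the two examples above can be written as a small rule. The following Python sketch is an assumption-laden illustration rather than one of GenePath's actual patterns: the class `Experiment`, the ordinal scale `PHENOTYPE_LEVEL` and the function `infer_influence` are hypothetical, and the treatment of overexpression as the mirror image of a knock-out is assumed by symmetry.

```python
from dataclasses import dataclass

# Hypothetical ordinal scale for the aggregation phenotype (assumption).
PHENOTYPE_LEVEL = {"none": 0, "normal": 1, "excessive": 2}


@dataclass(frozen=True)
class Experiment:
    gene: str
    mutation: str   # "knockout" or "overexpression"
    phenotype: str  # a key of PHENOTYPE_LEVEL


def infer_influence(exp, wild_type="normal"):
    """Return '+' (excitatory), '-' (inhibitory) or None (no evidence)."""
    change = PHENOTYPE_LEVEL[exp.phenotype] - PHENOTYPE_LEVEL[wild_type]
    if change == 0:
        return None  # this experiment alone gives no evidence of influence
    if exp.mutation == "knockout":
        # Removing an exciter lowers the process; removing an inhibitor raises it.
        change = -change
    elif exp.mutation != "overexpression":
        raise ValueError(f"unknown mutation type: {exp.mutation}")
    return "+" if change > 0 else "-"


print(infer_influence(Experiment("pkaC", "knockout", "none")))       # +
print(infer_influence(Experiment("pufA", "knockout", "excessive")))  # -
```

A rule of this kind says nothing about gene order; as noted above, ordering requires combining several experiments, including at least one multiple mutant.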
At this point, we should note that biological experiments with mutants are currently time consuming. It usually takes several weeks from preparing a mutant organism to obtaining experimental results in wet labs. For this reason, the largest genetic networks reported to date include no more than several tens of genes, and collecting the experiments that provide the evidence often takes several man-years of effort. Biologists therefore reason about a limited number of genes at a time, and usually construct their networks from no more than a few tens of experiments. But even for these, the analysis of the experimental data is not trivial. The combinatorial explosion in the number of possible relations and gene orderings calls for a tool that would assist in the construction of genetic networks and that would incorporate the basic principles of genetic data analysis.
In this paper, we report on GenePath (http://www.genepath.org), a program that may be regarded as an intelligent assistant in genetic data analysis. It constructs genetic networks from gathered experimental results and the already known relations between genes and processes (background knowledge). In addition, it can propose new experiments to refine the discovered networks or to gather additional evidence for their relations. Its primary data are genetic experiments, in which genes are either knocked out or activated, together with the behavior of the system under such mutations, given as qualitative descriptions of biological processes. To construct a genetic network, GenePath uses a set of expert-defined inference patterns of the form “IF a certain combination of experiments is found in the data, THEN a certain relationship between genes and a biological process is concluded”. When applied to the data, these inference patterns identify constraints for the corresponding genetic network. GenePath then uses these constraints to construct a genetic network that explains the data. The biological motivation of GenePath and details on the patterns it uses for abduction were first described in [23]; the current paper focuses on methodological issues, and extends the original GenePath framework with qualitative reasoning and with methods to propose alternative networks and additional experiments.
To propose genetic networks, GenePath performs abductive reasoning, as opposed to the more common deductive reasoning. Deductive reasoning starts with some given logical formula A and derives a new formula B such that B logically follows from A. That is, if A is true, then B must be true. In a sense, deductive reasoning does not produce any new information: all the information contained in the result B is already implied by the information given in A. On the other hand, abductive reasoning produces results that do not necessarily follow from the given information, but are in some respect relevant to it. Thus, abduction can result in new, useful information, but that information is not necessarily true. The most common application of abductive reasoning is the generation of explanations for observations. We have some background knowledge BK and some experimental observations E. The observations E do not logically follow from the existing knowledge BK, so we say that BK does not (entirely) explain the observations (although the two are not in contradiction). We are looking for a hypothesis H, so that BK together with H imply E (formally, $BK \land H \vdash E$). That is, H (in the context of BK) explains the given experiments. In GenePath, E represents the genetic experiments, BK is prior biological knowledge, and H is the desired genetic network.
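Abduction as a search for explanatory hypotheses can be illustrated with a toy example. The Python sketch below is not GenePath's inference engine. It assumes a deliberately simple theory in which each gene has a sign of influence on aggregation (+1, -1 or 0) and a knock-out changes the process in the opposite direction. The contents of `BK` and `E` and the functions `predict`, `explains` and `abduce` are all hypothetical.

```python
from itertools import product

# Background knowledge BK: gene -> known sign of influence on aggregation.
# Assumption: this single entry is a made-up piece of prior knowledge.
BK = {"pkaC": +1}

# Observations E: knocked-out gene -> observed change relative to wild type
# (-1 = less aggregation, +1 = more aggregation).
E = {"pkaC": -1, "pufA": +1}


def predict(theory, gene):
    """Predicted change in aggregation when `gene` is knocked out."""
    return -theory.get(gene, 0)


def explains(theory, observations):
    """True if the theory predicts every observed change."""
    return all(predict(theory, g) == c for g, c in observations.items())


def abduce(bk, observations):
    """Enumerate hypotheses H such that BK together with H explains E."""
    unknown = sorted(g for g in observations if g not in bk)
    hypotheses = []
    for signs in product((-1, 0, +1), repeat=len(unknown)):
        h = dict(zip(unknown, signs))
        if explains({**bk, **h}, observations):
            hypotheses.append(h)
    return hypotheses


print(explains(BK, E))  # False: BK alone does not explain E
print(abduce(BK, E))    # [{'pufA': -1}]
```

The returned hypothesis is consistent with the data but not guaranteed to be true, which is the distinction from deduction drawn above.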
The logic behind GenePath’s algorithm is the same as that used by geneticists world-wide when they manually construct network models (see, for example [2]). A major contribution of our work is the formalisation and partial automation of geneticists’ reasoning. Thus, a geneticist can follow the reasoning performed by GenePath and clearly understand the justification, in terms of experimental data, for including particular parts of the constructed network. The approach was successfully tested on several real-life applications. Two such applications, aggregation and sporulation of the soil amoeba D. discoideum, are reported in this paper. Another original contribution of GenePath originates from its particular implementation and interface: GenePath allows the user to examine the experimental evidence and the logic that were used to determine each particular relationship between genes. This allows a geneticist to trace every finding (relation) back to the set of original experiments that provided the evidence for it. Furthermore, GenePath can identify which pairs of genes could not be related based on the experimental data. It can then propose new experiments that would make it possible to relate each of these gene pairs, and propose, depending on the outcome of such experiments, revisions of the genetic network. As such, GenePath is not intended to replace the researcher, but rather to support the processes of cataloging and interpreting genetic experiments, the derivation of genetic networks, and the proposal of new experiments.
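The idea of finding gene pairs that the data could not relate, and then suggesting experiments for them, can be sketched as follows. This Python fragment is a hypothetical illustration and not GenePath's experiment-proposal module: the gene names `g1` to `g4` are placeholders, the relations and performed experiments are made up, and proposing the missing single and double mutants for each unrelated pair is an assumption based on the earlier remark that ordering genes needs at least one multiple mutant. The mutation type (knock-out or overexpression) is ignored for brevity.

```python
from itertools import combinations


def unrelated_pairs(genes, relations):
    """Gene pairs for which no relation was found in the data."""
    related = {frozenset(pair) for pair in relations}
    return [pair for pair in combinations(sorted(genes), 2)
            if frozenset(pair) not in related]


def propose_experiments(pairs, performed):
    """For each unrelated pair, list single and double mutants not yet performed."""
    done = {frozenset(mutant) for mutant in performed}
    proposals = []
    for a, b in pairs:
        for mutant in ((a,), (b,), (a, b)):
            if frozenset(mutant) not in done and mutant not in proposals:
                proposals.append(mutant)
    return proposals


# Placeholder data (assumption): four genes, four known relations.
genes = ["g1", "g2", "g3", "g4"]
relations = [("g1", "g2"), ("g2", "g3"), ("g1", "g3"), ("g3", "g4")]
performed = [("g1",), ("g2",), ("g3",), ("g4",), ("g2", "g3")]

pairs = unrelated_pairs(genes, relations)
print(pairs)                                  # [('g1', 'g4'), ('g2', 'g4')]
print(propose_experiments(pairs, performed))  # [('g1', 'g4'), ('g2', 'g4')]
```

In this toy case all single mutants already exist, so only the two double mutants are proposed.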
We start our description of GenePath with an example of genetic data on aggregation of D. discoideum that is used throughout the paper to illustrate the utility of developed techniques. A general description of GenePath’s framework is given next, followed by a description of particular mechanisms for abduction of relations, construction of genetic networks and proposal of genetic experiments. Finally, we mention another successful case of GenePath’s use (sporulation of Dictyostelium), provide some discussion of experimental results, utility and computational efficiency of GenePath, and conclude with some ideas for potential extensions of developed methodology.
It should be noted that the examples of Dictyostelium aggregation and sporulation in this paper are useful not only as illustrations of GenePath’s mechanisms; they are also of realistic complexity with respect to genetic research, and both represent relevant and currently investigated genetic problems. The complexity of these examples and of the resulting genetic networks is indicative of the complexity of genetic network theories in current research in functional genomics based on mutation experiments. At the end of the paper we also discuss the scalability of GenePath to much larger datasets with, say, a thousand genes. However, for reasons mentioned earlier, there is relatively little practical need, in mutation-based research, for scalability beyond a few tens of genes, which is easily accommodated within GenePath’s present capability.
Data and background knowledge
Experiments that GenePath can consider are represented as tuples of mutations and outcomes. A mutation is specified as a set of one or more genes with information on the type of mutation. A gene can be either knocked out or overexpressed. The outcome of an experiment is a geneticist’s description of biological processes, also termed a phenotype (e.g. “cells grow”, “cells do not grow”, etc.). It should be noted that the abstract representation used in GenePath is precisely the representation
Overview of GenePath
GenePath implements a framework for reasoning about genetic experiments and hypothesizing genetic networks. This framework, illustrated in Fig. 3, consists of the following entities:
- genetic data, i.e. experiments with mutations and corresponding outcomes,
- background knowledge in the form of known relations between genes and biological processes,
- expert-defined reasoning patterns, used by GenePath to abduce gene relations from genetic experiments,
- abductive inference engine that matches the encoded
Expert-defined abduction patterns
GenePath’s inference patterns define how various relations between genes may be abduced from the data. These inference patterns are implemented as clauses in the Prolog programming language and are used to determine the relation between two genes or the relation between a gene and a biological process. Every relation found from data is accompanied by evidence in the form of the name of the inference pattern that was used to find the relation and the experiments that support it. Where the
Synthesis of genetic networks
GenePath constructs a genetic network by considering all the relations it found as constraints over the possible networks and attempting to find a network that would satisfy all the constraints.
The mutual consistency of abduced network constraints is tested first. If conflicts are found (for instance, a gene reported not to influence the biological process and at the same time involved in some epistasis relation for the same process), they are shown to the domain expert and resolved
Proposal of genetic experiments
The nature of experimental work in genetics and biology is most often incremental. Experiments are usually costly, and after performing the initial ones, these are analyzed and decisions are taken about which experiments to perform next. To support such scientific discovery process, a particular module in GenePath was implemented that can propose additional experiments that would (potentially) refine the network and resolve ambiguities.
After analyzing the initial data set and developing a
Another case study: sporulation of Dictyostelium
For another case study that we report in this paper, GenePath was used to infer a genetic network that controls sporulation in Dictyostelium. Sporulation is a phase that follows aggregation, in which Dictyostelium amoebae specialize into cells that form either stalks or spores (Fig. 2). The data (Table 2) include 19 experiments involving six genes, and the biological process (sporulation) was graded from absence of sporulation through slow and normal (wild-type) sporulation to rapid sporulation. A
Implementation and web-based interface
GenePath is implemented in Prolog, and a substantial part of it is available through a web-based interface (http://www.genepath.org).
Discussion
Besides Dictyostelium aggregation and sporulation, GenePath has been tried on a number of other problems during its two years of development and testing. For instance, using 28 experiments, GenePath found a correct genetic network for growth and development of Dictyostelium that includes five genes [16], [17]. From the data on 20 experiments, it successfully reconstructed the programmed cell death genetic network of C. elegans that involves four genes [11]. In its most elaborate test, 79
Related work
GenePath uses a genetic logic similar to that described by Avery and Wasserman [2] for determining gene order by epistasis in regulatory networks. It uses abduction (see, for instance [8]) as an inference mechanism, and principles of expert systems and Artificial Intelligence for combining expert and domain knowledge and data.
The best-known AI system intended for application in classical genetics is Mark Stefik’s MOLGEN [18]. MOLGEN is an expert system for planning gene-cloning experiments in
Conclusion
GenePath has been subject to experimental verification on real-life data and current problems in genetics for over two years, and is now approaching the point where it can assist geneticists, scholars and students in discovering and reasoning about genetic networks. In this paper, we have presented and discussed several original and essential mechanisms of GenePath: abduction of relations between genes and biological processes, construction of genetic networks and proposal of genetic experiments.
Acknowledgements
This work was supported by a grant from Slovene Ministry of Education, Science and Sport (J2-3387-1539), by a grant from the National Institute of Child Health and Human Development (P01 HD39691-01) and by a travel grant from the National Academy of Sciences under the Collaboration in Basic Science and Engineering supported by Contract No. INT-0002341 from the National Science Foundation. G.S. is a recipient of the Basil O’Connor research award from the March of Dimes Birth Defects Foundation
References (23)
- et al. Inferring qualitative relations in genetic networks and metabolic pathways. Bioinformatics (2000)
- et al. Ordering gene function: the interpretation of epistasis in regulatory hierarchies. Trends Genet. (1992)
- Bratko I. Prolog programming for artificial intelligence. 3rd ed. Reading (MA): Addison-Wesley;...
- Riddle DL, Swanson MM, Albert PS. Interacting genes in nematode dauer larva formation. Nature...
- Dutilh B. Analysis of data from microarray experiments, the state of the art in gene network reconstruction. Literature...
- et al. Using Bayesian networks to analyze expression data. J. Comp. Biol. (2000)
- et al. Functional discovery via a compendium of expression profiles. Cell (2000)
- Kakas AC, Kowalski RA, Toni F. The role of abduction in logic programming. Handbook of logic in artificial intelligence...
- et al. The promise of a protist: the Dictyostelium genome project. Funct. Integr. Genom. (2001)
- Role of PKA in the timing of developmental events in Dictyostelium cells. Microbiol. Mol. Biol. Rev. (1998)
- Genetics of programmed cell death in C. elegans: past, present and future. Trends Genet.