1 Systems for the Automated Search of Literature

Systems for thematic literature search are important in almost any kind of scientific research. Their main task is to determine automatically how relevant electronic publications are to the information that a specialist is looking for. Such systems are most commonly built on logical and vector models and on mining techniques, which are often combined to improve the search.

1.1 Logical Query Model

Logical queries retrieve documents by comparing user-specified strings of keywords, joined by the logical operators AND, OR, and NOT, with all available documents. Depending on the full or partial string matches and on the logical operators in the query, each document is classified as satisfying the query or not. For example, the query “p53 AND open-angle glaucoma” returns all documents that contain both the protein name “p53” and the disease name “open-angle glaucoma,” whereas the query “p53 NOT cancer” returns only documents that mention “p53” but do not mention “cancer.” The advantage of this method is its ease of implementation. Its main drawbacks are the lack of means for forming complex queries that could, for example, take the relationships between objects into account, and excessive search redundancy [1].
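The behavior of the two example queries can be illustrated with a minimal Python sketch. The function names (matches, search), the flat query form, and the three toy documents are assumptions made for illustration; plain case-insensitive substring matching stands in for whatever matching a real system would use.

```python
def matches(document, must=(), must_not=(), any_of=()):
    """Return True if the document satisfies a flat AND / NOT / OR keyword query."""
    text = document.lower()

    def has(term):
        # Case-insensitive substring test (an assumption of this sketch).
        return term.lower() in text

    if not all(has(t) for t in must):          # AND: every term must occur
        return False
    if any(has(t) for t in must_not):          # NOT: none of these may occur
        return False
    if any_of and not any(has(t) for t in any_of):  # OR: at least one must occur
        return False
    return True


def search(docs, **query):
    """Return the sorted ids of all documents that satisfy the query."""
    return sorted(doc_id for doc_id, text in docs.items() if matches(text, **query))


# Toy documents invented for illustration.
docs = {
    "d1": "p53 was examined in patients with open-angle glaucoma.",
    "d2": "p53 is frequently discussed in the context of cancer.",
    "d3": "Open-angle glaucoma is a disease of the eye.",
}

hits_and = search(docs, must=["p53", "open-angle glaucoma"])  # p53 AND open-angle glaucoma
hits_not = search(docs, must=["p53"], must_not=["cancer"])    # p53 NOT cancer
```

Because the test is a bare substring match, a query term such as “cancer” would also match “cancerous,” which is one source of the search redundancy mentioned above.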

1.2 Vector Query Model

The vector query algorithm was proposed by Joyce and Needham [2]. It is based on the idea that similar documents should meet similar requirements.

The algorithm represents each document as a vector of terms in which each component is the number of occurrences of the corresponding term in the text. These vectors are compared with a control vector built from the user-specified query, which establishes the extent to which each article belongs to the specified subject area. This approach also makes it possible to group similar documents into clusters, which can significantly improve both the speed and the quality of the search. The first vector query algorithm was implemented by Salton in the SMART (Salton’s Magical Automatic Retriever of Text) search engine [3].
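A minimal sketch of this comparison is given below. The text does not say how document and query vectors are compared, so cosine similarity is used here as an assumption; the function names and the toy documents are likewise illustrative only.

```python
import math
import re
from collections import Counter


def term_vector(text):
    """Map each lowercase alphanumeric term to its number of occurrences."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(u, v):
    """Cosine similarity of two sparse term-frequency vectors."""
    dot = sum(u[t] * v[t] for t in u.keys() & v.keys())
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank(query, docs):
    """Order document ids by decreasing similarity to the query vector."""
    q = term_vector(query)
    scored = [(cosine(q, term_vector(text)), doc_id) for doc_id, text in docs.items()]
    return [doc_id for _, doc_id in sorted(scored, key=lambda s: (-s[0], s[1]))]


# Toy documents invented for illustration.
docs = {
    "d1": "glaucoma risk genes in open angle glaucoma",
    "d2": "p53 and cancer",
    "d3": "myopia genes",
}
ranking = rank("glaucoma genes", docs)
```

Unlike the logical model, every document receives a graded score, so the same vectors can also be compared with one another to group similar documents into clusters.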

1.3 Mining Methods

These methods include:

  • Statistical correlation methods that form rules relating documents to prespecified categories [4].

  • Clustering methods that group the document set by different semantic attributes, using linguistic and mathematical methods without a priori knowledge; the result of such analysis is a taxonomy of documents or a visual map that provides effective coverage of large amounts of data [5].

  • Relationship analysis methods that identify descriptors (key phrases) in the documents and thereby provide flexible navigation in the text [6].

  • Fact identification methods that extract knowledge in order to improve the classification, retrieval, and clustering of documents [7].

1.4 Existing Search Systems

The Entrez system [8] (http://www.ncbi.nlm.nih.gov/sites/gquery) searches the biological databases supported by NCBI, such as PubMed, GenBank, Structure, and Genome. It is based on vector and logical query models as well as mining techniques.

Muller and colleagues developed Textpresso [9] (http://www.textpresso.org), a search engine specialized for Caenorhabditis elegans that includes over 3,800 full-text articles and 16,000 abstracts. It is based on a modified vector method: the articles are split in advance into separate sentences and into terms relevant to C. elegans, which are stored in the database, and the search queries are split into words. This approach allowed the authors to improve the quality of the search in comparison with the classical vector query method.

PubMatrix [10] (http://pubmatrix.grc.nia.nih.gov) is a system that searches the PubMed database by comparing a user-specified set of terms, such as gene or protein names, with a set of their functions. As a result, the system provides a list of abstracts of scientific publications that link these genes or proteins with their functions.

2 Systems for Automated Identification of Biological Objects in Texts

Krauthammer and Nenadic distinguished three stages of automatic recognition of biological objects [11]:

  • Extraction of names, synonyms, and abbreviations from unstructured text

  • Identification and establishment of relationships between objects

  • Representation of the obtained information in a formalized form

There are three main approaches for automated identification of the names of biological objects in text:

  • With the use of rules and templates

  • With the use of statistical and machine learning methods

  • With the use of thematic vocabularies

2.1 Methods for Identification of the Biological Objects with the Use of Rules and Templates

These methods rely on a set of regular expressions (rules or patterns), normally written manually by specialists [12], that identify terms by their syntactic and semantic features. Ananiadou and McNaught concluded that systems built with these methods can achieve better-quality results than other approaches [13]. The main disadvantage of template-based methods is their poor performance on complex sentences.
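As a rough illustration of rule-based term identification, the sketch below applies two hand-written regular expressions to a sentence. Both patterns, their labels, and the function name find_terms are hypothetical and far simpler than the rule sets of the systems cited above.

```python
import re

# Hypothetical hand-written rules: (label, compiled pattern).
PATTERNS = [
    # Two to six capital letters followed by one or two digits, e.g. "PAX6".
    ("gene_symbol", re.compile(r"\b[A-Z]{2,6}\d{1,2}\b")),
    # Lowercase word ending in "ase", e.g. "kinase".
    ("enzyme", re.compile(r"\b[a-z]+ase\b")),
]


def find_terms(sentence):
    """Return (term, label) pairs in order of appearance in the sentence."""
    found = []
    for label, pattern in PATTERNS:
        for m in pattern.finditer(sentence):
            found.append((m.start(), m.group(), label))
    return [(text, label) for _, text, label in sorted(found)]


terms = find_terms("SMAD3 and PAX6 interact with a kinase in the retina.")
```

The enzyme rule also matches the word “disease,” and the gene rule misses lowercase names such as “p53,” which shows why such rules usually need careful manual refinement.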

2.2 Recognition of Biological Entities with the Use of Statistical Algorithms and Machine Learning Methods

Statistical approaches identify terms from the frequency of their occurrence in the text; these methods are effective for keyword identification. Systems based on machine learning methods search the text for specified classes of terms and identify and classify objects directly with the help of “training samples.” These samples are used to “train” the method so that it can produce high-quality object recognition and classification for a specific biological problem. The main problems of machine learning methods are the poor availability of training samples and the need for a large amount of high-quality data [13].

Collier and colleagues used a hidden Markov model and automated analysis of orthographic word features to extract terms belonging to ten predesignated classes [14]. The results of this system depended strongly on the quality of the training samples. Thus, for the protein class the F-score was 75.9 %, while the F-score for RNA was much lower because RNA was poorly represented in the training set. Similar results (an F-score of 75 %) were obtained by Morgan and colleagues in their analysis of gene names for the genus Drosophila (fruit flies) [15]. They used a hidden Markov model in conjunction with contextual analysis and simple spelling rules.

Kazama and colleagues used a support vector machine with the GENIA training set [16, 17]. The so-called B-I-O tags were used for the annotation: the B tag marks the word that begins a term, the I tag marks the following words that belong to the same term, and the O tag marks words outside any term. The tags were supplemented with the class of the molecular-genetic object. For instance, the tag B-protein was assigned to the first word of a protein name. The F-score value for this method was 50 %.
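The sketch below shows how a B-I-O tag sequence can be decoded into terms. It assumes the conventional reading of the scheme, in which B marks the first word of a term, I a subsequent word inside the same term, and O a word outside any term; the tag spelling (B-PROTEIN, I-PROTEIN), the function name decode_bio, and the example sentence are illustrative and not taken from the cited work.

```python
def decode_bio(tokens, tags):
    """Group tokens into (term, class) pairs according to their B-I-O tags."""
    entities = []
    current, current_type = [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A B tag always opens a new term, closing any term still open.
            if current:
                entities.append((" ".join(current), current_type))
            current, current_type = [token], tag[2:]
        elif tag.startswith("I-") and current and tag[2:] == current_type:
            # An I tag of the same class continues the open term.
            current.append(token)
        else:
            # An O tag (or an inconsistent I tag) closes the open term.
            if current:
                entities.append((" ".join(current), current_type))
            current, current_type = [], None
    if current:
        entities.append((" ".join(current), current_type))
    return entities


tokens = ["Expression", "of", "IL-2", "receptor", "alpha", "and", "p53", "was", "measured", "."]
tags = ["O", "O", "B-PROTEIN", "I-PROTEIN", "I-PROTEIN", "O", "B-PROTEIN", "O", "O", "O"]
entities = decode_bio(tokens, tags)
```

In a system of this kind the classifier predicts one tag per word, and a decoding step like this one turns the predicted tags back into multiword names.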

2.3 Recognition of Biological Objects with Dictionaries

These methods use thematic dictionaries to find biological object names by comparing the text with terms from the dictionary. The advantage of this approach over other methods is the fast classification of terms by type, with references to the various databases. The main disadvantages are the inability to recognize novel names and a high rate of false-positive results caused by short and nonunique names [18].

The BioThesaurus web-based system [19] (http://pir.georgetown.edu/pirwww/iprolink/biothesaurus.shtml) was designed for establishing interactions between genes and proteins in unstructured text. The system was based on a vocabulary compiled from different databases: UniProt [20]; NCBI resources devoted to genes and proteins [21], including Entrez Gene, RefSeq, and GenPept; genomic databases of model organisms such as MGD [22], SGD [23], RGD [24], FlyBase [25], and WormBase [26]; and some other sources. The dictionary contained about 2.8 million unique gene and protein names.
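A dictionary lookup of this kind might be sketched as follows. The three dictionary entries, the type and source labels attached to them, and the function name annotate are hypothetical; longest-match-first lookup with simple word-boundary checks is one possible design, not the method of any particular system.

```python
import re

# Hypothetical dictionary: lowercase name -> (object type, source database).
DICTIONARY = {
    "p53": ("protein", "example protein dictionary"),
    "myocilin": ("protein", "example protein dictionary"),
    "open-angle glaucoma": ("disease", "example disease list"),
}


def annotate(text, dictionary):
    """Return (name, info) for every dictionary name found in the text."""
    hits = []
    lowered = text.lower()
    # Try longer names first so that multiword names are not split up.
    for name in sorted(dictionary, key=len, reverse=True):
        # The name must not be embedded in a longer word or hyphenated token.
        pattern = r"(?<![\w-])" + re.escape(name) + r"(?![\w-])"
        for m in re.finditer(pattern, lowered):
            overlaps = any(m.start() < end and start < m.end() for start, end, *_ in hits)
            if not overlaps:
                hits.append((m.start(), m.end(), name, dictionary[name]))
    return [(name, info) for _, _, name, info in sorted(hits)]


found = annotate("Myocilin and p53 were studied in open-angle glaucoma; TP53BP1 was not.", DICTIONARY)
found_names = [name for name, _ in found]
```

The name “TP53BP1” is absent from the dictionary and is therefore ignored, which illustrates the inability of the approach to recognize names it has not seen.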

2.4 Recognition of Biological Objects with the Combining of Different Methods

Most modern systems for identifying the names of biological objects in texts combine several different approaches. For example, pattern-based methods are often combined with machine learning, which yields higher recall and precision. Tsuruoka and Tsujii combined dictionary search with machine learning methods [27]. In the first step (the recognition phase), the text was scanned with a dictionary for protein name candidates; the problem of spelling variation was solved with an approximate string-matching technique. In the second step (the filtering phase), a machine learning method checked whether each candidate was a protein name. The classifier was trained on the annotated GENIA corpus [28] and used the term itself and its context as classification features. Only “accepted” candidates were recognized as names of proteins. The F-measure (the harmonic mean of precision and recall) for this system was 70.2 %.
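The two-phase scheme can be outlined in Python under strong simplifications. The dictionary PROTEIN_NAMES, the cue words, and all function names are hypothetical; difflib.get_close_matches from the standard library stands in for the approximate string-matching technique, and a hand-written context rule stands in for the trained classifier, which is not reproduced here.

```python
import difflib

# Hypothetical protein dictionary (lowercase names).
PROTEIN_NAMES = ["interleukin-2", "myocilin", "importin-13"]

# Hypothetical context cues; a real system would learn features from a corpus.
CUE_WORDS = {"protein", "expression", "binds", "receptor"}


def candidates(tokens, names, cutoff=0.85):
    """Recognition phase: approximate dictionary lookup, tolerant of spelling variants."""
    found = []
    for i, token in enumerate(tokens):
        close = difflib.get_close_matches(token.lower(), names, n=1, cutoff=cutoff)
        if close:
            found.append((i, token, close[0]))
    return found


def accept(tokens, index):
    """Filtering phase: crude stand-in for a classifier over the term's context."""
    window = {t.lower() for t in tokens[max(0, index - 2): index + 3]}
    return bool(window & CUE_WORDS)


def recognize_proteins(sentence):
    """Return the dictionary names of the accepted candidates."""
    tokens = sentence.split()
    return [name for i, _, name in candidates(tokens, PROTEIN_NAMES) if accept(tokens, i)]


accepted = recognize_proteins(
    "Expression of interleukin2 protein was high while myosin was unchanged"
)
```

Here the spelling variant “interleukin2” is close enough to the dictionary form to become a candidate, whereas “myosin” is not similar enough to “myocilin” to pass the cutoff.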

Hakenberg and colleagues developed the GNAT system [29] (http://cbioc.eas.asu.edu/gnat/) for identifying the names of genes from various organisms in the abstracts of scientific publications. Gene names were identified with dictionaries (a separate dictionary was compiled for each of the 25 organisms) and machine learning methods. Noncanonical forms of names were found with finite-state automata, while canonical names were identified with dictionaries based on Entrez, GO, UniProt, and other databases. The F-score of the system was 81.4 % (the precision and recall were 90.8 % and 73.8 %, respectively).
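The reported values are mutually consistent if the F-score is taken as the balanced harmonic mean of precision and recall, F = 2PR / (P + R). The short check below assumes that definition; the function name f_score is illustrative.

```python
def f_score(precision, recall):
    """Balanced F-score: harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall reported for GNAT, in percent.
gnat_f = f_score(90.8, 73.8)
gnat_f_rounded = round(gnat_f, 1)
```

The harmonic mean lies closer to the smaller of the two values, so the score is pulled toward the recall of 73.8 % rather than sitting midway between the two figures.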

3 Systems for the Recognition of Interactions Between Biological Objects

The following methods are widely used for the automated extraction of information about molecular and genetic interactions between biological objects from the literature:

  • Methods based on the co-occurrence of objects in the text

  • Methods based on a set of rules and patterns (shallow parsing)

  • Methods based on a deep syntactic analysis of the separated sentences (full or deep parsing)

The co-occurrence method counts how often object names occur together in the text. It is assumed that the more often two objects are mentioned in the same text, the more likely they are to be related to each other. The main advantages of this method are its ease of implementation and high recall. On the other hand, its precision is not very high, and it does not identify the type of relationship between objects. Coremine Medical (http://www.coremine.com/medical) and FACTA [30] (http://text0.mib.man.ac.uk/software/facta/main.html) are examples of such systems. In the BRENDA database (http://www.brenda-enzymes.org), the co-occurrence method was used to extract data about associations between diseases and enzymes [31].

Shallow parsing extracts information from texts by using partial relations between words in a sentence, captured with a set of specific patterns and rules. SUISEKI (System for Information Extraction on Interactions) [32], designed for the automated analysis of the syntactic structure of phrases, and other developments for the extraction of protein interactions are based on this method. The core of the system is a set of rules that capture the different language constructions commonly used to describe interactions. The rules are implemented as frames of the form “[protein/gene] binds/associates/… [protein/gene]” as well as frames describing specific relations, such as “[noun indicating interaction] of [protein/gene] with [protein/gene].” Chilibot [33] (http://www.chilibot.net) extracts sentences related to a pair or a list of genes, proteins, or keywords from the abstracts of scientific publications and uses shallow parsing to classify the extracted sentences as noninteractive, interactive, or simple abstract co-occurrence.
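A minimal sketch of co-occurrence counting at the level of whole texts is shown below. The function name, the list of object names, and the three toy sentences are invented for illustration, and plain substring matching replaces proper name recognition.

```python
from collections import Counter
from itertools import combinations


def cooccurrence(docs, names):
    """Count, for each pair of names, the number of texts that mention both."""
    counts = Counter()
    for text in docs:
        lowered = text.lower()
        # Sort so that each pair is always stored in the same order.
        present = sorted(n for n in names if n.lower() in lowered)
        counts.update(combinations(present, 2))
    return counts


# Toy texts invented for illustration.
docs = [
    "MYOC variants were examined in glaucoma.",
    "MYOC expression was compared in glaucoma and myopia.",
    "PAX6 was examined in myopia.",
]
counts = cooccurrence(docs, ["MYOC", "PAX6", "glaucoma", "myopia"])
top_pair, top_count = counts.most_common(1)[0]
```

The counts say only that two names appear together, not how the objects are related, which matches the limitation noted above.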
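Frames of the two forms quoted above can be approximated with regular expressions, as in the following sketch. The entity list, the verb and noun lists, and the function name extract are hypothetical, and the example sentences are invented; the actual rule sets of the cited systems are considerably richer.

```python
import re

# Hypothetical list of already recognized protein/gene names.
PROTEINS = ["MYOC", "OLFM3", "SMAD3"]
ENTITY = "(" + "|".join(map(re.escape, PROTEINS)) + ")"

# Frame 1: "[protein/gene] binds/associates with/interacts with [protein/gene]"
VERB_FRAME = re.compile(
    ENTITY + r"\s+(binds|associates with|interacts with)\s+" + ENTITY
)
# Frame 2: "[noun indicating interaction] of [protein/gene] with [protein/gene]"
NOUN_FRAME = re.compile(
    r"(interaction|association|binding)\s+of\s+" + ENTITY + r"\s+with\s+" + ENTITY
)


def extract(sentence):
    """Return (object, interaction word, object) triples matched by either frame."""
    relations = []
    for m in VERB_FRAME.finditer(sentence):
        relations.append((m.group(1), m.group(2), m.group(3)))
    for m in NOUN_FRAME.finditer(sentence):
        relations.append((m.group(2), m.group(1), m.group(3)))
    return relations


verb_hit = extract("MYOC binds OLFM3 in this toy sentence.")
noun_hit = extract("We observed interaction of SMAD3 with MYOC.")
no_hit = extract("MYOC and OLFM3 were both measured.")
```

Unlike co-occurrence counting, the last sentence yields no relation even though two names occur in it, because no frame describing an interaction matches.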

Information extraction systems based on the full-sentence parsing approach tend to be more precise as they deal with the structure of an entire sentence, and variations of the full parsing-based approach have been applied for biomedical information retrieval. However, full parsers are significantly slower and require more memory than shallow analyses because they have to deal with general syntactic ambiguity and handle the full set of possible structures of whole sentences.

Full (deep) parsing is based on a description of the language with formal grammars. This approach is usually more accurate than shallow parsing because it works with the structure of an entire sentence. On the other hand, its main disadvantages are the full dependence on the quality and completeness of the training set and high memory requirements. The MedScan system [34] (http://www.elsevier.com/online-tools/pathway-studio/training-support#faqs) from Pathway Studio used a full syntactic parser to analyze the semantic and lexical structure of sentences and to search for interactions between various biological objects, including small molecules, genes, proteins, protein functional classes, diseases, and cell processes.

4 The ANDSystem Tool

The ANDSystem tool incorporates methods for the automated extraction of knowledge from PubMed abstracts of scientific publications and from factographic databases [35]. ANDSystem consists of three main modules: a module for linguistic text analysis and extraction of knowledge from text; the ANDCell database, which contains the results of knowledge extraction from PubMed in the form of associative networks; and the ANDVisio tool, which provides a graphical interface for ANDCell and is intended for the visualization and analysis of associative gene networks comprising relationships between biological processes, diseases, and molecular-genetic objects (proteins, genes, metabolites). The vertices of such networks are molecular-genetic objects, diseases, and processes, while the edges between the vertices represent types of associations. The following objects are considered: genes, proteins, microRNAs, metabolites, molecular processes and pathways, cellular components, and diseases (Fig. 6.1).

Fig. 6.1 The associative network of relationships between human genes and proteins associated with open-angle glaucoma and myopia, generated with ANDVisio

The following types of relationships are established between molecular-genetic objects: association, interaction, co-expression, treatment, catalytic reactions, conversion of molecules, degradation of a protein, regulation of gene expression, regulation of activity or function, regulation of transport, regulation of stability or degradation, and regulation of molecular-biological processes and diseases.
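An associative network with typed vertices and typed edges could be represented as in the sketch below. The class AssociativeNetwork, its methods, and the placeholder vertex names are assumptions for illustration and do not describe the internal design of ANDSystem; edges are treated as undirected for simplicity.

```python
from collections import defaultdict


class AssociativeNetwork:
    """Toy network: vertices carry an object type, edges a relationship type."""

    def __init__(self):
        self.vertex_type = {}
        self.edges = defaultdict(set)  # vertex -> {(neighbor, relationship type)}

    def add_vertex(self, name, kind):
        self.vertex_type[name] = kind

    def add_edge(self, a, b, relation):
        if a not in self.vertex_type or b not in self.vertex_type:
            raise KeyError("both endpoints must be added as vertices first")
        # Stored in both directions (assumption: undirected relationships).
        self.edges[a].add((b, relation))
        self.edges[b].add((a, relation))

    def neighbors(self, name, relation=None):
        """Sorted neighbors of a vertex, optionally restricted to one relationship type."""
        return sorted(n for n, r in self.edges[name] if relation is None or r == relation)


# Placeholder objects, not real biological facts.
net = AssociativeNetwork()
net.add_vertex("gene G", "gene")
net.add_vertex("protein P", "protein")
net.add_vertex("disease D", "disease")
net.add_edge("gene G", "protein P", "regulation of gene expression")
net.add_edge("protein P", "disease D", "association")

all_neighbors = net.neighbors("protein P")
associated = net.neighbors("protein P", relation="association")
```

Keeping the relationship type on each edge makes it possible to filter a network by one kind of relationship, for example to show only associations.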

The algorithms for the extraction of knowledge from text implemented in ANDSystem are based on dictionaries and templates [36]. The thesaurus of genes was compiled from the NCBI gene database; the protein dictionary was built from the Swiss-Prot database; the list of diseases was extracted from PharmGKB; metabolites were taken from the ChEBI database; biological processes and cellular components were obtained from Gene Ontology; and miRBase was used for microRNAs. The relationships between the described biological objects were extracted from text with the help of about 4,000 manually created templates. The resulting knowledge base now consists of over five million facts about relationships between diseases, molecular-genetic objects, and biological processes.

With ANDVisio, an associative network was built that describes the relationships between human genes and proteins associated with open-angle glaucoma and myopia [37]. The network contains 15 genes and 50 proteins that are associated with both myopia and open-angle glaucoma and over 400 relationships between them (Fig. 6.1). In this network, 26 pathways between myopia and open-angle glaucoma were identified that contain the most important objects and relationships, including the SMAD3, PAX6, IPO13, GCR, NOE3, and MYOC proteins and the OLFM3 gene.
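The text does not describe how the pathways between the two diseases were found. One possible reading is the enumeration of simple paths between the two disease vertices, which the sketch below illustrates on a toy graph; the function name simple_paths, the max_edges limit, and the vertices A, B, and C are hypothetical.

```python
def simple_paths(graph, start, goal, max_edges=3):
    """Enumerate paths without repeated vertices from start to goal."""
    paths = []
    stack = [(start, [start])]
    while stack:
        node, path = stack.pop()
        if node == goal:
            paths.append(path)
            continue
        if len(path) - 1 >= max_edges:
            continue  # path already uses the maximum number of edges
        for nxt in graph.get(node, ()):
            if nxt not in path:  # never revisit a vertex
                stack.append((nxt, path + [nxt]))
    # Shortest paths first, ties broken alphabetically.
    return sorted(paths, key=lambda p: (len(p), p))


# Toy adjacency lists; A, B, and C are placeholder objects.
graph = {
    "myopia": ["A", "B"],
    "A": ["myopia", "glaucoma", "B"],
    "B": ["myopia", "A", "C"],
    "C": ["B", "glaucoma"],
    "glaucoma": ["A", "C"],
}
paths = simple_paths(graph, "myopia", "glaucoma", max_edges=3)
```

Limiting the number of edges keeps the enumeration tractable, since the number of simple paths in a network with hundreds of relationships grows very quickly with path length.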