Abstract
Text mining is the technology of linguistic analysis of texts by computer methods.
Computer tools based on this technology can solve a wide range of tasks, including:
1. Finding literature relevant to user-specified criteria and determining the correspondence between a single article, or a manually selected set of articles, and the research area or a set of predesignated areas
2. Identifying and extracting the names of biological objects found in raw text (e.g., genes, proteins, metabolites), together with extra information on them, such as the type of object and its synonyms
3. Establishing relationships between objects that were automatically recognized in text and representing the obtained data in a form convenient for further analysis, for example, as associative networks
1 Systems for the Automated Search of Literature
Systems for the thematic search of the literature are extremely important in almost any kind of scientific research. Their main task is automated determination of the level of relevance between electronic publications and information of interest for specialists. The most common techniques for the development of such systems are the use of logic and vector models as well as mining techniques; often they are combined in order to improve the search.
1.1 Logical Query Model
Logical queries search for documents using user-specified keyword strings joined by the logical operators AND, OR, and NOT; each query is compared with all available documents. Depending on full or partial string matches and on the logical operators in the query, a document is classified as satisfying the query or not. For example, the query «p53 AND open-angle glaucoma» returns all documents that contain both the protein name “p53” and the disease name “open-angle glaucoma,” whereas the query «p53 NOT cancer» returns only documents that mention “p53” but do not mention “cancer.” The advantage of this method is its ease of implementation. Its main drawbacks are the lack of means for forming complex queries that would, for example, take relationships between objects into account, and the excessive redundancy of the search results [1].
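The two example queries can be illustrated with a minimal sketch. The four toy documents, the tokenizer, and the function names `tokenize` and `matches` are assumptions made for illustration; the sketch handles single-word terms only, so “glaucoma” stands in for the multiword “open-angle glaucoma,” and it does not reproduce how any particular search engine evaluates queries.

```python
import re

def tokenize(text):
    """Lowercase a document and return the set of its word tokens."""
    return set(re.findall(r"[a-z0-9\-]+", text.lower()))

def matches(doc_tokens, required=(), excluded=()):
    """True if every required term is present and no excluded term is (AND / NOT)."""
    has_all = all(term in doc_tokens for term in required)
    has_none = not any(term in doc_tokens for term in excluded)
    return has_all and has_none

# Hypothetical toy collection
docs = {
    "d1": "p53 mutations are frequent in cancer",
    "d2": "p53 expression in glaucoma patients",
    "d3": "myocilin and glaucoma",
    "d4": "p53 regulates apoptosis",
}
index = {doc_id: tokenize(text) for doc_id, text in docs.items()}

# «p53 AND glaucoma»
and_hits = sorted(d for d, toks in index.items()
                  if matches(toks, required=["p53", "glaucoma"]))
# «p53 NOT cancer»
not_hits = sorted(d for d, toks in index.items()
                  if matches(toks, required=["p53"], excluded=["cancer"]))

print(and_hits)  # ['d2']
print(not_hits)  # ['d2', 'd4']
```

The sketch also shows the limitation named above: the query can only test for the presence or absence of terms and says nothing about how “p53” and the disease are related within a document.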
1.2 Vector Query Model
The vector query algorithm was proposed by Joyce and Needham [2]. It is based on the idea that similar documents should meet similar requirements.
The algorithm represents each document as a mathematical vector of terms in which each component is the number of occurrences of the corresponding term in the text. These vectors are compared with a control vector formed from the user-specified query, which establishes the extent to which each article corresponds to the given subject area. This approach also makes it possible to group similar documents into clusters, which can significantly improve the speed and quality of the search. The first vector query algorithm was implemented by Salton in the SMART (Salton’s Magical Automatic Retriever of Text) search engine [3].
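As a minimal sketch of this idea, the snippet below builds raw term-frequency vectors and ranks documents by cosine similarity to the query vector. Cosine similarity is one common choice of comparison and is an assumption here; the text does not specify which measure SMART used. The toy documents and function names are likewise hypothetical.

```python
import math
from collections import Counter

def term_vector(text):
    """Represent a text as a term -> frequency mapping (whitespace tokens)."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine of the angle between two sparse term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Hypothetical toy collection
docs = {
    "d1": "glaucoma gene myoc glaucoma",
    "d2": "yeast cell cycle gene",
}
query = term_vector("glaucoma gene")  # the control vector built from the query

scores = {d: cosine(query, term_vector(text)) for d, text in docs.items()}
ranked = sorted(docs, key=scores.get, reverse=True)
print(ranked)  # ['d1', 'd2']
```

Unlike a logical query, the result is a graded ranking: `d2` shares only one term with the query and receives a lower, but non-zero, score.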
1.3 Mining Methods
These methods include:
- Statistical correlation methods, intended for forming rules that relate documents to prespecified categories [4].
- Clustering methods, which group a document set by different semantic attributes using linguistic and mathematical methods without a priori knowledge; the result of such an analysis is a taxonomy of documents or a visual map that provides effective coverage of large amounts of data [5].
- Relationship analysis methods, which identify descriptors (key phrases) in documents and thereby provide flexible navigation in the text [6].
- Fact identification methods, designed to extract knowledge in order to improve the classification, retrieval, and clustering of documents [7].
1.4 Existing Search Systems
The Entrez system [8] (http://www.ncbi.nlm.nih.gov/sites/gquery) searches the biological databases supported by NCBI, such as PubMed, GenBank, Structure, and Genome. It is based on a combination of vector and logical query models as well as mining techniques.
Muller and his colleagues developed Textpresso [9] (http://www.textpresso.org), a search engine specialized for Caenorhabditis elegans that includes over 3,800 full-text articles and 16,000 abstracts. It is based on a modified vector method: the articles are first split into individual sentences and into terms relevant to C. elegans, which are stored in the database, and search queries are split into words. This approach allowed the authors to improve search quality in comparison with the classical vector query method.
PubMatrix [10] (http://pubmatrix.grc.nia.nih.gov) is a system that searches the PubMed database by comparing a user-specified set of terms, such as gene or protein names, with a set of their functions. As a result, the system provides a list of abstracts of scientific publications that link these genes or proteins to their functions.
2 Systems for Automated Identification of Biological Objects in Texts
Krauthammer and Nenadic distinguished three stages of automatic recognition of biological objects [11]:
- Extraction of names, synonyms, and abbreviations in the unstructured text
- Identification and establishment of relationships between objects
- Representation of obtained information in the formalized form
There are three main approaches for automated identification of the names of biological objects in text:
- With the use of rules and templates
- With the use of statistical and machine learning methods
- With the use of thematic vocabularies
2.1 Methods for Identification of the Biological Objects with the Use of Rules and Templates
These methods are based on a set of regular expressions (rules or patterns) that are normally written manually by specialists [12] and are intended to identify terms by their syntactic and semantic features. Ananiadou and McNaught concluded that systems implemented with these methods can achieve better-quality results than other approaches [13]. The main disadvantage of template-based methods is their poor quality in the analysis of complex sentences.
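A rule of this kind can be sketched with a regular expression. The two orthographic rules below (a short run of capital letters followed by digits, or a lowercase “p” followed by digits) are hypothetical rules written for this illustration, not the rules of any published system, and they would both miss many real names and produce false positives on other capitalized codes.

```python
import re

# Hypothetical hand-written rules for gene/protein-like symbols:
#   rule 1: 2-5 capital letters followed by 1-2 digits (e.g., SMAD3, PAX6)
#   rule 2: lowercase "p" followed by 2-3 digits (e.g., p53)
SYMBOL_RULES = re.compile(r"\b(?:[A-Z]{2,5}\d{1,2}|p\d{2,3})\b")

def find_symbols(text):
    """Return all substrings that match one of the rules, in reading order."""
    return [m.group(0) for m in SYMBOL_RULES.finditer(text)]

sentence = "SMAD3 and PAX6 interact with p53 in the retina."
symbols = find_symbols(sentence)
print(symbols)  # ['SMAD3', 'PAX6', 'p53']
```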
2.2 Recognition of Biological Entities with the Use of Statistical Algorithms and Machine Learning Methods
Statistical approaches identify terms by the frequency of their occurrence in text; these methods are effective for keyword extraction. Systems based on machine learning methods are designed to search for specified classes of terms in text and make it possible to identify objects and classify them directly with the help of “training samples.” These samples are used to “train” the method so that it can recognize and classify objects with high quality for a specific biological problem. The main problems of machine learning methods are the poor availability of training samples and the need for a large amount of high-quality data [13].
Collier and colleagues used a hidden Markov model and automated analysis of orthographic word features to extract terms belonging to ten predesignated classes [14]. The results of this system depended strongly on the quality of the training samples. Thus, for the class of proteins, the F-score was 75.9 %, while the F-score for RNA was much lower because of the low representation of RNA in the training set. Similar results (an F-score of 75 %) were obtained by Morgan and colleagues in the analysis of gene names for the genus Drosophila (small flies) [15]. They used a hidden Markov model in conjunction with contextual analysis and simple spelling rules.
Kazama and colleagues applied support vector machines trained on the GENIA corpus [16, 17]. The so-called B-I-O tags were used for annotation: the B tag marks the word that begins a term, the I tag marks the subsequent words inside the term, and the O tag marks words outside any term. The tags were supplemented with the class of the molecular-genetic object; for instance, the tag B-protein was assigned to the first word of a protein name. The F-score for this method was 50 %.
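The tagging scheme can be illustrated with a short sketch. It follows the widely used B-I-O convention, in which B marks the first token of a term, I marks the following tokens inside the same term, and O marks tokens outside any term. The sentence, the annotated spans, and the function name `bio_tags` are assumptions for illustration; this shows only how training labels are laid out, not the classifier itself.

```python
def bio_tags(tokens, spans):
    """Convert annotated term spans into one B-I-O tag per token.

    spans: list of (start, end, label) token index ranges, end exclusive.
    """
    tags = ["O"] * len(tokens)          # default: outside any term
    for start, end, label in spans:
        tags[start] = f"B-{label}"      # first token of the term
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"      # remaining tokens of the term
    return tags

# Hypothetical annotated sentence
tokens = ["The", "tumor", "protein", "p53", "binds", "DNA"]
spans = [(1, 4, "protein"), (5, 6, "DNA")]

tags = bio_tags(tokens, spans)
for token, tag in zip(tokens, tags):
    print(f"{token}\t{tag}")
```

A sequence classifier is then trained to predict one such tag per token, which turns term recognition into a per-word classification problem.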
2.3 Recognition of Biological Objects with Dictionaries
These methods use thematic dictionaries to search for biological object names by comparing the text with the terms in the dictionary. The advantage of this approach over other methods is the fast classification of terms by type together with their links to various databases. The main disadvantages are the inability to recognize novel names and a high rate of false-positive results caused by short and nonunique names [18].
The BioThesaurus web-based system [19] (http://pir.georgetown.edu/pirwww/iprolink/biothesaurus.shtml) was designed for establishing interactions between genes and proteins in unstructured text. The system was based on a vocabulary compiled from different databases: UniProt [20]; NCBI resources devoted to genes and proteins [21], including Entrez Gene, RefSeq, and GenPept; genomic databases of model organisms such as MGD [22], SGD [23], RGD [24], FlyBase [25], and WormBase [26]; and some other sources. The dictionary contained about 2.8 million unique gene and protein names.
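A minimal sketch of dictionary lookup is given below. It scans the text for the longest word sequence found in the dictionary and returns each hit with its type. The three-entry dictionary, the `max_len` limit, and the function name `find_terms` are assumptions for illustration; real dictionaries hold far more entries and link each one to database identifiers.

```python
def find_terms(text, dictionary, max_len=3):
    """Greedy longest-match lookup of dictionary terms in whitespace tokens."""
    tokens = text.split()
    found = []
    i = 0
    while i < len(tokens):
        # Try the longest candidate first, then shorter ones
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            candidate = " ".join(tokens[i:i + n]).lower().strip(".,;")
            if candidate in dictionary:
                found.append((candidate, dictionary[candidate]))
                i += n
                break
        else:
            i += 1  # no dictionary term starts at this token
    return found

# Hypothetical toy dictionary: term -> object type
dictionary = {
    "myocilin": "protein",
    "open-angle glaucoma": "disease",
    "p53": "protein",
}

hits = find_terms("Mutations in myocilin cause open-angle glaucoma.", dictionary)
print(hits)  # [('myocilin', 'protein'), ('open-angle glaucoma', 'disease')]
```

A name that is absent from the dictionary is simply not reported, which is the inability to recognize novel names mentioned above.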
2.4 Recognition of Biological Objects with the Combining of Different Methods
Today most modern systems designed for identifying the names of biological objects in texts combine several different approaches. A popular example is the combination of pattern-based methods with machine learning, which makes it possible to achieve higher recall and precision. Tsuruoka and Tsujii combined dictionary search with machine learning methods [27]. In the first step (recognition phase), the text was scanned with a dictionary for protein name candidates. The problem of spelling variation was solved with an approximate string-matching technique. In the second step (filtering phase), a machine learning method checked whether each candidate was a protein name or not. The classifier was trained on the annotated GENIA corpus [28] and used the term itself and its context as features for classification. Only “accepted” candidates were recognized as protein names. The F-measure (the harmonic mean of precision and recall) for this system was 70.2 %.
Hakenberg and colleagues developed the GNAT system [29] (http://cbioc.eas.asu.edu/gnat/) for identifying the names of genes from various organisms in abstracts of scientific publications. Gene names were identified with dictionaries (a separate dictionary was compiled for each of the 25 organisms) and machine learning methods. Noncanonical forms of names were searched for with finite-state automata, while canonical names were identified with dictionaries based on Entrez, GO, UniProt, and other databases. The F-score of the system was 81.4 % (the precision and recall were 90.8 % and 73.8 %, respectively).
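The role of approximate string matching in a recognition phase can be sketched with the `difflib` module of the Python standard library. This is a swapped-in similarity measure chosen for illustration, not the technique used by Tsuruoka and Tsujii; the dictionary entries, the `cutoff` value, and the function name `candidate_name` are likewise assumptions, and the machine learning filtering phase is not shown.

```python
import difflib

# Hypothetical dictionary of protein names (lowercase)
PROTEIN_NAMES = ["interleukin-2", "myocilin", "tumor necrosis factor"]

def candidate_name(mention, names=PROTEIN_NAMES, cutoff=0.85):
    """Return the closest dictionary name for a mention, or None if none is close enough."""
    hits = difflib.get_close_matches(mention.lower(), names, n=1, cutoff=cutoff)
    return hits[0] if hits else None

# Spelling variants still map to their dictionary entries
print(candidate_name("Interleukin 2"))  # interleukin-2
print(candidate_name("myocillin"))      # myocilin
# An unrelated word is not proposed as a candidate
print(candidate_name("kinase"))         # None
```

Lowering `cutoff` raises recall at the cost of more spurious candidates, which is why a subsequent filtering step is useful.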
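The relation between the reported values can be checked directly: the F-score is the harmonic mean of precision and recall. The function name `f_score` is an assumption for illustration; the input values are the precision and recall reported for GNAT.

```python
def f_score(precision, recall):
    """Harmonic mean of precision and recall (balanced F-score)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Precision and recall reported for GNAT, in percent
gnat_f = f_score(90.8, 73.8)
print(round(gnat_f, 1))  # 81.4
```

Because the harmonic mean is pulled toward the smaller of the two values, a system cannot reach a high F-score with high precision alone.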
3 Systems for the Recognition of Interactions Between Biological Objects
The following methods are widely used for the automated extraction of information about molecular and genetic interactions between biological objects from the literature:
- Methods based on the co-occurrence of objects in the text
- Methods based on a set of rules and patterns (shallow parsing)
- Methods based on a deep syntactic analysis of individual sentences (full or deep parsing)
The co-occurrence method is based on calculating how often object names occur together in the text. It is assumed that the more often two objects are mentioned in the same text, the more likely they are to be related to each other. The main advantages of this method are its ease of implementation and high recall. On the other hand, its precision is not very high, and it does not identify the type of relationship between objects. Coremine Medical (http://www.coremine.com/medical) and FACTA [30] (http://text0.mib.man.ac.uk/software/facta/main.html) are examples of such systems. In the BRENDA database (http://www.brenda-enzymes.org), the co-occurrence method was used to extract data about associations between diseases and enzymes [31].
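A minimal sketch of co-occurrence counting follows. It assumes that object names have already been recognized in each text, so every document is given as a list of names; the three toy documents and the function name `cooccurrence_counts` are hypothetical.

```python
from collections import Counter
from itertools import combinations

def cooccurrence_counts(documents):
    """Count, for every pair of objects, the number of documents mentioning both."""
    counts = Counter()
    for entities in documents:
        # Each unordered pair is counted once per document
        for pair in combinations(sorted(set(entities)), 2):
            counts[pair] += 1
    return counts

# Hypothetical documents, each reduced to the object names recognized in it
documents = [
    ["MYOC", "glaucoma"],
    ["MYOC", "glaucoma", "PAX6"],
    ["PAX6", "myopia"],
]

counts = cooccurrence_counts(documents)
print(counts[("MYOC", "glaucoma")])  # 2
print(counts[("MYOC", "myopia")])    # 0
```

The counts indicate that two objects may be related but carry no information about the type of relationship, which matches the limitation described above.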
Shallow parsing extracts information from texts by using partial relations between the words of a sentence, captured by a set of specific patterns and rules. SUISEKI (System for Information Extraction on Interactions) [32], designed for the automated analysis of the syntactic structure of phrases, and other tools for the extraction of protein interactions are based on this method. The core of the system is a set of rules that capture different language constructions commonly used to describe interactions. The rules are implemented as frames of the form “[protein/gene] binds/associates/… [protein/gene]” as well as forms describing specific relations, such as “[noun indicating interaction] of [protein/gene] with [protein/gene].” Chilibot [33] (http://www.chilibot.net) extracts sentences related to a pair or a list of genes, proteins, or keywords from abstracts of scientific publications and uses shallow parsing to classify the extracted sentences as noninteractive, interactive, or simple abstract co-occurrence.
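A frame of the first form can be approximated with a regular expression, as in the sketch below. The verb list, the set of known protein names, and the function name `extract_interactions` are assumptions for illustration and are not the actual SUISEKI rules.

```python
import re

# Hypothetical set of names already recognized as proteins/genes
PROTEINS = {"SMAD3", "PAX6", "MYOC"}

# Frame "[protein/gene] <interaction verb> [protein/gene]"
FRAME = re.compile(r"\b(\w+) (binds|associates with|interacts with) (\w+)\b")

def extract_interactions(sentence):
    """Return (subject, verb, object) triples where both arguments are known proteins."""
    triples = []
    for subj, verb, obj in FRAME.findall(sentence):
        if subj in PROTEINS and obj in PROTEINS:
            triples.append((subj, verb, obj))
    return triples

simple = extract_interactions("We show that SMAD3 binds PAX6 in lens cells.")
print(simple)  # [('SMAD3', 'binds', 'PAX6')]

# An intervening clause breaks the flat pattern, so nothing is extracted
nested = extract_interactions("SMAD3, which is phosphorylated, binds PAX6.")
print(nested)  # []
```

The second call shows why flat patterns struggle with complex sentences: the subject and the verb are separated by a clause that the frame does not anticipate.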
Information extraction systems based on the full-sentence parsing approach tend to be more precise as they deal with the structure of an entire sentence, and variations of the full parsing-based approach have been applied for biomedical information retrieval. However, full parsers are significantly slower and require more memory than shallow analyses because they have to deal with general syntactic ambiguity and handle the full set of possible structures of whole sentences.
Full (deep) parsing is based on describing the language with formal grammars. Such an approach is usually more accurate than shallow parsing because it works with the structure of the entire sentence. On the other hand, its main disadvantages are strong dependence on the quality and completeness of the training set and high memory requirements. The MedScan system [34] (http://www.elsevier.com/online-tools/pathway-studio/training-support#faqs) from Pathway Studio used a full syntactic parser to analyze the semantic and lexical structure of sentences and to search for interactions between various biological objects, including small molecules, genes, proteins, protein functional classes, diseases, and cell processes.
4 The ANDSystem Tool
The ANDSystem tool incorporates methods for the automated extraction of knowledge from PubMed abstracts of scientific publications and from factographic databases [35]. ANDSystem consists of three main modules: a module for linguistic text analysis and extraction of knowledge from text; the ANDCell database, which contains the results of knowledge extraction from PubMed in the form of associative networks; and the ANDVisio tool, which provides a graphical interface to ANDCell and is intended for the graphical visualization and analysis of associative gene networks comprising relationships between biological processes, diseases, and molecular-genetic objects (proteins, genes, metabolites). The vertices of such networks are molecular-genetic objects, diseases, and processes, while the edges between the vertices represent types of associations. The following objects are considered: genes, proteins, microRNAs, metabolites, molecular processes and pathways, cellular components, and diseases (Fig. 6.1).
The following types of relationships are established between molecular-genetic objects: association, interactions, co-expression, treatment, catalytic reactions, conversion of molecules, degradation of a protein, regulation of gene expression, regulation of activity or function, regulation of transport, regulation of stability or degradation, and regulation of molecular-biological processes and diseases.
The algorithms for knowledge extraction from text implemented in ANDSystem are based on dictionaries and templates [36]. The thesaurus of genes was compiled from the NCBI Gene database; the protein dictionary was built from the Swiss-Prot database; the list of diseases was extracted from PharmGKB; metabolites were taken from the ChEBI database; biological processes and cellular components were obtained from Gene Ontology; and miRBase was used for microRNAs. Relationships between these biological objects were extracted from text with the help of about 4,000 manually created templates. The resulting knowledge base now consists of over five million facts about relationships between diseases, molecular-genetic objects, and biological processes.
With ANDVisio, an associative network was built that describes the relationships between human genes and proteins associated with open-angle glaucoma and myopia [37]. The network contains 15 genes and 50 proteins that are associated with both myopia and open-angle glaucoma, and over 400 relationships between them (Fig. 6.1). The analysis identified 26 pathways between myopia and open-angle glaucoma containing the most important objects and relationships, including the SMAD3, PAX6, IPO13, GCR, NOE3, and MYOC proteins and the OLFM3 gene.
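The structure of such an associative network can be sketched as a graph with typed vertices and typed edges. The class below is a hypothetical illustration and does not reflect the internal design of ANDCell or ANDVisio; the few “association” edges are toy entries based only on the objects named above as being associated with both diseases, plus one invented control node labeled as such.

```python
from collections import defaultdict

class AssociativeNetwork:
    """Undirected graph with typed vertices and typed edges (illustrative only)."""

    def __init__(self):
        self.node_types = {}             # object name -> object type
        self.edges = defaultdict(set)    # object name -> {(neighbor, relation)}

    def add_node(self, name, node_type):
        self.node_types[name] = node_type

    def add_edge(self, source, target, relation):
        self.edges[source].add((target, relation))
        self.edges[target].add((source, relation))

    def neighbors(self, name):
        """Names of all objects directly linked to the given object."""
        return {neighbor for neighbor, _ in self.edges[name]}

    def shared_neighbors(self, first, second):
        """Objects linked to both given objects."""
        return self.neighbors(first) & self.neighbors(second)

net = AssociativeNetwork()
for disease in ("myopia", "open-angle glaucoma"):
    net.add_node(disease, "disease")
for protein in ("MYOC", "PAX6"):
    net.add_node(protein, "protein")
    net.add_edge(protein, "myopia", "association")
    net.add_edge(protein, "open-angle glaucoma", "association")

# Invented control node linked to one disease only (not taken from the text)
net.add_node("protein X", "protein")
net.add_edge("protein X", "myopia", "association")

shared = sorted(net.shared_neighbors("myopia", "open-angle glaucoma"))
print(shared)  # ['MYOC', 'PAX6']
```

Finding the objects shared by two diseases then reduces to a set intersection over their neighbors; pathways between the diseases would correspond to longer paths in the same graph.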
References
Shatkay H, Wilbur WJ (2000) Finding themes in medline documents: probabilistic similarity search. In: Hoppenbro J, Souza Lima T, Papazoglou M, Sheth A (eds) Proceedings IEEE advances in digital libraries 2000, Washington DC, May 2000, pp 183–192
Joyce T, Needham RM (1997) The thesaurus approach to information retrieval. American documentation (1958) 9:192–197. In: Sparck Jones K, Willet P (eds) Readings in information retrieval. Morgan Kaufmann Publishers Inc, California (1997), pp 15–20
Salton G (1968) Automatic information organization and retrieval. McGraw Hill, New York
Sebastiani F (1999) Machine learning in automated text categorization. Technical report IEI-B4-31-1999, Istituto di Elaborazione dell’Informazione. CNR, Pisa
Kirichenko KM, Gerasimov MB (2001) A review of methods for clustering text documents. In: Proceedings of the international conference Dialog, vol 2, Aksakovo, 2001 (in Russian)
Gavrilova TA, Khoroshevsky VF (2000) Knowledge bases of intelligent systems. Textbook. Piter, Saint Petersburg (in Russian)
Ilyin N, Kiselev S, Tankov S, Ryabyshkin V (2006) Technologies for extracting knowledge from text. Otkrytye Sistemy (Open Systems) 6 (in Russian)
Schuler G, Epstein J, Ohkawa H, Kans J (1996) Entrez: molecular biology database and retrieval system. Methods Enzymol 266:141–162
Muller HM, Kenny EE, Sternberg PW (2004) Textpresso: an ontology-based information retrieval and extraction system for biological literature. PLoS Biol 2:309
Becker K et al (2003) PubMatrix: a tool for multiplex literature mining. BMC Bioinforma 4:61
Krauthammer M, Nenadic G (2004) Term identification in the biomedical literature. J Biomed Inform 37:512–526
Krallinger M, Morgan A, Smith L, Leitner F, Tanabe L, Wilbur J, Hirschman L, Valencia A (2008) Evaluation of text mining systems for biology: overview of the Second BioCreative community challenge. Genome Biol 9(2):1
Ananiadou S, McNaught J (eds) (2006) Text mining for biology and biomedicine. Artech House, Norwood
Collier N, Nobata C, Tsujii J (2000) Extracting the names of genes and gene products with a hidden Markov model. In: Proceedings of COLING 2000, Saarbruecken, pp 201–207
Morgan A, Yeh A, Hirschman L, Colosimo M (2003) Gene name extraction using FlyBase resources. In: Proceedings of NLP in biomedicine. ACL 2003, Sapporo, pp 1–8
Kazama J, Makino T, Ohta Y, Tsujii J (2002) Tuning support vector machines for biomedical named entity recognition. In: ACL-02 workshop on natural language processing in biomedical applications, Pennsylvania, July 2002
Kim JD, Ohta T, Tateisi Y, Tsujii J (2003) GENIA corpus – a semantically annotated corpus for bio-textmining. Bioinformatics 19(1):180–182
Cohen KB, Hunter L (2005) Natural language processing and systems biology. In: Dubitzky W, Azuaje F (eds) Artificial intelligence and systems biology. Springer, Dordrecht
Liu H, Hu ZZ, Zhang J, Wu C (2006) BioThesaurus: a web-based thesaurus of protein and gene names. Bioinformatics 22:103–105
Bairoch A, Apweiler R, Wu CH et al (2007) The Universal Protein Resource (UniProt). Nucleic Acids Res 35:193–197
Wheeler D, Church D, Federhen S et al (2003) Database resources of the National Center for Biotechnology. Nucleic Acids Res 31:28–33
Eppig JT et al (2005) The Mouse Genome Database (MGD): from genes to mice — a community resource for mouse biology. Nucleic Acids Res 33:471–475
Christie KR et al (2004) Saccharomyces Genome Database (SGD) provides tools to identify and analyze sequences from Saccharomyces cerevisiae and related sequences from other organisms. Nucleic Acids Res 32:311–314
De la Cruz N et al (2005) The Rat Genome Database (RGD): developments towards a phenome database. Nucleic Acids Res 33:485–491
Drysdale RA, Crosby MA (2005) FlyBase: genes and gene models. Nucleic Acids Res 33:390–395
Chen N et al (2005) WormBase: a comprehensive data resource for Caenorhabditis biology and genomics. Nucleic Acids Res 33:383–389
Tsuruoka Y, Tsujii J (2003) Boosting precision and recall of dictionary-based protein name recognition. In: Ananiadou S, Tsujii J (eds) Proceedings of the ACL 2003 workshop on natural language processing in biomedicine, Stroudsburg, July 2003, vol 13. Association for Computational Linguistics, Stroudsburg, pp 41–48
Ohta T, Tateishi Y, Mima H, Tsujii J (2002) Genia corpus: an annotated research abstract corpus in molecular biology domain. In: Proceedings of the human language technology conference, San Diego, March 2002
Hakenberg J et al (2008) Inter-species normalization of gene mentions with Gnat. Bioinformatics 24:126–132
Tsuruoka Y, Tsujii J, Ananiadou S (2008) FACTA: a text search engine for finding associated biomedical concepts. Bioinformatics 24(21):2559–2560
Scheer M, Grote A, Chang A et al (2011) BRENDA, the enzyme information system in 2011. Nucleic Acids Res 39:670–676
Blaschke C, Valencia A (2001) The potential use of SUISEKI as a protein interaction discovery tool. Genome Inform 12:123–134
Chen H, Sharp BM (2004) Content-rich biological network constructed by mining PubMed abstracts. BMC Bioinforma 5:147
Nikitin A, Egorov S, Daraselia N, Mazo I (2003) Pathway studio – the analysis and navigation of molecular networks. Bioinformatics 19(16):2155–2157
Demenkov PS, Ivanisenko TV, Kolchanov NA, Ivanisenko VA (2012) ANDVisio: a new tool for graphic visualization and analysis of literature mined associative gene networks in the ANDSystem. In Silico Biol 11(3):149–161
Demenkov PS, Aman EE, Ivanisenko VA (2008) Associative network discovery (AND) – the computer system for automated reconstruction networks of associative knowledge about molecular-genetic interactions. Comput Technol 13(2):15–19
Podkolodnaya OA, Yarkova EE, Demenkov PS, Konovalova OS, Ivanisenko VA, Kolchanov NA (2011) Application of the ANDCell computer system to reconstruction and analysis of associative networks describing potential relationships between myopia and glaucoma. Russ J Genet 1(1):21–28
© 2014 Springer-Verlag Berlin Heidelberg
Ivanisenko, T.V., Demenkov, P.S., Ivanisenko, V.A. (2014). Text Mining on PubMed. In: Chen, M., Hofestädt, R. (eds) Approaches in Integrative Bioinformatics. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-41281-3_6
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-41280-6
Online ISBN: 978-3-642-41281-3
eBook Packages: Computer Science (R0)