Abstract
Proteins perform their functions by interacting with other proteins. Knowledge of protein–protein interactions (PPIs) is critical for understanding the functions of individual proteins, the mechanisms of biological processes, and disease mechanisms. High-throughput experiments have accumulated a huge number of PPIs in PubMed articles, and extracting them at that scale is feasible only through automated approaches. The standard text-mining protocol includes four major tasks: recognizing protein mentions, normalizing protein names and aliases to unique identifiers such as gene symbols, extracting PPIs, and visualizing the PPI network using Cytoscape or other visualization tools. Each task is challenging and has been refined over several years to improve performance. We present a protocol based on our hybrid approaches and show that each task can be offered as an independent web-based tool: NAGGNER for protein name recognition, ProNormz for protein name normalization, PPInterFinder for PPI extraction, and HPIminer for PPI network visualization. The protocol is specific to human proteins but can be generalized to other organisms. We also include KinderMiner, our most recent text-mining tool, which predicts PPIs by retrieving significantly co-occurring protein pairs. Its algorithm is simple, easy to implement, and generalizable to other biological challenges.
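The idea of retrieving significantly co-occurring protein pairs can be illustrated with a minimal sketch. The abstract does not state which statistical test is used, so the code below assumes a one-sided Fisher's exact test (a hypergeometric tail) on article counts. The function names, the protein labels P1 to P3, and every count are hypothetical toy values chosen for illustration; they are not taken from the chapter or from any real corpus.

```python
from math import comb


def cooccurrence_p_value(n_both, n_a, n_b, n_total):
    """One-sided p-value for seeing at least n_both co-mentions by chance.

    Assumption: articles mentioning protein B are treated as a random draw
    of size n_b from n_total articles, n_a of which mention protein A.
    """
    max_both = min(n_a, n_b)
    denominator = comb(n_total, n_b)
    # Sum the hypergeometric probabilities for k = n_both .. max_both.
    tail = sum(
        comb(n_a, k) * comb(n_total - n_a, n_b - k)
        for k in range(n_both, max_both + 1)
    )
    return tail / denominator


def significant_pairs(pair_counts, mention_counts, n_total, alpha=0.05):
    """Return (protein_a, protein_b, p_value) tuples with p < alpha, best first."""
    results = []
    for (a, b), n_both in pair_counts.items():
        p = cooccurrence_p_value(n_both, mention_counts[a], mention_counts[b], n_total)
        if p < alpha:
            results.append((a, b, p))
    return sorted(results, key=lambda item: item[2])


# Hypothetical toy corpus: 1000 articles with made-up mention counts.
N_ARTICLES = 1000
MENTIONS = {"P1": 50, "P2": 40, "P3": 200}
CO_MENTIONS = {("P1", "P2"): 20, ("P1", "P3"): 10}

for a, b, p in significant_pairs(CO_MENTIONS, MENTIONS, N_ARTICLES):
    print(f"{a}-{b}: p = {p:.3g}")
```

In this toy corpus, P1 and P3 co-occur in about as many articles as chance alone would predict, so only the P1 and P2 pair passes the threshold. A real application would take the counts from a literature index and would need to correct for testing many pairs.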
Acknowledgments
K.R., F.K., J.S., J.T., and R.S. acknowledge funding from the Morgridge Institute for Research and a grant from Marv Conney. I.R. acknowledges the GeoDeepDive Infrastructure, funded by NSF ICER 1343760.
Copyright information
© 2020 Springer Science+Business Media, LLC, part of Springer Nature
About this protocol
Cite this protocol
Raja, K. et al. (2020). Automated Extraction and Visualization of Protein–Protein Interaction Networks and Beyond: A Text-Mining Protocol. In: Canzar, S., Ringeling, F. (eds) Protein-Protein Interaction Networks. Methods in Molecular Biology, vol 2074. Humana, New York, NY. https://doi.org/10.1007/978-1-4939-9873-9_2
DOI: https://doi.org/10.1007/978-1-4939-9873-9_2
Publisher Name: Humana, New York, NY
Print ISBN: 978-1-4939-9872-2
Online ISBN: 978-1-4939-9873-9
eBook Packages: Springer Protocols