ABSTRACT
Information technology is advancing faster than anticipated. The amount of data captured and stored in electronic form far exceeds the capabilities available for comprehensive analysis and effective knowledge discovery. There is a constant need for new, sophisticated techniques that can extract more of the knowledge hidden in the raw data collected continuously in huge repositories. Biomedicine and computational biology form one of the domains overwhelmed with huge amounts of data that should be carefully analyzed for valuable knowledge that may help uncover much of the still unknown information related to the various diseases threatening the human body. Biomarker detection is one of the areas that has received considerable attention in the research community. Two sources of data can be analyzed for biomarker detection, namely gene expression data and the rich literature related to the domain. Our research group has reported achievements in analyzing both sources. In this paper, we concentrate on the latter by describing a powerful tool that is capable of: extracting from the content of a repository (like PubMed) the parts related to a given specific domain, such as cancer; analyzing the retrieved text to extract the key terms with high frequency; presenting the extracted terms to domain experts, who select those most relevant to the investigated domain; retrieving from the analyzed text the molecules related to the domain by considering the relevant terms; and deriving the network, which is then analyzed to identify potential biomarkers. For the work described in this paper, we considered PubMed and extracted abstracts related to prostate and breast cancer. The reported results are promising; they demonstrate the effectiveness and applicability of the proposed approach.
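The pipeline outlined in the abstract can be illustrated with a minimal Python sketch. It is not the authors' tool: the toy abstracts, the stopword list, the molecule dictionary, the hard-coded "expert-selected" term, and the use of weighted degree as a ranking score are all assumptions made for illustration, and no real PubMed query is issued.

```python
import re
from collections import Counter
from itertools import combinations

# Hypothetical stopword list; a real system would use a much larger one.
STOPWORDS = {"the", "of", "in", "and", "with", "is", "a", "to", "for", "are"}

# Toy stand-ins for abstracts retrieved from a repository such as PubMed.
ABSTRACTS = [
    "BRCA1 and TP53 mutations in breast cancer tumours.",
    "TP53 and AR expression in prostate cancer.",
    "BRCA1 interacts with TP53 in cancer cells.",
    "AR signalling in healthy tissue.",
]

# Hypothetical molecule dictionary (lowercased gene/protein names).
MOLECULES = {"brca1", "tp53", "ar"}


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def frequent_terms(abstracts, top_n=5):
    """Return the top_n most frequent non-stopword terms with their counts."""
    counts = Counter(
        tok for text in abstracts for tok in tokenize(text) if tok not in STOPWORDS
    )
    return counts.most_common(top_n)


def cooccurrence_network(abstracts, molecules, relevant_terms):
    """Count molecule pairs co-mentioned in abstracts containing a relevant term."""
    edges = Counter()
    for text in abstracts:
        tokens = set(tokenize(text))
        if not tokens & relevant_terms:
            continue  # abstract does not mention any expert-selected term
        present = sorted(m for m in molecules if m in tokens)
        for u, v in combinations(present, 2):
            edges[(u, v)] += 1
    return edges


def rank_by_degree(edges):
    """Rank molecules by weighted degree (an assumed, simple candidate score)."""
    degree = Counter()
    for (u, v), weight in edges.items():
        degree[u] += weight
        degree[v] += weight
    return degree.most_common()


if __name__ == "__main__":
    print(frequent_terms(ABSTRACTS))
    # Stand-in for the domain experts' choice of relevant terms.
    selected = {"cancer"}
    network = cooccurrence_network(ABSTRACTS, MOLECULES, selected)
    print(rank_by_degree(network))
```

In this sketch the expert-selection step is reduced to a fixed set of terms, and molecules that co-occur most often in the relevant abstracts rank highest; the paper's actual network analysis may use a different criterion.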
Index Terms
- Combining information extraction and text mining for cancer biomarker detection