Abstract
Retrieval of biomedical articles about specific research issues (e.g., gene-disease associations) is an essential and routine task for biomedical researchers. An article a can be said to be about a research issue r only if its core content (the goal, background, and conclusion of a) focuses on r. In this paper, we present CoreCE (Core Content Extractor), a technique that, given a biomedical article a, extracts the textual core content of a. The core contents extracted from biomedical articles can be used to index the articles so that search engines can retrieve articles about specific research issues more accurately. Developing CoreCE is challenging because the core content of an article a may be expressed in different ways and scattered throughout a. We tackle the challenge by considering the titles of the references cited by a, as well as the passages in a that explain why the references are cited (i.e., the citation passages). Empirical evaluation shows that representing biomedical articles with the core contents extracted by CoreCE significantly improves retrieval of the articles that biomedical experts judge to be about specific gene-disease associations. CoreCE can thus serve as a front-end processor that preprocesses biomedical scholarly articles for subsequent indexing and retrieval by search engines. The contribution is of technical significance to the retrieval and mining of the evidence already published in biomedical literature.
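The idea of drawing on reference titles and citation passages can be illustrated with a minimal sketch. It is not CoreCE's actual scoring, which the abstract does not specify. It assumes the reference titles and citation passages of an article are already available as plain strings, and it simply ranks terms by how many of those texts mention them. The function names, the tiny stopword list, and the `top_k` parameter are hypothetical and chosen only for illustration.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stopword list (an assumption for this sketch).
STOPWORDS = {"a", "an", "the", "of", "in", "to", "is", "has", "been",
             "and", "for", "with"}


def tokenize(text):
    """Lowercase the text and keep non-stopword tokens that start with a letter."""
    tokens = re.findall(r"[a-z][a-z0-9]*", text.lower())
    return [t for t in tokens if t not in STOPWORDS]


def core_terms(reference_titles, citation_passages, top_k=5):
    """Rank terms by the number of reference titles and citation passages
    that mention them (an illustrative stand-in, not the paper's method)."""
    counts = Counter()
    for text in list(reference_titles) + list(citation_passages):
        # Count each term once per text, so long passages do not dominate.
        counts.update(set(tokenize(text)))
    # Sort by descending count, breaking ties alphabetically for determinism.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:top_k]]


if __name__ == "__main__":
    titles = ["BRCA1 mutations in breast cancer families",
              "Risk of breast cancer in BRCA1 carriers"]
    passages = ["BRCA1 has been linked to hereditary breast cancer [1].",
                "Carriers show elevated breast cancer risk [2]."]
    print(core_terms(titles, passages, top_k=3))
```

Terms that recur across many cited titles and citation passages are treated here as likely core-content terms. A real system would presumably also weight terms against the article's own text and a background collection.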
Notes
- 1.
The database update procedures of Genetics Home Reference and Online Mendelian Inheritance in Man can be found at http://ghr.nlm.nih.gov/ExpertReviewers and http://www.omim.org/about, respectively.
- 2.
Google Scholar is available at https://scholar.google.com.
- 3.
PubMed is available at http://www.ncbi.nlm.nih.gov/pubmed.
- 4.
The method PubMed uses to retrieve related articles is described at http://www.ncbi.nlm.nih.gov/books/NBK3827/#pubmedhelp.Computation_of_Similar_Articl.
- 5.
When extracting the α words, stopwords are excluded and hence not counted.
- 6.
- 7.
DisGeNET is available at http://www.disgenet.org/web/DisGeNET/menu/home.
- 8.
GAD is available at http://geneticassociationdb.nih.gov.
- 9.
CTD is available at http://ctdbase.org.
- 10.
PMC provides full-text biomedical articles at http://www.ncbi.nlm.nih.gov/pmc. All articles that are not included in PMC are excluded from the experiments.
Acknowledgment
This research was supported by the Ministry of Science and Technology of Taiwan under the grant MOST 104-2221-E-320-005.
Copyright information
© 2016 Springer International Publishing Switzerland
About this paper
Cite this paper
Liu, RL. (2016). Citation-Based Extraction of Core Contents from Biomedical Articles. In: Fujita, H., Ali, M., Selamat, A., Sasaki, J., Kurematsu, M. (eds) Trends in Applied Knowledge-Based Systems and Data Science. IEA/AIE 2016. Lecture Notes in Computer Science, vol 9799. Springer, Cham. https://doi.org/10.1007/978-3-319-42007-3_19
DOI: https://doi.org/10.1007/978-3-319-42007-3_19
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-42006-6
Online ISBN: 978-3-319-42007-3
eBook Packages: Computer Science, Computer Science (R0)