Abstract
Extracting relevant information is a pressing need of the scholarly community. A number of systems are available for finding relevant information in the scientific literature, such as search engines, citation indexes, and digital libraries. For a search query, however, users are often presented with a long list of irrelevant documents, mainly because of the huge number of full-text documents available and, furthermore, because of the unstructured nature of the indexed scientific resources. Contemporary systems have formally defined the structure of scientific documents. However, populating this already available enriched structure from unstructured or semi-structured scientific documents has not been addressed previously. In this paper, we design, implement, and evaluate an automated technique that tags each paper’s content with the logical sections that appear in the scientific document. The proposed system has been evaluated against a benchmark and subsequently compared with machine learning techniques that may be used for the same task. The results show empirically that the overall correctness and completeness of the proposed technique are 0.78 and 0.79, respectively, giving an overall accuracy of about 0.78. These results compare favorably with machine learning based classification. The developed system may help future information retrieval systems, digital libraries, and citation indexes to index, retrieve, rank, and visualize the most relevant scientific documents for the scientific community.
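As an illustration of how scores of this kind can be computed, the following minimal sketch assumes that "correctness" and "completeness" correspond to precision and recall over (text unit, section label) pairs. That mapping, the function names, and the toy section labels are all assumptions made for illustration; this is not the authors' evaluation code.

```python
def precision_recall(gold, predicted):
    """Micro-averaged precision and recall over (unit_id, section_label) pairs.

    Assumption: precision stands in for "correctness" and recall for
    "completeness"; the paper's exact definitions may differ.
    """
    gold_set = set(gold)
    pred_set = set(predicted)
    true_pos = len(gold_set & pred_set)  # pairs tagged with the right section
    precision = true_pos / len(pred_set) if pred_set else 0.0
    recall = true_pos / len(gold_set) if gold_set else 0.0
    return precision, recall


def f_measure(precision, recall):
    """Harmonic mean of precision and recall (F1)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical toy benchmark: four text units with their gold section labels.
gold = [(1, "Introduction"), (2, "Methodology"), (3, "Results"), (4, "Conclusion")]
# Hypothetical tagger output: unit 2 is mislabeled and unit 4 is left untagged.
predicted = [(1, "Introduction"), (2, "Results"), (3, "Results")]

p, r = precision_recall(gold, predicted)
print(f"precision={p:.3f} recall={r:.3f} f1={f_measure(p, r):.3f}")
```

On the toy data, two of the three predicted tags are right and two of the four gold tags are recovered. Under the same assumption, `f_measure(0.78, 0.79)` evaluates to roughly 0.785, although the abstract does not state how its overall accuracy figure was derived.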






Cite this article
Shahid, A., Afzal, M.T. Section-wise indexing and retrieval of research articles. Cluster Comput 21, 481–492 (2018). https://doi.org/10.1007/s10586-017-0914-4
DOI: https://doi.org/10.1007/s10586-017-0914-4