Abstract
Discourse parsing of scholarly documents underpins standardizing how such documents are written, understanding their content, and quickly locating and extracting specific information from them. As the volume of scholarly documents continues to grow, analyzing them automatically, quickly, and effectively has become a research hotspot. In this paper, we propose a hybrid model that considers both section headers and body text to recognize generic sections in scholarly documents automatically. We conduct a comprehensive analysis of the semantic difference between short phrases and long narrative text chunks on the SectLabel dataset. Experimental results show that our model achieves an \(F_{1}\) score of 91.67% on generic section recognition, outperforming the baseline.
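For reference, the \(F_{1}\) value reported above is conventionally the harmonic mean of precision and recall. The definition below is the standard one, not something stated in the abstract: the symbols \(P\), \(R\), \(TP\), \(FP\), and \(FN\) are introduced here for illustration, and the abstract does not say whether the score is micro- or macro-averaged over the section classes.

```latex
% Standard definition, assuming a single class with
% TP true positives, FP false positives, FN false negatives.
\[
P = \frac{TP}{TP + FP}, \qquad
R = \frac{TP}{TP + FN}, \qquad
F_{1} = \frac{2\,P\,R}{P + R}
\]
```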
Notes
The pdftotext command-line tool: https://www.xpdfreader.com/pdftotext-man.html
About this article
Cite this article
Li, S., Wang, Q. A hybrid approach to recognize generic sections in scholarly documents. IJDAR 24, 339–348 (2021). https://doi.org/10.1007/s10032-021-00381-5
DOI: https://doi.org/10.1007/s10032-021-00381-5