
Exploiting pivot words to classify and summarize discourse facets of scientific papers

Abstract

The ever-increasing number of published scientific articles has prompted the need for automated, data-driven approaches to summarizing their content. The Computational Linguistics Scientific Document Summarization Shared Task (CL-SciSumm 2019) has recently fostered the study and development of new text mining and machine learning solutions to the summarization problem tailored to the academic domain. In CL-SciSumm, a Reference Paper (RP) is associated with a set of Citing Papers (CPs), all containing citations to the RP. In each CP, the text spans (i.e., citances) that pertain to a particular citation to the RP have been identified. The task of identifying the spans of text in the RP that most accurately reflect each citance is addressed using supervised approaches. This paper proposes a new, more effective solution to the CL-SciSumm discourse facet classification task, which entails identifying, for each cited text span, the facet of the paper it belongs to from a predefined set of facets. It also proposes to extend the set of traditional CL-SciSumm tasks with a new one, namely the discourse facet summarization task. The idea behind it is to extract facet-specific descriptions of each RP consisting of a fixed-length collection of the RP's text spans. To tackle both the standard and the new tasks, we propose machine learning supported solutions based on the extraction of a selection of discriminating words, called pivot words. Predictive features based on pivot words are shown to be of great importance in rating the pertinence and relevance of a text span to a given facet. The newly proposed facet classification method performs significantly better than the best-performing CL-SciSumm 2019 participant (classification accuracy increases by 8%), whereas regression methods achieve promising results on the newly proposed summarization task.
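To make the notion of pivot words concrete, the following minimal sketch (a simplification under assumed thresholds, not the authors' exact algorithm; the function names and parameters are illustrative) treats a word as a pivot for a facet when it occurs almost exclusively in text spans labelled with that facet, and then derives one feature per facet as the fraction of a span's tokens that are pivots of that facet.

```python
# Minimal sketch of the pivot-word idea (illustrative, not the authors' exact
# algorithm): a word is treated as a pivot for a facet when it occurs almost
# only in spans labelled with that facet. Thresholds are assumed values.
from collections import Counter, defaultdict

def extract_pivot_words(spans, labels, min_count=5, min_precision=0.8):
    """spans: list of token lists; labels: one facet label per span."""
    per_facet = defaultdict(Counter)   # word counts within each facet
    overall = Counter()                # word counts over all spans
    for tokens, facet in zip(spans, labels):
        per_facet[facet].update(tokens)
        overall.update(tokens)

    pivots = defaultdict(set)
    for facet, counts in per_facet.items():
        for word, count in counts.items():
            # "precision" of the word with respect to this facet
            if overall[word] >= min_count and count / overall[word] >= min_precision:
                pivots[facet].add(word)
    return pivots

def pivot_features(tokens, pivots):
    """One feature per facet: fraction of the span's tokens that are pivots."""
    n = max(len(tokens), 1)
    return {facet: sum(token in words for token in tokens) / n
            for facet, words in pivots.items()}
```

Feature values of this kind could then be combined with other span-level features and passed to a standard supervised classifier or regressor, in line with the machine learning supported solutions described above.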

Notes

  1. The formulated task has not been included among the official tasks of the CL-SciSumm challenges.

  2. https://tac.nist.gov/2014/BiomedSumm/index.html.

  3. This task was optional in the BIRNDL CL-SciSumm 2019 challenge.

  4. http://www.scikit-learn.org.

  5. https://github.com/FranxYao/pivot_analysis.

  6. We exploit the feature importance function of Scikit-Learn to measure the relevance of each input feature to the classification phase (a minimal sketch of this step is given after these notes).

  7. The approach presented by Li et al. (2019) has been re-implemented to the best of the authors' understanding.
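As a companion to note 6, the following minimal sketch, assuming a tree-based classifier trained on synthetic placeholder features, shows how per-feature relevance can be read from scikit-learn's feature_importances_ attribute; the feature names are hypothetical.

```python
# Hedged illustration of reading per-feature relevance from a scikit-learn
# tree ensemble via feature_importances_ (cf. note 6). The feature matrix,
# labels, and feature names below are synthetic placeholders.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.random((200, 4))                          # placeholder feature matrix
y = (X[:, 0] + 0.5 * X[:, 2] > 0.8).astype(int)   # synthetic binary labels
feature_names = ["pivot_ratio", "span_length", "tfidf_similarity", "position"]

clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
for name, score in sorted(zip(feature_names, clf.feature_importances_),
                          key=lambda pair: -pair[1]):
    print(f"{name}: {score:.3f}")
```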

References

  • Abu-Jbara, A., & Radev, D. (2011). Coherent citation-based summarization of scientific papers. In Proceedings of the 49th annual meeting of the association for computational linguistics: Human language technologies —HLT ’11 (Vol. 1, pp. 500–509). USA: Association for Computational Linguistics.

  • Baralis, E., & Cagliero, L. (2018). Highlighter: Automatic highlighting of electronic learning documents. IEEE Transactions on Emerging Topics in Computing, 6(1), 7–19. https://doi.org/10.1109/TETC.2017.2681655.

  • Baruah, G., & Kolla, M. (2018). Klick labs at CL-SciSumm 2018. In BIRNDL@SIGIR, CEUR workshop proceedings (Vol. 2132, pp. 134–141). CEUR-WS.org.

  • Beltagy, I., Lo, K., & Cohan, A. (2019). Scibert: A pretrained language model for scientific text. In Proceedings of the 2019 conference on empirical methods in natural language processing and the 9th international joint conference on natural language processing (EMNLP-IJCNLP) (pp. 3606–3611).

  • Bird, S., Klein, E., & Loper, E. (2009). Natural language processing with Python: Analyzing text with the natural language toolkit. Sebastopol: O’Reilly Media, Inc.

  • Cagliero, L., Farinetti, L., & Baralis, E. (2019). Recommending personalized summaries of teaching materials. IEEE Access, 7, 22729–22739. https://doi.org/10.1109/ACCESS.2019.2899655.

  • Cagliero, L., Garza, P., & Baralis, E. (2019). ELSA: A multilingual document summarization algorithm based on frequent itemsets and latent semantic analysis. ACM Transactions on Information Systems, 37(2), 21:1–21:33. https://doi.org/10.1145/3298987.

  • Chandrasekaran, M. K., Yasunaga, M., Radev, D., Freitag, D., & Kan, M.-Y. (2019). Overview and results: CL-SciSumm shared task 2019. In Proceedings of the 4th joint workshop on bibliometric-enhanced information retrieval and natural language processing for digital libraries (BIRNDL 2019) @ SIGIR 2019. Paris, France.

  • Cheng, J., & Lapata, M. (2016). Neural summarization by extracting sentences and words. In Proceedings of the 54th annual meeting of the association for computational linguistics (Long papers) (Vol. 1, pp. 484–494). Berlin, Germany: Association for Computational Linguistics. https://doi.org/10.18653/v1/P16-1046. https://www.aclweb.org/anthology/P16-1046.

  • Collins, E., Augenstein, I., & Riedel, S. (2017). A supervised approach to extractive summarisation of scientific papers. In Proceedings of the 21st conference on computational natural language learning (CoNLL 2017) (pp. 195–205). Vancouver, Canada: Association for Computational Linguistics. https://doi.org/10.18653/v1/K17-1021. https://www.aclweb.org/anthology/K17-1021.

  • Davoodi, E., Madan, K., & Gu, J. (2018). CLSciSumm shared task: On the contribution of similarity measure and natural language processing features for citing problem. In BIRNDL@SIGIR, CEUR workshop proceedings (Vol. 2132, pp. 96–101). CEUR-WS.org.

  • Fu, Y., Zhou, H., Chen, J., & Li, L. (2019). Rethinking text attribute transfer: A lexical analysis. In K. van Deemter, C. Lin, & H. Takamura (Eds.), Proceedings of the 12th international conference on natural language generation, INLG 2019, October 29–November 1, 2019 (pp. 24–33). Tokyo, Japan: Association for Computational Linguistics. https://aclweb.org/anthology/papers/W/W19/W19-8604/.

  • Giannakopoulos, G. (2013). Multi-document multilingual summarization and evaluation tracks in ACL 2013 multiling workshop. In Proceedings of the multiling 2013 workshop on multilingual multi-document summarization (pp. 20–28). Association for Computational Linguistics. http://www.aclweb.org/anthology/W13-3103.

  • Giannakopoulos, G., Kubina, J., Conroy, J. M., Steinberger, J., Favre, B., Kabadjov, M. A., Kruschwitz, U., & Poesio, M. (2015). MultiLing 2015: Multilingual summarization of single and multi-documents, on-line fora, and call-center conversations. In Proceedings of the SIGDIAL 2015 conference, the 16th annual meeting of the special interest group on discourse and dialogue, 2–4 September 2015 (pp. 270–274). Prague, Czech Republic. http://aclweb.org/anthology/W/W15/W15-4638.pdf.

  • Jaidka, K., Chandrasekaran, M. K., Rustagi, S., & Kan, M. -Y. (2016). Overview of the CL-SciSumm 2016 shared task. In Proceedings of joint workshop on bibliometric-enhanced information retrieval and NLP for digital libraries.

  • Jaidka, K., Yasunaga, M., Chandrasekaran, M., Radev, D., & Kan, M.-Y. (2018). The CL-SciSumm shared task 2018: Results and key insights (pp. 1–10).

  • Jaidka, K., Yasunaga, M., Chandrasekaran, M. K., Radev, D., & Kan, M. Y. (2019). The CL-SciSumm shared task 2018: Results and key insights. arXiv preprint arXiv:1909.00764.

  • Kedzie, C., McKeown, K., & Daumé III, H. (2018). Content selection in deep learning models of summarization. In Proceedings of the 2018 conference on empirical methods in natural language processing (pp. 1818–1828).

  • Kim, M., Moirangthem, D. S., & Lee, M. (2016). Towards abstraction from extraction: Multiple timescale gated recurrent unit for summarization. In Rep4NLP@ACL (pp. 70–77). Association for Computational Linguistics.

  • Kumar Chandrasekaran, M., Jaidka, K., & Mayr, P. (2018). Joint workshop on bibliometric-enhanced information retrieval and natural language processing for digital libraries (BIRNDL 2018). In The 41st international ACM SIGIR conference on research & development in information retrieval, SIGIR ’18 (pp. 1415–1418). New York, NY, USA: ACM. https://doi.org/10.1145/3209978.3210194.

  • Kusner, M. J., Sun, Y., Kolkin, N. I., & Weinberger, K. Q. (2015). From word embeddings to document distances. In Proceedings of the 32nd international conference on international conference on machine learning—ICML’15 (Vol. 37, pp. 957–966). JMLR.org.

  • La Quatra, M., Cagliero, L., & Baralis, E. (2019). Poli2sum@CL-SciSumm-19: Identify, classify, and summarize cited text spans by means of ensembles of supervised models (pp. 233–246). https://www2.scopus.com/inward/record.uri?eid=2-s2.0-85071194418&partnerID=40&md5=e8f54672c3477c87a07010397cc60d28.

  • Leskovec, J., Rajaraman, A., & Ullman, J. D. (2014). Mining of massive datasets (2nd ed.). New York, NY: Cambridge University Press.

  • Li, L., Chi, J., Chen, M., Huang, Z., Zhu, Y., & Fu, X. (2018). CIST@CLSciSumm-18: Methods for computational linguistics scientific citation linkage, facet classification and summarization. In BIRNDL@SIGIR, CEUR workshop proceedings (Vol. 2132, pp. 84–95). CEUR-WS.org.

  • Li, L., Zhu, Y., Xie, Y., Huang, Z., Liu, W., Li, X., & Liu, Y. (2019). CIST@CLSciSumm-19: Automatic scientific paper summarization with citances and facets. In BIRNDL@SIGIR.

  • Lin, C. -Y., & Hovy, E. (2003). Automatic evaluation of summaries using N-gram co-occurrence statistics. In Proceedings of the North American chapter of the association for computational linguistics on human language technology (Vol. 1, pp. 71–78).

  • Lloret, E., Romá-Ferri, M. T., & Palomar, M. (2013). Compendium: A text summarization system for generating abstracts of research papers. Data & Knowledge Engineering, 88, 164–175. https://doi.org/10.1016/j.datak.2013.08.005.

  • Ma, S., Jin, X., & Zhang, C. (2018). Automatic identification of cited text spans: A multi-classifier approach over imbalanced dataset. Scientometrics, 116(2), 1303–1330.

  • Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S., & Dean, J. (2013). Distributed representations of words and phrases and their compositionality. In Advances in neural information processing systems (pp. 3111–3119).

  • Naik, A. P., & Bojewar, S. (2017). Tweet analytics and tweet summarization using graph mining. In 2017 international conference of electronics, communication and aerospace technology (ICECA) (Vol. 1, pp. 17–21). https://doi.org/10.1109/ICECA.2017.8203674.

  • Naik, S., Lade, S., Mamidipelli, S., & Save, A. (2018). Tweet summarization: A new approach. In 2018 second international conference on inventive communication and computational technologies (ICICCT) (pp. 1022–1025). https://doi.org/10.1109/ICICCT.2018.8473327.

  • Nakov, P. I., Schwartz, A. S., & Hearst, M. A. (2004). Citances: Citation sentences for semantic analysis of bioscience text. In Proceedings of the SIGIR’04 workshop on search and discovery in bioinformatics.

  • Nallapati, R., Zhai, F., & Zhou, B. (2017). Summarunner: A recurrent neural network based sequence model for extractive summarization of documents. In Proceedings of the thirty-first AAAI conference on artificial intelligence, AAAI’17 (pp. 3075–3081). AAAI Press.

  • Nenkova, A., & McKeown, K. (2012). A survey of text summarization techniques. In C. C. Aggarwal & C. Zhai (Eds.), Mining text data (pp. 43–76). Berlin: Springer.

  • Nikolov, N. I., Pfeiffer, M., & Hahnloser, R. H. R. (2018). Data-driven summarization of scientific articles. In Proceedings of the 7th international workshop on mining scientific publications, LREC 2018.

  • Ovadia, S. (2014). ResearchGate and Academia.edu: Academic social networks. Behavioral & Social Sciences Librarian, 33(3), 165–169. https://doi.org/10.1080/01639269.2014.934093.

  • Page, L., Brin, S., Motwani, R., & Winograd, T. (1999). The PageRank citation ranking: Bringing order to the web, Technical report. Stanford InfoLab.

  • Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., et al. (2011). Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12, 2825–2830.

  • Qazvinian, V., & Radev, D. R. (2008). Scientific paper summarization using citation summary networks. In Proceedings of the 22nd international conference on computational linguistics (Coling 2008) (pp. 689–696). Manchester, UK: Coling 2008 Organizing Committee. https://www.aclweb.org/anthology/C08-1087.

  • Qazvinian, V., & Radev, D. R. (2010). Identifying non-explicit citing sentences for citation-based summarization. In Proceedings of the 48th annual meeting of the association for computational linguistics, ACL ’10 (pp. 555–564). USA: Association for Computational Linguistics.

  • Ronzano, F., & Saggion, H. (2016). An empirical assessment of citation information in scientific summarization. In E. Métais, F. Meziane, M. Saraee, V. Sugumaran, & S. Vadera (Eds.), Natural language processing and information systems (pp. 318–325). Cham: Springer.

  • Saggion, H., & Ronzano, F. (2017). Scholarly data mining: Making sense of scientific literature. In 2017 ACM/IEEE joint conference on digital libraries (JCDL) (pp. 1–2). https://doi.org/10.1109/JCDL.2017.7991622.

  • Schwartz, A. S., & Hearst, M. (2006). Summarizing key concepts using citation sentences. In Proceedings of the HLT-NAACL BioNLP workshop on linking natural language and biology, LNLBioNLP ’06 (pp. 134–135). USA: Association for Computational Linguistics.

  • Sollaci, L. B., & Pereira, M. G. (2004). The introduction, methods, results, and discussion (IMRAD) structure: A fifty-year survey. Journal of the Medical Library Association, 92(3), 364.

  • Sun, X., & Zhuge, H. (2018). Summarization of scientific paper through reinforcement ranking on semantic link network. IEEE Access, 6, 40611–40625. https://doi.org/10.1109/ACCESS.2018.2856530.

  • Tan, P. N., Steinbach, M., Karpatne, A., & Kumar, V. (2018). Introduction to data mining (2nd ed.). New York: Pearson.

  • Wan, S., Dale, R., Dras, M., & Paris, C. (2008). Seed and grow: Augmenting statistically generated summary sentences using schematic word patterns. In Proceedings of the 2008 conference on empirical methods in natural language processing (pp. 543–552).

  • Wan, S., Paris, C., & Dale, R. (2009). Whetting the appetite of scientists: Producing summaries tailored to the citation context. In Proceedings of the 9th ACM/IEEE-CS joint conference on digital libraries (pp. 59–68). ACM.

  • Wan, S., Paris, C., & Dale, R. (2010). Invited paper: Supporting browsing-specific information needs: Introducing the citation-sensitive in-browser summariser. Web Semantics, 8(2–3), 196–202. https://doi.org/10.1016/j.websem.2010.03.002.

  • Wan, S., Paris, C., Muthukrishna, M., & Dale, R. (2009). Designing a citation-sensitive research tool: An initial study of browsing-specific information needs. In Proceedings of the 2009 workshop on text and citation analysis for scholarly digital libraries (NLPIR4DL) (pp. 45–53). Suntec City, Singapore: Association for Computational Linguistics. https://www.aclweb.org/anthology/W09-3606.

  • Wang, P., Li, S., Wang, T., Zhou, H., & Tang, J. (2018). NUDT @ CLSciSumm-18. In Proceedings of the 3rd joint workshop on bibliometric-enhanced information retrieval and natural language processing for digital libraries (BIRNDL 2018) co-located with the 41st international ACM SIGIR conference on research and development in information retrieval (SIGIR 2018) (pp. 102–113). Ann Arbor, USA.

  • Yasunaga, M., Kasai, J., Zhang, R., Fabbri, A., Li, I., Friedman, D., & Radev, D. (2019). ScisummNet: A large annotated corpus and content-impact models for scientific paper summarization with citation networks. In Proceedings of AAAI 2019.

  • Yasunaga, M., Zhang, R., Meelu, K., Pareek, A., Srinivasan, K., & Radev, D. R. (2017). Graph-based neural multi-document summarization. In Proceedings of CoNLL 2017.

Acknowledgements

The research leading to these results has been partly funded by the Smart-Data@PoliTO center for Big Data and Machine Learning technologies. Computational resources were provided by HPC@POLITO, a project of Academic Computing within the Department of Control and Computer Engineering at the Politecnico di Torino (http://www.hpc.polito.it).

Author information

Corresponding author

Correspondence to Moreno La Quatra.

About this article

Cite this article

La Quatra, M., Cagliero, L. & Baralis, E. Exploiting pivot words to classify and summarize discourse facets of scientific papers. Scientometrics 125, 3139–3157 (2020). https://doi.org/10.1007/s11192-020-03532-3
