
Exploiting pivot words to classify and summarize discourse facets of scientific papers

Abstract

The ever-increasing number of published scientific articles has prompted the need for automated, data-driven approaches to summarizing their content. The Computational Linguistics Scientific Document Summarization Shared Task (CL-SciSumm 2019) has recently fostered the study and development of new text mining and machine learning solutions to the summarization problem tailored to the academic domain. In CL-SciSumm, a Reference Paper (RP) is associated with a set of Citing Papers (CPs), all containing citations to the RP. In each CP, the text spans (i.e., citances) that pertain to a particular citation to the RP have been identified. The task of identifying the spans of text in the RP that most accurately reflect each citance is addressed using supervised approaches. This paper proposes a new, more effective solution to the CL-SciSumm discourse facet classification task, which entails identifying, for each cited text span, the facet of the paper it belongs to from a predefined set of facets. It also proposes to extend the set of traditional CL-SciSumm tasks with a new one, namely the discourse facet summarization task. The idea behind it is to extract facet-specific descriptions of each RP consisting of a fixed-length collection of the RP's text spans. To tackle both the standard and the new tasks, we propose machine learning supported solutions based on the extraction of a selection of discriminating words, called pivot words. Predictive features based on pivot words are shown to be of great importance in rating the pertinence and relevance of a text span to a given facet. The newly proposed facet classification method performs significantly better than the best-performing CL-SciSumm 2019 participant (classification accuracy increases by 8%), whereas regression methods achieve promising results on the newly proposed summarization task.
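To make the notion of pivot words concrete, the following minimal sketch (a simplification under assumed thresholds, not the authors' exact algorithm; the function names and parameters are illustrative) treats a word as a pivot for a facet when it occurs almost exclusively in text spans labelled with that facet, and then derives one feature per facet as the fraction of a span's tokens that are pivots of that facet.

```python
# Minimal sketch of the pivot-word idea (illustrative, not the authors' exact
# algorithm): a word is treated as a pivot for a facet when it occurs almost
# only in spans labelled with that facet. Thresholds are assumed values.
from collections import Counter, defaultdict

def extract_pivot_words(spans, labels, min_count=5, min_precision=0.8):
    """spans: list of token lists; labels: one facet label per span."""
    per_facet = defaultdict(Counter)   # word counts within each facet
    overall = Counter()                # word counts over all spans
    for tokens, facet in zip(spans, labels):
        per_facet[facet].update(tokens)
        overall.update(tokens)

    pivots = defaultdict(set)
    for facet, counts in per_facet.items():
        for word, count in counts.items():
            # "precision" of the word with respect to this facet
            if overall[word] >= min_count and count / overall[word] >= min_precision:
                pivots[facet].add(word)
    return pivots

def pivot_features(tokens, pivots):
    """One feature per facet: fraction of the span's tokens that are pivots."""
    n = max(len(tokens), 1)
    return {facet: sum(token in words for token in tokens) / n
            for facet, words in pivots.items()}
```

Feature values of this kind could then be combined with other span-level features and passed to a standard supervised classifier or regressor, in line with the machine learning supported solutions described above.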

Notes

  1. The formulated task has not been included among the official tasks of the CL-SciSumm challenges.

  2. https://tac.nist.gov/2014/BiomedSumm/index.html.

  3. This task was optional in the BIRNDL CL-SciSumm 2019 challenge.

  4. http://www.scikit-learn.org.

  5. https://github.com/FranxYao/pivot_analysis.

  6. We exploit the feature importance function of Scikit-Learn to measure the relevance of each input feature to the classification phase (a minimal sketch of this step is given after these notes).

  7. The approach presented by Li et al. (2019) has been re-implemented to the best of the authors' understanding.
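As a companion to note 6, the following minimal sketch, assuming a tree-based classifier trained on synthetic placeholder features, shows how per-feature relevance can be read from scikit-learn's feature_importances_ attribute; the feature names are hypothetical.

```python
# Hedged illustration of reading per-feature relevance from a scikit-learn
# tree ensemble via feature_importances_ (cf. note 6). The feature matrix,
# labels, and feature names below are synthetic placeholders.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.random((200, 4))                          # placeholder feature matrix
y = (X[:, 0] + 0.5 * X[:, 2] > 0.8).astype(int)   # synthetic binary labels
feature_names = ["pivot_ratio", "span_length", "tfidf_similarity", "position"]

clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
for name, score in sorted(zip(feature_names, clf.feature_importances_),
                          key=lambda pair: -pair[1]):
    print(f"{name}: {score:.3f}")
```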

References

  • Abu-Jbara, A., & Radev, D. (2011). Coherent citation-based summarization of scientific papers. In Proceedings of the 49th annual meeting of the association for computational linguistics: Human language technologies —HLT ’11 (Vol. 1, pp. 500–509). USA: Association for Computational Linguistics.

  • Baralis, E., & Cagliero, L. (2018). Highlighter: Automatic highlighting of electronic learning documents. IEEE Transactions on Emerging Topics in Computing, 6(1), 7–19. https://doi.org/10.1109/TETC.2017.2681655.

  • Baruah, G., & Kolla, M. (2018). Klick labs at CL-SciSumm 2018. In BIRNDL@SIGIR, CEUR workshop proceedings (Vol. 2132, pp. 134–141). CEUR-WS.org.

  • Beltagy, I., Lo, K., & Cohan, A. (2019). Scibert: A pretrained language model for scientific text. In Proceedings of the 2019 conference on empirical methods in natural language processing and the 9th international joint conference on natural language processing (EMNLP-IJCNLP) (pp. 3606–3611).

  • Bird, S., Klein, E., & Loper, E. (2009). Natural language processing with Python: Analyzing text with the natural language toolkit. Sebastopol: O’Reilly Media, Inc.

  • Cagliero, L., Farinetti, L., & Baralis, E. (2019). Recommending personalized summaries of teaching materials. IEEE Access, 7, 22729–22739. https://doi.org/10.1109/ACCESS.2019.2899655.

  • Cagliero, L., Garza, P., & Baralis, E. (2019). ELSA: A multilingual document summarization algorithm based on frequent itemsets and latent semantic analysis. ACM Transactions on Information Systems, 37(2), 21:1–21:33. https://doi.org/10.1145/3298987.

  • Chandrasekaran, M. K., Yasunaga, M., Radev, D., Freitag, D., & Kan, M.-Y. (2019). Overview and results: CL-SciSumm shared task 2019. In Proceedings of the 4th joint workshop on bibliometric-enhanced information retrieval and natural language processing for digital libraries (BIRNDL 2019) @ SIGIR 2019. Paris, France.

  • Cheng, J., & Lapata, M. (2016). Neural summarization by extracting sentences and words. In Proceedings of the 54th annual meeting of the association for computational linguistics (Long papers) (Vol. 1, pp. 484–494). Berlin, Germany: Association for Computational Linguistics. https://doi.org/10.18653/v1/P16-1046. https://www.aclweb.org/anthology/P16-1046.

  • Collins, E., Augenstein, I., & Riedel, S. (2017). A supervised approach to extractive summarisation of scientific papers. In Proceedings of the 21st conference on computational natural language learning (CoNLL 2017) (pp. 195–205). Vancouver, Canada: Association for Computational Linguistics. https://doi.org/10.18653/v1/K17-1021. https://www.aclweb.org/anthology/K17-1021.

  • Davoodi, E., Madan, K., & Gu, J. (2018). CLSciSumm shared task: On the contribution of similarity measure and natural language processing features for citing problem. In BIRNDL@SIGIR, CEUR workshop proceedings (Vol. 2132, pp. 96–101). CEUR-WS.org.

  • Fu, Y., Zhou, H., Chen, J., & Li, L. (2019). Rethinking text attribute transfer: A lexical analysis. In K. van Deemter, C. Lin, & H. Takamura (Eds.), Proceedings of the 12th international conference on natural language generation, INLG 2019, October 29–November 1, 2019 (pp. 24–33). Tokyo, Japan: Association for Computational Linguistics. https://aclweb.org/anthology/papers/W/W19/W19-8604/.

  • Giannakopoulos, G. (2013). Multi-document multilingual summarization and evaluation tracks in ACL 2013 multiling workshop. In Proceedings of the multiling 2013 workshop on multilingual multi-document summarization (pp. 20–28). Association for Computational Linguistics. http://www.aclweb.org/anthology/W13-3103.

  • Giannakopoulos, G., Kubina, J., Conroy, J. M., Steinberger, J., Favre, B., Kabadjov, M. A., Kruschwitz, U., & Poesio, M. (2015). MultiLing 2015: Multilingual summarization of single and multi-documents, on-line fora, and call-center conversations. In Proceedings of the SIGDIAL 2015 conference, the 16th annual meeting of the special interest group on discourse and dialogue, 2–4 September 2015 (pp. 270–274). Prague, Czech Republic. http://aclweb.org/anthology/W/W15/W15-4638.pdf.

  • Jaidka, K., Chandrasekaran, M. K., Rustagi, S., & Kan, M. -Y. (2016). Overview of the CL-SciSumm 2016 shared task. In Proceedings of joint workshop on bibliometric-enhanced information retrieval and NLP for digital libraries.

  • Jaidka, K., Yasunaga, M., Chandrasekaran, M., Radev, D., & Kan, M.-Y. (2018). The CL-SciSumm shared task 2018: Results and key insights (pp. 1–10).

  • Jaidka, K., Yasunaga, M., Chandrasekaran, M. K., Radev, D., & Kan, M. Y. (2019). The CL-SciSumm shared task 2018: Results and key insights. arXiv preprint arXiv:1909.00764.

  • Kedzie, C., McKeown, K., & Daumé III, H. (2018). Content selection in deep learning models of summarization. In Proceedings of the 2018 conference on empirical methods in natural language processing (pp. 1818–1828).

  • Kim, M., Moirangthem, D. S., & Lee, M. (2016). Towards abstraction from extraction: Multiple timescale gated recurrent unit for summarization. In Rep4NLP@ACL (pp. 70–77). Association for Computational Linguistics.

  • Kumar Chandrasekaran, M., Jaidka, K., & Mayr, P. (2018). Joint workshop on bibliometric-enhanced information retrieval and natural language processing for digital libraries (BIRNDL 2018). In The 41st international ACM SIGIR conference on research & development in information retrieval, SIGIR ’18 (pp. 1415–1418). New York, NY, USA: ACM. https://doi.org/10.1145/3209978.3210194.

  • Kusner, M. J., Sun, Y., Kolkin, N. I., & Weinberger, K. Q. (2015). From word embeddings to document distances. In Proceedings of the 32nd international conference on international conference on machine learning—ICML’15 (Vol. 37, pp. 957–966). JMLR.org.

  • La Quatra, M., Cagliero, L., & Baralis, E. (2019). Poli2sum@CL-SciSumm-19: Identify, classify, and summarize cited text spans by means of ensembles of supervised models (pp. 233–246). https://www2.scopus.com/inward/record.uri?eid=2-s2.0-85071194418&partnerID=40&md5=e8f54672c3477c87a07010397cc60d28.

  • Leskovec, J., Rajaraman, A., & Ullman, J. D. (2014). Mining of massive datasets (2nd ed.). New York, NY: Cambridge University Press.

  • Li, L., Chi, J., Chen, M., Huang, Z., Zhu, Y., & Fu, X. (2018). CIST@CLSciSumm-18: Methods for computational linguistics scientific citation linkage, facet classification and summarization. In BIRNDL@SIGIR, CEUR workshop proceedings (Vol. 2132, pp. 84–95). CEUR-WS.org.

  • Li, L., Zhu, Y., Xie, Y., Huang, Z., Liu, W., Li, X., & Liu, Y. (2019). CIST@CLSciSumm-19: Automatic scientific paper summarization with citances and facets. In BIRNDL@SIGIR.

  • Lin, C. -Y., & Hovy, E. (2003). Automatic evaluation of summaries using N-gram co-occurrence statistics. In Proceedings of the North American chapter of the association for computational linguistics on human language technology (Vol. 1, pp. 71–78).

  • Lloret, E., Romá-Ferri, M. T., & Palomar, M. (2013). Compendium: A text summarization system for generating abstracts of research papers. Data & Knowledge Engineering, 88, 164–175. https://doi.org/10.1016/j.datak.2013.08.005.

  • Ma, S., Jin, X., & Zhang, C. (2018). Automatic identification of cited text spans: A multi-classifier approach over imbalanced dataset. Scientometrics, 116(2), 1303–1330.

  • Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S., & Dean, J. (2013). Distributed representations of words and phrases and their compositionality. In Advances in neural information processing systems (pp. 3111–3119).

  • Naik, A. P., & Bojewar, S. (2017). Tweet analytics and tweet summarization using graph mining. In 2017 international conference of electronics, communication and aerospace technology (ICECA) (Vol. 1, pp. 17–21). https://doi.org/10.1109/ICECA.2017.8203674.

  • Naik, S., Lade, S., Mamidipelli, S., & Save, A. (2018). Tweet summarization: A new approach. In 2018 second international conference on inventive communication and computational technologies (ICICCT) (pp. 1022–1025). https://doi.org/10.1109/ICICCT.2018.8473327.

  • Nakov, P. I., Schwartz, A. S., & Hearst, M. A. (2004). Citances: Citation sentences for semantic analysis of bioscience text. In Proceedings of the SIGIR’04 workshop on search and discovery in bioinformatics.

  • Nallapati, R., Zhai, F., & Zhou, B. (2017). Summarunner: A recurrent neural network based sequence model for extractive summarization of documents. In Proceedings of the thirty-first AAAI conference on artificial intelligence, AAAI’17 (pp. 3075–3081). AAAI Press.

  • Nenkova, A., & McKeown, K. (2012). A survey of text summarization techniques. In C. C. Aggarwal & C. Zhai (Eds.), Mining text data (pp. 43–76). Berlin: Springer.

  • Nikolov, N. I., Pfeiffer, M., & Hahnloser, R. H. R. (2018). Data-driven summarization of scientific articles. In Proceedings of the 7th international workshop on mining scientific publications, LREC 2018.

  • Ovadia, S. (2014). ResearchGate and Academia.edu: Academic social networks. Behavioral & Social Sciences Librarian, 33(3), 165–169. https://doi.org/10.1080/01639269.2014.934093.

  • Page, L., Brin, S., Motwani, R., & Winograd, T. (1999). The PageRank citation ranking: Bringing order to the web, Technical report. Stanford InfoLab.

  • Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., et al. (2011). Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12, 2825–2830.

  • Qazvinian, V., & Radev, D. R. (2008). Scientific paper summarization using citation summary networks. In Proceedings of the 22nd international conference on computational linguistics (Coling 2008) (pp. 689–696). Manchester, UK: Coling 2008 Organizing Committee. https://www.aclweb.org/anthology/C08-1087.

  • Qazvinian, V., & Radev, D. R. (2010). Identifying non-explicit citing sentences for citation-based summarization. In Proceedings of the 48th annual meeting of the association for computational linguistics, ACL ’10 (pp. 555–564). USA: Association for Computational Linguistics.

  • Ronzano, F., & Saggion, H. (2016). An empirical assessment of citation information in scientific summarization. In E. Métais, F. Meziane, M. Saraee, V. Sugumaran, & S. Vadera (Eds.), Natural language processing and information systems (pp. 318–325). Cham: Springer.

  • Saggion, H., & Ronzano, F. (2017). Scholarly data mining: Making sense of scientific literature. In 2017 ACM/IEEE joint conference on digital libraries (JCDL) (pp. 1–2). https://doi.org/10.1109/JCDL.2017.7991622.

  • Schwartz, A. S., & Hearst, M. (2006). Summarizing key concepts using citation sentences. In Proceedings of the HLT-NAACL BioNLP workshop on linking natural language and biology, LNLBioNLP ’06 (pp. 134–135). USA: Association for Computational Linguistics.

  • Sollaci, L. B., & Pereira, M. G. (2004). The introduction, methods, results, and discussion (IMRAD) structure: A fifty-year survey. Journal of the Medical Library Association, 92(3), 364.

  • Sun, X., & Zhuge, H. (2018). Summarization of scientific paper through reinforcement ranking on semantic link network. IEEE Access, 6, 40611–40625. https://doi.org/10.1109/ACCESS.2018.2856530.

  • Tan, P. N., Steinbach, M., Karpatne, A., & Kumar, V. (2018). Introduction to data mining (2nd ed.). New York: Pearson.

  • Wan, S., Dale, R., Dras, M., & Paris, C. (2008). Seed and grow: Augmenting statistically generated summary sentences using schematic word patterns. In Proceedings of the 2008 conference on empirical methods in natural language processing (pp. 543–552).

  • Wan, S., Paris, C., & Dale, R. (2009). Whetting the appetite of scientists: Producing summaries tailored to the citation context. In Proceedings of the 9th ACM/IEEE-CS joint conference on digital libraries (pp. 59–68). ACM.

  • Wan, S., Paris, C., & Dale, R. (2010). Invited paper: Supporting browsing-specific information needs: Introducing the citation-sensitive in-browser summariser. Web Semantics, 8(2–3), 196–202. https://doi.org/10.1016/j.websem.2010.03.002.

  • Wan, S., Paris, C., Muthukrishna, M., & Dale, R. (2009). Designing a citation-sensitive research tool: An initial study of browsing-specific information needs. In Proceedings of the 2009 workshop on text and citation analysis for scholarly digital libraries (NLPIR4DL) (pp. 45–53). Suntec City, Singapore: Association for Computational Linguistics. https://www.aclweb.org/anthology/W09-3606.

  • Wang, P., Li, S., Wang, T., Zhou, H., & Tang, J. (2018). NUDT @ CLSciSumm-18. In Proceedings of the 3rd joint workshop on bibliometric-enhanced information retrieval and natural language processing for digital libraries (BIRNDL 2018) co-located with the 41st international ACM SIGIR conference on research and development in information retrieval (SIGIR 2018) (pp. 102–113). Ann Arbor, USA.

  • Yasunaga, M., Kasai, J., Zhang, R., Fabbri, A., Li, I., Friedman, D., & Radev, D. (2019). ScisummNet: A large annotated corpus and content-impact models for scientific paper summarization with citation networks. In Proceedings of AAAI 2019.

  • Yasunaga, M., Zhang, R., Meelu, K., Pareek, A., Srinivasan, K., & Radev, D. R. (2017). Graph-based neural multi-document summarization. In Proceedings of CoNLL 2017.

Acknowledgements

The research leading to these results has been partly funded by the Smart-Data@PoliTO center for Big Data and Machine Learning technologies. Computational resources were provided by HPC@POLITO, a project of Academic Computing within the Department of Control and Computer Engineering at the Politecnico di Torino (http://www.hpc.polito.it).

Author information

Corresponding author

Correspondence to Moreno La Quatra.

About this article

Cite this article

La Quatra, M., Cagliero, L. & Baralis, E. Exploiting pivot words to classify and summarize discourse facets of scientific papers. Scientometrics 125, 3139–3157 (2020). https://doi.org/10.1007/s11192-020-03532-3
