Section mixture models for scientific document summarization

Conroy, John M.; Davis, Sashka T.

doi:10.1007/s00799-017-0218-6

Published: 17 May 2017

Volume 19, pages 305–322 (2018)

International Journal on Digital Libraries

John M. Conroy¹ & Sashka T. Davis¹,²

Abstract

In this paper, we present a system for summarization of scientific and structured documents that has three components: section mixture models that estimate the weights of terms; a hypothesis test that selects a subset of these terms; and a sentence extractor based on techniques for combinatorial optimization. The section mixture models approach is an adaptation of a bigram mixture model based on the main sections of a scientific document and a collection of citing sentences (citances) from papers that reference the document. The model was adapted from earlier work on Biomedical documents used in the summarization task of the 2014 Text Analysis Conference (TAC 2014). The mixture model trained on the Biomedical data was also applied to the data for the Computational Linguistics scientific summarization task of the Joint Workshop on Bibliometric-enhanced Information Retrieval and Natural Language Processing for Digital Libraries (CL-SciSumm 2016). This model gives rise to machine-generated summaries with ROUGE scores that are nearly as strong as those seen on the Biomedical data, and it was also the highest scoring submission to the task of generating a human summary. For sentence extraction, we use the OCCAMS algorithm (Davis et al., in: Vreeken, Ling, Zaki, Siebes, Yu, Goethals, Webb, Wu (eds) ICDM workshops, IEEE Computer Society, pp 454–463, 2012), which takes the sentences from the original document and the term weights computed by the language models and outputs a set of minimally overlapping sentences whose combined term coverage is maximized. Finally, we explore the importance of an appropriate background model for the hypothesis test that selects terms, with the goal of achieving the best quality summaries.
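
To make the last two stages of the pipeline concrete, the following is a minimal sketch under stated assumptions, not the authors' implementation. The abstract does not name the hypothesis test, so the sketch assumes a log-likelihood ratio statistic in the style of Dunning (1993), which the notes point to. The extractor shown is a plain greedy heuristic for budgeted weighted term coverage; it stands in for OCCAMS and does not reproduce it. The function names `g2` and `greedy_cover`, the toy sentences, and the weights are all hypothetical.

```python
import math


def g2(k1, n1, k2, n2):
    """Log-likelihood ratio statistic for one term.

    k1 of n1 tokens in the document are this term; k2 of n2 tokens in the
    background corpus are this term. Large values suggest the term's rate
    differs between document and background.
    """
    def loglik(k, n, p):
        # Binomial log-likelihood, dropping the constant binomial coefficient.
        out = 0.0
        if k > 0:
            out += k * math.log(p)
        if n - k > 0:
            out += (n - k) * math.log(1.0 - p)
        return out

    p1, p2 = k1 / n1, k2 / n2
    pooled = (k1 + k2) / (n1 + n2)
    return 2.0 * (loglik(k1, n1, p1) + loglik(k2, n2, p2)
                  - loglik(k1, n1, pooled) - loglik(k2, n2, pooled))


def greedy_cover(sentences, weights, budget):
    """Greedy budgeted coverage (an illustrative stand-in, not OCCAMS).

    sentences: list of (text, set_of_terms) pairs
    weights:   dict mapping term -> non-negative weight
    budget:    maximum total summary length in whitespace-separated words
    Returns the indices of the chosen sentences in selection order.
    """
    chosen, covered, used = [], set(), 0
    remaining = list(range(len(sentences)))
    while remaining:
        best, best_ratio = None, 0.0
        for i in remaining:
            text, terms = sentences[i]
            length = len(text.split())
            if length == 0 or used + length > budget:
                continue
            # Only terms not yet covered contribute, which discourages overlap.
            gain = sum(weights.get(t, 0.0) for t in terms - covered)
            ratio = gain / length
            if ratio > best_ratio:
                best, best_ratio = i, ratio
        if best is None:  # nothing fits or nothing adds new weight
            break
        chosen.append(best)
        covered |= sentences[best][1]
        used += len(sentences[best][0].split())
        remaining.remove(best)
    return chosen


# Hypothetical toy data: terms would be bigrams in the paper's setting.
toy_sentences = [
    ("section models weight terms", {"section", "weight"}),
    ("section models weight terms again here", {"section", "weight"}),
    ("sentences are extracted greedily", {"extract"}),
]
toy_weights = {"section": 2.0, "weight": 1.0, "extract": 1.5}
summary = greedy_cover(toy_sentences, toy_weights, budget=20)
```

In this sketch a term would be kept when `g2` exceeds a chi-square critical value with one degree of freedom (3.84 at the 0.05 level), and the surviving terms would supply the `weights` passed to `greedy_cover`. The second toy sentence is never selected because it adds no uncovered terms, which is the behavior "minimally overlapping" describes.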

Notes

See https://tac.nist.gov/2014/BiomedSumm/guidelines.html for detailed information.
https://catalog.ldc.upenn.edu.
An excellent overview of the paper by Ted Dunning can be accessed at http://tdunning.blogspot.com/2008/03/surprise-and-coincidence.html.
https://www.ldc.upenn.edu.
Figure derived from [2] with permission of author.
http://www.rxnlp.com/rouge-2-0.
https://github.com/chbrown/acl-anthology-network.

References

Abu-Jbara, A., Radev, D.: Coherent citation-based summarization of scientific papers. In: Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pp. 500–509. Association for Computational Linguistics, Portland, Oregon, USA (2011). http://www.aclweb.org/anthology/P11-1051
Cabanac, G., Chandrasekaran, M.K., Frommholz, I., Jaidka, K., Kan, M.Y., Mayr, P., Wolfram, D.: Joint workshop on bibliometric-enhanced information retrieval and natural language processing for digital libraries (BIRNDL 2016). In: Proceedings of the 16th ACM/IEEE-CS on Joint Conference on Digital Libraries, JCDL ’16, pp. 299–300. ACM, New York, NY, USA (2016). doi:10.1145/2910896.2926734
Cohan, A., Goharian, N.: Scientific article summarization using citation-context and article’s discourse structure. In: Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pp. 390–400. Association for Computational Linguistics (2015). http://aclweb.org/anthology/D15-1045
Cohan, A., Soldaini, L., Goharian, N.: Matching citation text and cited spans in biomedical literature: a search-oriented approach. In: Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 1042–1048. Association for Computational Linguistics (2015). http://www.aclweb.org/anthology/N15-1110
Conroy, J.M., Davis, S., Kubina, J., Liu, Y.K., O’Leary, D.P., Schlesinger, J.D.: Multilingual summarization: dimensionality reduction and a step towards optimal term coverage. In: ACL, MultiLing Workshop, pp. 454–463 (2013)
Conroy, J.M., Davis, S.T.: Vector space and language models for scientific document summarization. In: Proceedings of NAACL-HLT, pp. 186–191 (2015)
Davis, S.T., Conroy, J.M., Schlesinger, J.D.: OCCAMS—an optimal combinatorial covering algorithm for multi-document summarization. In: Vreeken, J., Ling, C., Zaki, M.J., Siebes, A., Yu, J.X., Goethals, B., Webb, G.I., Wu, X., (eds.) ICDM Workshops, pp. 454–463. IEEE Computer Society (2012)
Divoli, A., Nakov, P., Hearst, M.A.: Do peers see more in a paper than its authors? Adv. Bioinform. pp. 750214:1–750214:15 (2012). http://dblp.uni-trier.de/db/journals/abi/abi2012.html#DivoliNH12
Dunning, T.: Accurate methods for the statistics of surprise and coincidence. Comput. Linguist. 19(1), 61–74 (1993)
Elkiss, A., Shen, S., Fader, A., Erkan, G., States, D., Radev, D.: Blind men and elephants: What do citation summaries tell us about a research article? J. Am. Soc. Inf. Sci. Technol. 59(1), 51–62 (2008). doi:10.1002/asi.v59:1
Gillick, D., Favre, B.: A scalable global model for summarization. In: Proceedings of the Workshop on Integer Linear Programming for Natural Language Processing, ILP ’09, pp. 10–18. Association for Computational Linguistics, Stroudsburg, PA, USA (2009). http://dl.acm.org/citation.cfm?id=1611638.1611640
Li, L., Mao, L., Zhang, Y., Chi, J., Huang, T., Cong, X., Peng, H.: CIST system for CL-SciSumm 2016 shared task. In: Proceedings of the Joint Workshop on Bibliometric-enhanced Information Retrieval and Natural Language Processing for Digital Libraries (BIRNDL) co-located with the Joint Conference on Digital Libraries 2016 (JCDL 2016), Newark, NJ, USA, June 23, 2016., pp. 156–167 (2016). http://ceur-ws.org/Vol-1610/paper18.pdf
Lin, C.Y.: Rouge: a package for automatic evaluation of summaries. In: S.S. Marie-Francine Moens (ed.) Text Summarization Branches Out: Proceedings of the ACL-04 Workshop, pp. 74–81. Association for Computational Linguistics, Barcelona, Spain (2004)
Lin, C.Y., Cao, G., Gao, J., Nie, J.Y.: An information-theoretic approach to automatic evaluation of summaries. In: Proceedings of the Main Conference on Human Language Technology Conference of the North American Chapter of the Association of Computational Linguistics, HLT-NAACL ’06, pp. 463–470. Association for Computational Linguistics, Stroudsburg, PA, USA (2006). doi:10.3115/1220835.1220894
Lin, C.Y., Hovy, E.: The automated acquisition of topic signatures for text summarization. In: Proceedings of the 18th Conference on Computational Linguistics, pp. 495–501. Association for Computational Linguistics, Morristown, NJ, USA (2000)
McDonald, R.: A study of global inference algorithms in multi-document summarization. In: Proceedings of ECIR, pp. 557–564 (2007)
Nakov, P.I., Schwartz, A.S., Hearst, M.A.: Citances: citation sentences for semantic analysis of bioscience text. In: Proceedings of the SIGIR’04 Workshop on Search and Discovery in Bioinformatics (2004)
National Institute of Health PubMed (2014). http://www.ncbi.nlm.nih.gov/pubmed
Nishikawa, H., Hasegawa, T., Matsuo, Y., Kikui, G.: Opinion summarization with integer linear programming formulation for sentence extraction and ordering. In: Proceedings of the 23rd International Conference on Computational Linguistics: Posters, pp. 910–918. Association for Computational Linguistics (2010)
Parveen, D., Mesgar, M., Strube, M.: Generating coherent summaries of scientific articles using coherence patterns. In: Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pp. 772–783. Association for Computational Linguistics (2016). https://aclweb.org/anthology/D16-1074
Qazvinian, V., Radev, D.R.: Scientific paper summarization using citation summary networks. In: Proceedings of the 22nd International Conference on Computational Linguistics - Volume 1, COLING ’08, pp. 689–696. Association for Computational Linguistics, Stroudsburg, PA, USA (2008). http://dl.acm.org/citation.cfm?id=1599081.1599168
Rankel, P., Conroy, J., Slud, E., O’Leary, D.: Ranking human and machine summarization systems. In: Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing, pp. 467–473. Association for Computational Linguistics, Edinburgh, Scotland, UK. (2011). http://www.aclweb.org/anthology/D11-1043
Seber, G.: Multivariate observations. Wiley series in probability and statistics. Wiley-Interscience (2004). http://books.google.com/books?id=4PCa-OIL34QC
Teufel, S., Moens, M.: Summarizing scientific articles—experiments with relevance and rhetorical status. Comput. Linguist. 28, 2002 (2002)
Yates, F.: Contingency tables involving small numbers and the \(\chi ^2\) test. Supplement. J. R. Stat. Soc. 1, 217–235 (1934)

Acknowledgements

The authors would like to thank Jeff Kubina for graciously gathering the background corpus for the Biomedical literature summarization task. In addition, we are very grateful to the anonymous reviewers for their careful and critical reviews. Their suggested changes, the additional experiments using different background corpora, and the comparisons to additional baselines greatly improved this work.

Author information

Authors and Affiliations

IDA Center for Computing Sciences, 17100 Science Drive, Bowie, MD, 20715, USA
John M. Conroy & Sashka T. Davis
RSA Security, Reston, VA, USA
Sashka T. Davis

Corresponding author

Correspondence to John M. Conroy.

About this article

Cite this article

Conroy, J.M., Davis, S.T. Section mixture models for scientific document summarization. Int J Digit Libr 19, 305–322 (2018). https://doi.org/10.1007/s00799-017-0218-6

Received: 17 October 2016
Revised: 23 April 2017
Accepted: 01 May 2017
Published: 17 May 2017
Issue Date: September 2018
DOI: https://doi.org/10.1007/s00799-017-0218-6