Section mixture models for scientific document summarization

International Journal on Digital Libraries

Abstract

In this paper, we present a system for summarizing scientific and structured documents. The system has three components: section mixture models that estimate term weights; a hypothesis test that selects a subset of these terms; and a sentence extractor based on combinatorial optimization techniques. The section mixture model approach adapts a bigram mixture model to the main sections of a scientific document and to a collection of citing sentences (citances) from papers that reference the document. The model was adapted from earlier work on biomedical documents used in the summarization task of the 2014 Text Analysis Conference (TAC 2014). The mixture model trained on the biomedical data was also applied to the data for the Computational Linguistics scientific summarization task of the Joint Workshop on Bibliometric-enhanced Information Retrieval and Natural Language Processing for Digital Libraries (CL-SciSumm 2016). This model produces machine-generated summaries with ROUGE scores nearly as strong as those obtained on the biomedical data, and it was also the highest scoring submission to the task of generating a human summary. For sentence extraction, we use the OCCAMS algorithm (Davis et al., in: Vreeken, Ling, Zaki, Siebes, Yu, Goethals, Webb, Wu (eds) ICDM workshops, IEEE Computer Society, pp 454–463, 2012), which takes the sentences of the original document and the term weights computed by the language models and outputs a set of minimally overlapping sentences whose combined term coverage is maximized. Finally, we explore how important an appropriate background model is for the term-selection hypothesis test in achieving the best quality summaries.
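
To make the sentence extraction step concrete, the following is a minimal sketch of weighted term coverage under a length budget. It is not the OCCAMS algorithm itself: it substitutes a plain greedy heuristic that repeatedly picks the sentence with the highest weight of newly covered bigrams per word. The function names `bigrams` and `greedy_cover`, the whitespace tokenization, and the word-count budget are all assumptions made for illustration.

```python
from typing import Dict, List, Set


def bigrams(sentence: str) -> Set[str]:
    """Return the set of lowercase word bigrams in a sentence (whitespace tokens)."""
    words = sentence.lower().split()
    return {f"{a} {b}" for a, b in zip(words, words[1:])}


def greedy_cover(sentences: List[str], weights: Dict[str, float], budget: int) -> List[int]:
    """Greedy stand-in for a coverage-based extractor (hypothetical helper).

    Repeatedly selects the sentence whose not-yet-covered terms carry the most
    weight per word, skipping sentences that would exceed the word budget.
    Returns the chosen sentence indices in selection order.
    """
    covered: Set[str] = set()
    chosen: List[int] = []
    used = 0
    while True:
        best, best_score = None, 0.0
        for i, s in enumerate(sentences):
            length = len(s.split())
            if i in chosen or length == 0 or used + length > budget:
                continue
            # Only terms not covered by earlier picks contribute, which
            # discourages selecting overlapping sentences.
            gain = sum(weights.get(t, 0.0) for t in bigrams(s) - covered)
            score = gain / length
            if score > best_score:
                best, best_score = i, score
        if best is None:
            return chosen
        chosen.append(best)
        covered |= bigrams(sentences[best])
        used += len(sentences[best].split())
```

Because a sentence scores only on terms that earlier picks have not covered, a near-duplicate of an already selected sentence contributes nothing and is passed over, which loosely mirrors the "minimally overlapping" property described above. In the system described here the weights would come from the section mixture models after term selection; in this sketch they are simply a dictionary supplied by the caller.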

Notes

  1. See https://tac.nist.gov/2014/BiomedSumm/guidelines.html for detailed information.

  2. https://catalog.ldc.upenn.edu.

  3. An excellent overview of the paper by Ted Dunning can be accessed at http://tdunning.blogspot.com/2008/03/surprise-and-coincidence.html.

  4. https://www.ldc.upenn.edu.

  5. Figure derived from [2] with permission of author.

  6. http://www.rxnlp.com/rouge-2-0.

  7. https://github.com/chbrown/acl-anthology-network.

References

  1. Abu-Jbara, A., Radev, D.: Coherent citation-based summarization of scientific papers. In: Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pp. 500–509. Association for Computational Linguistics, Portland, Oregon, USA (2011). http://www.aclweb.org/anthology/P11-1051

  2. Cabanac, G., Chandrasekaran, M.K., Frommholz, I., Jaidka, K., Kan, M.Y., Mayr, P., Wolfram, D.: Joint workshop on bibliometric-enhanced information retrieval and natural language processing for digital libraries (birndl 2016). In: Proceedings of the 16th ACM/IEEE-CS on Joint Conference on Digital Libraries, JCDL ’16, pp. 299–300. ACM, New York, NY, USA (2016). doi:10.1145/2910896.2926734

  3. Cohan, A., Goharian, N.: Scientific article summarization using citation-context and article’s discourse structure. In: Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pp. 390–400. Association for Computational Linguistics (2015). http://aclweb.org/anthology/D15-1045

  4. Cohan, A., Soldaini, L., Goharian, N.: Matching citation text and cited spans in biomedical literature: a search-oriented approach. In: Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 1042–1048. Association for Computational Linguistics (2015). http://www.aclweb.org/anthology/N15-1110

  5. Conroy, J.M., Davis, S., Kubina, J., Liu, Y.K., O’Leary, D.P., Schlesinger, J.D.: Multilingual summarization: dimensionality reduction and a step towards optimal term coverage. In: ACL, MultiLing Workshop, pp. 454–463 (2013)

  6. Conroy, J.M., Davis, S.T.: Vector space and language models for scientific document summarization. In: Proceedings of NAACL-HLT, pp. 186–191 (2015)

  7. Davis, S.T., Conroy, J.M., Schlesinger, J.D.: OCCAMS: an optimal combinatorial covering algorithm for multi-document summarization. In: Vreeken, J., Ling, C., Zaki, M.J., Siebes, A., Yu, J.X., Goethals, B., Webb, G.I., Wu, X. (eds.) ICDM Workshops, pp. 454–463. IEEE Computer Society (2012)

  8. Divoli, A., Nakov, P., Hearst, M.A.: Do peers see more in a paper than its authors? Adv. Bioinform. pp. 750214:1–750214:15 (2012). http://dblp.uni-trier.de/db/journals/abi/abi2012.html#DivoliNH12

  9. Dunning, T.: Accurate methods for the statistics of surprise and coincidence. Comput. Linguist. 19(1), 61–74 (1993)

  10. Elkiss, A., Shen, S., Fader, A., Erkan, G., States, D., Radev, D.: Blind men and elephants: What do citation summaries tell us about a research article? J. Am. Soc. Inf. Sci. Technol. 59(1), 51–62 (2008). doi:10.1002/asi.v59:1

  11. Gillick, D., Favre, B.: A scalable global model for summarization. In: Proceedings of the Workshop on Integer Linear Programming for Natural Language Processing, ILP ’09, pp. 10–18. Association for Computational Linguistics, Stroudsburg, PA, USA (2009). http://dl.acm.org/citation.cfm?id=1611638.1611640

  12. Li, L., Mao, L., Zhang, Y., Chi, J., Huang, T., Cong, X., Peng, H.: CIST system for CL-SciSumm 2016 shared task. In: Proceedings of the Joint Workshop on Bibliometric-enhanced Information Retrieval and Natural Language Processing for Digital Libraries (BIRNDL) co-located with the Joint Conference on Digital Libraries 2016 (JCDL 2016), Newark, NJ, USA, June 23, 2016, pp. 156–167 (2016). http://ceur-ws.org/Vol-1610/paper18.pdf

  13. Lin, C.Y.: Rouge: a package for automatic evaluation of summaries. In: Moens, M.F., Szpakowicz, S. (eds.) Text Summarization Branches Out: Proceedings of the ACL-04 Workshop, pp. 74–81. Association for Computational Linguistics, Barcelona, Spain (2004)

  14. Lin, C.Y., Cao, G., Gao, J., Nie, J.Y.: An information-theoretic approach to automatic evaluation of summaries. In: Proceedings of the Main Conference on Human Language Technology Conference of the North American Chapter of the Association of Computational Linguistics, HLT-NAACL ’06, pp. 463–470. Association for Computational Linguistics, Stroudsburg, PA, USA (2006). doi:10.3115/1220835.1220894

  15. Lin, C.Y., Hovy, E.: The automated acquisition of topic signatures for text summarization. In: Proceedings of the 18th Conference on Computational Linguistics, pp. 495–501. Association for Computational Linguistics, Morristown, NJ, USA (2000)

  16. McDonald, R.: A study of global inference algorithms in multi-document summarization. In: Proceedings of ECIR, pp. 557–564 (2007)

  17. Nakov, P.I., Schwartz, A.S., Hearst, M.A.: Citances: citation sentences for semantic analysis of bioscience text. In: Proceedings of the SIGIR’04 Workshop on Search and Discovery in Bioinformatics (2004)

  18. National Institute of Health PubMed (2014). http://www.ncbi.nlm.nih.gov/pubmed

  19. Nishikawa, H., Hasegawa, T., Matsuo, Y., Kikui, G.: Opinion summarization with integer linear programming formulation for sentence extraction and ordering. In: Proceedings of the 23rd International Conference on Computational Linguistics: Posters, pp. 910–918. Association for Computational Linguistics (2010)

  20. Parveen, D., Mesgar, M., Strube, M.: Generating coherent summaries of scientific articles using coherence patterns. In: Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pp. 772–783. Association for Computational Linguistics (2016). https://aclweb.org/anthology/D16-1074

  21. Qazvinian, V., Radev, D.R.: Scientific paper summarization using citation summary networks. In: Proceedings of the 22nd International Conference on Computational Linguistics - Volume 1, COLING ’08, pp. 689–696. Association for Computational Linguistics, Stroudsburg, PA, USA (2008). http://dl.acm.org/citation.cfm?id=1599081.1599168

  22. Rankel, P., Conroy, J., Slud, E., O’Leary, D.: Ranking human and machine summarization systems. In: Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing, pp. 467–473. Association for Computational Linguistics, Edinburgh, Scotland, UK (2011). http://www.aclweb.org/anthology/D11-1043

  23. Seber, G.: Multivariate observations. Wiley series in probability and statistics. Wiley-Interscience (2004). http://books.google.com/books?id=4PCa-OIL34QC

  24. Teufel, S., Moens, M.: Summarizing scientific articles: experiments with relevance and rhetorical status. Comput. Linguist. 28(4), 409–445 (2002)

  25. Yates, F.: Contingency tables involving small numbers and the \(\chi ^2\) test. Supplement. J. R. Stat. Soc. 1, 217–235 (1934)

Acknowledgements

The authors would like to thank Jeff Kubina for graciously gathering the background corpus for the Biomedical literature summarization task. We also thank the anonymous reviewers for their careful and critical reviews. Their suggested changes, together with the additional experiments using different background corpora and the comparison to additional baselines, greatly improved this work.

Author information

Corresponding author

Correspondence to John M. Conroy.

Additional information

Section 4 is based on [6]. Copyright ©2015, Association for Computational Linguistics. All rights reserved. Reprinted by permission of ACL and the authors.

About this article

Cite this article

Conroy, J.M., Davis, S.T. Section mixture models for scientific document summarization. Int J Digit Libr 19, 305–322 (2018). https://doi.org/10.1007/s00799-017-0218-6

  • DOI: https://doi.org/10.1007/s00799-017-0218-6
