Abstract
In this paper, we present a system for summarizing scientific and structured documents that has three components: section mixture models that estimate term weights; a hypothesis test that selects a subset of these terms; and a sentence extractor based on combinatorial optimization techniques. The section mixture model approach adapts a bigram mixture model built from the main sections of a scientific document and a collection of citing sentences (citances) from papers that reference the document. The model was adapted from earlier work on biomedical documents used in the summarization task of the 2014 Text Analysis Conference (TAC 2014). The mixture model trained on the biomedical data was also applied to the data for the Computational Linguistics scientific summarization task of the Joint Workshop on Bibliometric-enhanced Information Retrieval and Natural Language Processing for Digital Libraries (CL-SciSumm 2016). This model yields machine-generated summaries with ROUGE scores that are nearly as strong as those seen on the biomedical data, and it was also the highest-scoring submission to the task of generating a human summary. For sentence extraction, we use the OCCAMS algorithm (Davis et al., in: Vreeken, Ling, Zaki, Siebes, Yu, Goethals, Webb, Wu (eds) ICDM workshops, IEEE Computer Society, pp 454–463, 2012), which takes the sentences from the original document and the term weights computed by the language models and outputs a set of minimally overlapping sentences whose combined term coverage is maximized. Finally, we explore how important an appropriate background model is for the hypothesis test that selects terms, and hence for achieving the best-quality summaries.
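To make the last two components concrete, the following is a minimal Python sketch, not the system described in this paper. It assumes unigram counts rather than the bigram section mixture model, uses a Dunning-style log-likelihood ratio as the hypothesis test, and substitutes a plain greedy heuristic for budgeted coverage in place of OCCAMS. The function names (`llr`, `select_terms`, `greedy_cover`), the default threshold of 10.83, and the word-count budget are all illustrative assumptions.

```python
import math


def llr(k1, n1, k2, n2):
    """Log-likelihood ratio (G^2) comparing a term's rate in two corpora.

    k1 of the n1 document tokens are the term; k2 of the n2 background
    tokens are the term. Assumes n1 > 0 and n2 > 0.
    """
    def log_l(k, n, p):
        # Binomial log-likelihood without the constant combinatorial factor;
        # a zero count contributes nothing (0 * log 0 is treated as 0).
        out = 0.0
        if k > 0:
            out += k * math.log(p)
        if n - k > 0:
            out += (n - k) * math.log(1.0 - p)
        return out

    p1, p2 = k1 / n1, k2 / n2
    p = (k1 + k2) / (n1 + n2)  # pooled rate under the null hypothesis
    return 2.0 * (log_l(k1, n1, p1) + log_l(k2, n2, p2)
                  - log_l(k1, n1, p) - log_l(k2, n2, p))


def select_terms(doc_counts, bg_counts, threshold=10.83):
    """Keep terms over-represented in the document relative to the background.

    The threshold is an assumed cutoff (roughly p < 0.001 for a chi-squared
    variable with one degree of freedom), not a value taken from the paper.
    """
    n1, n2 = sum(doc_counts.values()), sum(bg_counts.values())
    keep = set()
    for term, k1 in doc_counts.items():
        k2 = bg_counts.get(term, 0)
        if k1 / n1 > k2 / n2 and llr(k1, n1, k2, n2) > threshold:
            keep.add(term)
    return keep


def greedy_cover(sentences, weights, budget):
    """Greedy stand-in for the sentence extractor (not OCCAMS itself).

    sentences: list of token lists; weights: dict mapping term to weight;
    budget: maximum total number of tokens in the summary.
    Returns the indices of the chosen sentences in selection order.
    """
    chosen, covered, used = [], set(), 0
    while True:
        best, best_gain = None, 0.0
        for i, tokens in enumerate(sentences):
            if i in chosen or not tokens or used + len(tokens) > budget:
                continue
            # Weight of not-yet-covered terms, normalized by sentence length.
            new_terms = set(tokens) - covered
            gain = sum(weights.get(t, 0.0) for t in new_terms) / len(tokens)
            if gain > best_gain:
                best, best_gain = i, gain
        if best is None:
            break
        chosen.append(best)
        covered |= set(sentences[best])
        used += len(sentences[best])
    return chosen
```

Normalizing the gain by sentence length favors short sentences that add uncovered weight, which discourages overlap between selected sentences; OCCAMS addresses the same coverage objective with stronger combinatorial machinery than this greedy loop.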
Notes
An excellent overview of the paper by Ted Dunning can be accessed at http://tdunning.blogspot.com/2008/03/surprise-and-coincidence.html.
Figure derived from [2] with permission of author.
References
Abu-Jbara, A., Radev, D.: Coherent citation-based summarization of scientific papers. In: Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pp. 500–509. Association for Computational Linguistics, Portland, Oregon, USA (2011). http://www.aclweb.org/anthology/P11-1051
Cabanac, G., Chandrasekaran, M.K., Frommholz, I., Jaidka, K., Kan, M.Y., Mayr, P., Wolfram, D.: Joint workshop on bibliometric-enhanced information retrieval and natural language processing for digital libraries (BIRNDL 2016). In: Proceedings of the 16th ACM/IEEE-CS on Joint Conference on Digital Libraries, JCDL ’16, pp. 299–300. ACM, New York, NY, USA (2016). doi:10.1145/2910896.2926734
Cohan, A., Goharian, N.: Scientific article summarization using citation-context and article’s discourse structure. In: Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pp. 390–400. Association for Computational Linguistics (2015). http://aclweb.org/anthology/D15-1045
Cohan, A., Soldaini, L., Goharian, N.: Matching citation text and cited spans in biomedical literature: a search-oriented approach. In: Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 1042–1048. Association for Computational Linguistics (2015). http://www.aclweb.org/anthology/N15-1110
Conroy, J.M., Davis, S., Kubina, J., Liu, Y.K., O’Leary, D.P., Schlesinger, J.D.: Multilingual summarization: dimensionality reduction and a step towards optimal term coverage. In: ACL, MultiLing Workshop, pp. 454–463 (2013)
Conroy, J.M., Davis, S.T.: Vector space and language models for scientific document summarization. In: Proceedings of NAACL-HLT, pp. 186–191 (2015)
Davis, S.T., Conroy, J.M., Schlesinger, J.D.: OCCAMS—an optimal combinatorial covering algorithm for multi-document summarization. In: Vreeken, J., Ling, C., Zaki, M.J., Siebes, A., Yu, J.X., Goethals, B., Webb, G.I., Wu, X., (eds.) ICDM Workshops, pp. 454–463. IEEE Computer Society (2012)
Divoli, A., Nakov, P., Hearst, M.A.: Do peers see more in a paper than its authors? Adv. Bioinform. pp. 750214:1–750214:15 (2012). http://dblp.uni-trier.de/db/journals/abi/abi2012.html#DivoliNH12
Dunning, T.: Accurate methods for the statistics of surprise and coincidence. Comput. Linguist. 19(1), 61–74 (1993)
Elkiss, A., Shen, S., Fader, A., Erkan, G., States, D., Radev, D.: Blind men and elephants: What do citation summaries tell us about a research article? J. Am. Soc. Inf. Sci. Technol. 59(1), 51–62 (2008). doi:10.1002/asi.v59:1
Gillick, D., Favre, B.: A scalable global model for summarization. In: Proceedings of the Workshop on Integer Linear Programming for Natural Language Processing, ILP ’09, pp. 10–18. Association for Computational Linguistics, Stroudsburg, PA, USA (2009). http://dl.acm.org/citation.cfm?id=1611638.1611640
Li, L., Mao, L., Zhang, Y., Chi, J., Huang, T., Cong, X., Peng, H.: CIST system for CL-SciSumm 2016 shared task. In: Proceedings of the Joint Workshop on Bibliometric-enhanced Information Retrieval and Natural Language Processing for Digital Libraries (BIRNDL) co-located with the Joint Conference on Digital Libraries 2016 (JCDL 2016), Newark, NJ, USA, June 23, 2016, pp. 156–167 (2016). http://ceur-ws.org/Vol-1610/paper18.pdf
Lin, C.Y.: ROUGE: a package for automatic evaluation of summaries. In: Moens, M.F., Szpakowicz, S. (eds.) Text Summarization Branches Out: Proceedings of the ACL-04 Workshop, pp. 74–81. Association for Computational Linguistics, Barcelona, Spain (2004)
Lin, C.Y., Cao, G., Gao, J., Nie, J.Y.: An information-theoretic approach to automatic evaluation of summaries. In: Proceedings of the Main Conference on Human Language Technology Conference of the North American Chapter of the Association of Computational Linguistics, HLT-NAACL ’06, pp. 463–470. Association for Computational Linguistics, Stroudsburg, PA, USA (2006). doi:10.3115/1220835.1220894
Lin, C.Y., Hovy, E.: The automated acquisition of topic signatures for text summarization. In: Proceedings of the 18th Conference on Computational Linguistics, pp. 495–501. Association for Computational Linguistics, Morristown, NJ, USA (2000)
McDonald, R.: A study of global inference algorithms in multi-document summarization. In: Proceedings of ECIR, pp. 557–564 (2007)
Nakov, P.I., Schwartz, A.S., Hearst, M.A.: Citances: citation sentences for semantic analysis of bioscience text. In: Proceedings of the SIGIR’04 Workshop on Search and Discovery in Bioinformatics (2004)
National Institutes of Health: PubMed (2014). http://www.ncbi.nlm.nih.gov/pubmed
Nishikawa, H., Hasegawa, T., Matsuo, Y., Kikui, G.: Opinion summarization with integer linear programming formulation for sentence extraction and ordering. In: Proceedings of the 23rd International Conference on Computational Linguistics: Posters, pp. 910–918. Association for Computational Linguistics (2010)
Parveen, D., Mesgar, M., Strube, M.: Generating coherent summaries of scientific articles using coherence patterns. In: Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pp. 772–783. Association for Computational Linguistics (2016). https://aclweb.org/anthology/D16-1074
Qazvinian, V., Radev, D.R.: Scientific paper summarization using citation summary networks. In: Proceedings of the 22nd International Conference on Computational Linguistics - Volume 1, COLING ’08, pp. 689–696. Association for Computational Linguistics, Stroudsburg, PA, USA (2008). http://dl.acm.org/citation.cfm?id=1599081.1599168
Rankel, P., Conroy, J., Slud, E., O’Leary, D.: Ranking human and machine summarization systems. In: Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing, pp. 467–473. Association for Computational Linguistics, Edinburgh, Scotland, UK. (2011). http://www.aclweb.org/anthology/D11-1043
Seber, G.: Multivariate observations. Wiley series in probability and statistics. Wiley-Interscience (2004). http://books.google.com/books?id=4PCa-OIL34QC
Teufel, S., Moens, M.: Summarizing scientific articles: experiments with relevance and rhetorical status. Comput. Linguist. 28(4), 409–445 (2002)
Yates, F.: Contingency tables involving small numbers and the \(\chi ^2\) test. Supplement. J. R. Stat. Soc. 1, 217–235 (1934)
Acknowledgements
The authors would like to thank Jeff Kubina for graciously gathering the background corpus for the biomedical literature summarization task. In addition, we sincerely thank the anonymous reviewers for their careful and critical reviews. Their suggested changes, along with the additional experiments using different background corpora and the comparison to additional baselines, greatly improved this work.
Cite this article
Conroy, J.M., Davis, S.T. Section mixture models for scientific document summarization. Int J Digit Libr 19, 305–322 (2018). https://doi.org/10.1007/s00799-017-0218-6
DOI: https://doi.org/10.1007/s00799-017-0218-6