Multi-view multi-objective clustering-based framework for scientific document summarization using citation context

Applied Intelligence

Abstract

Owing to the rapid growth of scientific publications, summarizing scientific documents has become necessary to help researchers keep track of recent developments. In this paper, we formulate scientific document summarization as a multi-view clustering (MVC) problem. Two views of a scientific document, semantic and syntactic, are considered jointly in the MVC framework. To obtain an improved partitioning across the views, a differential evolution algorithm is used as the underlying optimization strategy. After the final optimal partitioning is obtained, various sentence-level features, such as sentence length and position, are used to score the sentences, and the highest-scoring sentences are extracted to form the summary. We also investigate (a) two embedding spaces for representing the sentences of a document in the semantic space and (b) citation contexts, to account for their importance in summarizing a given scientific document. The proposed multi-view clustering approach is purely unsupervised and does not use any labelled data to generate the summary. To demonstrate the potential of multi-view clustering, a single-view clustering framework is also developed for comparison. The multi-view system is first compared with the single-view system on a random selection of articles from the recently released SciSummNet 2019 dataset using the ROUGE measure. Two further datasets, CL-SciSumm 2016 and CL-SciSumm 2017, are then considered for evaluation. Our experiments show that the proposed approach yields better results than existing supervised and unsupervised methods, including a transformer-based deep learning model, with statistically significant improvements.
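The final extraction step described in the abstract (scoring sentences with features such as length and position once a partitioning is available, then keeping the high-scoring ones) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `score_sentence` and `extract_summary`, the two feature formulas, the equal weights, and the choice of one sentence per cluster are all assumptions made for illustration, and the cluster labels are taken as given rather than produced by differential evolution.

```python
def score_sentence(index, tokens, n_sentences, max_len):
    """Toy score from two assumed features: relative length and position."""
    length_score = len(tokens) / max_len        # 1.0 for the longest sentence
    position_score = 1.0 - index / n_sentences  # 1.0 for the first sentence
    # Equal weighting is an assumption, not taken from the paper.
    return 0.5 * length_score + 0.5 * position_score


def extract_summary(sentences, labels):
    """Keep the highest-scoring sentence of each cluster, in document order."""
    if not sentences:
        return []
    tokenized = [s.split() for s in sentences]
    max_len = max(len(t) for t in tokenized) or 1  # avoid division by zero
    best = {}  # cluster label -> (score, sentence index)
    for i, (tokens, label) in enumerate(zip(tokenized, labels)):
        score = score_sentence(i, tokens, len(sentences), max_len)
        if label not in best or score > best[label][0]:
            best[label] = (score, i)
    chosen = sorted(i for _, i in best.values())
    return [sentences[i] for i in chosen]


# Hypothetical input: four sentences already grouped into two clusters.
sentences = [
    "We propose a multi-view clustering framework for summarization.",
    "It works.",
    "Citation contexts are also used as input.",
    "Results improve.",
]
labels = [0, 0, 1, 1]
summary = extract_summary(sentences, labels)
```

With these inputs the sketch selects the first and third sentences, since each is both the longest and the earliest sentence in its cluster. A real system would replace the hand-written labels with the partitioning found by the clustering stage and would likely use a richer feature set.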

Notes

  1. http://www.nist.gov/tac/2014

  2. https://ornlcda.github.io/SDProc/

  3. http://www.ntu.edu.sg/home/epnsugan/index_files/cec-benchmarking.htm

  4. https://fasttext.cc/docs/en/english-vectors.html

  5. https://github.com/WING-NUS/scisumm-corpus/tree/master/data

  6. https://fasttext.cc/docs/en/english-vectors.html

Author information

Corresponding authors

Correspondence to Naveen Saini or Saichethan Miriyala Reddy.

Ethics declarations

Ethics approval

This article does not contain any studies with human participants or animals performed by any of the authors.

Conflict of interest

The authors declare that they have no conflict of interest.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Saini, N., Reddy, S.M., Saha, S. et al. Multi-view multi-objective clustering-based framework for scientific document summarization using citation context. Appl Intell 53, 18002–18026 (2023). https://doi.org/10.1007/s10489-022-04166-z

  • DOI: https://doi.org/10.1007/s10489-022-04166-z
