Abstract
Automatic document summarization is a widely studied field that aims to generate brief, informative summaries of long documents. In this paper, we propose a hybrid approach to automatic document summarization that combines a Transformer model with sentence grouping. The Transformer model was trained on the BBC News dataset, which we first preprocessed by correcting logical and spelling errors in the original full-text and summary document pairs. The model's hyper-parameters were determined through experimentation. In the testing stage, each document was decomposed into sentences, and the similarity of each sentence to every other sentence was computed using the Simhash text similarity algorithm. The most similar sentences were grouped together, with the number of groups set to 25% of the total number of sentences in the document. Each group was then fed into the Transformer model, which produced a new abstractive sentence for it. The groups were ordered by the average position of their sentences in the original document, and the generated sentences were combined in that order to form the summary. Experimental results showed that the proposed approach achieved an average Simhash similarity of 93.2% to the original full-text documents and, on average, 5% higher similarity to the original summary documents. These results demonstrate the effectiveness of the proposed approach for automatic document summarization.
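The grouping stage described above can be sketched in Python. This is a minimal illustration, not the authors' implementation: the `simhash` fingerprint here uses MD5-hashed unigram features, the greedy pairwise merging strategy is an assumption (the paper does not specify the clustering procedure), and sentence splitting is left to the caller. Only the 25% group ratio, the Simhash similarity measure, and the average-position ordering are taken from the text.

```python
import hashlib

def simhash(text, bits=64):
    # Standard Simhash: sum signed bits of per-token hashes, then
    # take the sign of each position to form the fingerprint.
    v = [0] * bits
    for token in text.lower().split():
        h = int(hashlib.md5(token.encode()).hexdigest(), 16)
        for i in range(bits):
            v[i] += 1 if (h >> i) & 1 else -1
    return sum(1 << i for i in range(bits) if v[i] > 0)

def similarity(a, b, bits=64):
    # Similarity = 1 - normalized Hamming distance between fingerprints.
    return 1 - bin(a ^ b).count("1") / bits

def group_sentences(sentences, ratio=0.25):
    """Greedily merge the most Simhash-similar sentence groups until
    the group count is ~ratio of the sentence count, then order the
    groups by the average original position of their sentences."""
    target = max(1, round(len(sentences) * ratio))
    # Each group: (sentence indices, fingerprint of the joined text).
    groups = [([i], simhash(s)) for i, s in enumerate(sentences)]
    while len(groups) > target:
        best = None
        for a in range(len(groups)):
            for b in range(a + 1, len(groups)):
                sim = similarity(groups[a][1], groups[b][1])
                if best is None or sim > best[0]:
                    best = (sim, a, b)
        _, a, b = best
        idx = groups[a][0] + groups[b][0]
        merged_text = " ".join(sentences[i] for i in idx)
        groups[a] = (idx, simhash(merged_text))
        del groups[b]
    groups.sort(key=lambda g: sum(g[0]) / len(g[0]))
    return [[sentences[i] for i in g[0]] for g in groups]
```

Each returned group would then be joined into a single passage and passed to the trained Transformer to generate one abstractive sentence per group.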



Data availability
Not applicable.
Code availability
The code will be available through a GitHub repository.
Acknowledgements
Not applicable.
Funding
Not applicable.
Author information
Authors and Affiliations
Contributions
Ahmet Toprak contributed to conceptualization, data collection, formal analysis, software development, validation, visualization, writing – original draft, and writing – review and editing. Metin Turan contributed to conceptualization, data collection, formal analysis, software development, validation, visualization, writing – original draft, and writing – review and editing.
Corresponding author
Ethics declarations
Conflict of interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Ethical approval and consent to participate
Not applicable.
Human and animal rights
Not applicable.
Consent for publication
Yes, all authors consent to the publication of this paper.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Toprak, A., Turan, M. Enhanced automatic abstractive document summarization using transformers and sentence grouping. J Supercomput 81, 557 (2025). https://doi.org/10.1007/s11227-025-07048-6