Abstract
Due to the exponential growth in the number of documents on the Web, accessing the salient information relevant to a user's need is gaining importance, which has increased the popularity of text summarization. Recent progress in deep learning has shifted research in text summarization from extractive methods towards more abstractive approaches. However, research and resources remain mostly limited to English, which hinders progress in other languages; low-resourced languages in particular lack the large-scale resources such tasks require. In this study, we release two large-scale datasets (TR-News and HU-News) that can serve as benchmarks for abstractive summarization in Turkish and Hungarian. The datasets are compiled primarily for text summarization but are also suitable for other tasks such as topic classification, title generation, and key phrase extraction. Morphology is important for these agglutinative languages, since meaning is carried mostly within the morphemes of words. We exploit these morphological properties during tokenization to retain semantic information and to reduce the vocabulary sparsity introduced by the agglutinative nature of these languages. Using the compiled datasets, we propose linguistically oriented tokenization methods (SeperateSuffix and CombinedSuffix) and evaluate them with state-of-the-art abstractive summarization models. The SeperateSuffix method achieves the highest ROUGE-1 score on the TR-News dataset and yields promising results on the HU-News dataset. In a further experiment, we show that the multilingual cased BERT model outperforms monolingual BERT models for both languages and reaches the highest ROUGE-1 score on the HU-News dataset. Lastly, we provide a qualitative analysis of the summaries generated on the TR-News dataset.
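The idea of suffix-aware tokenization described above can be illustrated with a minimal sketch. This is not the paper's implementation: it assumes that a morphological segmentation (stem plus suffixes) has already been produced by an analyzer such as Zemberek for Turkish, and the exact behavior of SeperateSuffix and CombinedSuffix is inferred from the abstract. The `##` boundary marker is an assumption borrowed from WordPiece-style conventions, chosen only for illustration.

```python
MARKER = "##"  # assumed suffix-boundary marker (WordPiece-style convention)

def separate_suffix(stem: str, suffixes: list[str]) -> list[str]:
    """Emit the stem, then each suffix as its own marked token.

    Keeping every morpheme separate preserves the semantic role of each
    suffix and keeps the vocabulary small, since suffixes are reused
    across many stems in an agglutinative language.
    """
    return [stem] + [MARKER + s for s in suffixes]

def combined_suffix(stem: str, suffixes: list[str]) -> list[str]:
    """Emit the stem, then all suffixes merged into one marked token."""
    return [stem] + ([MARKER + "".join(suffixes)] if suffixes else [])

# Turkish "evlerimizden" ("from our houses") segments roughly as
# ev (house) + ler (plural) + imiz (our) + den (from):
print(separate_suffix("ev", ["ler", "imiz", "den"]))
# ['ev', '##ler', '##imiz', '##den']
print(combined_suffix("ev", ["ler", "imiz", "den"]))
# ['ev', '##lerimizden']
```

A summarization model trained on such tokens sees the same plural or case suffix across thousands of words, rather than a separate vocabulary entry for every fully inflected surface form.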




Baykara, B., & Güngör, T. (2022). Abstractive text summarization and new large-scale datasets for agglutinative languages Turkish and Hungarian. Language Resources & Evaluation, 56, 973–1007. https://doi.org/10.1007/s10579-021-09568-y