Abstract
Due to the exponential growth in the number of documents on the Web, accessing the salient information relevant to a user's need is gaining importance, which has increased the popularity of text summarization. Recent progress in deep learning has shifted research in text summarization from extractive methods towards more abstractive approaches. However, research and resources remain mostly limited to English, which hinders progress in other languages; low-resourced languages in particular lack the large-scale resources such tasks require. In this study, we release two large-scale datasets (TR-News and HU-News) that can serve as benchmarks for abstractive summarization in Turkish and Hungarian. The datasets are compiled primarily for text summarization but are also suitable for other tasks such as topic classification, title generation, and key phrase extraction. Morphology is important for these agglutinative languages, since meaning is carried mostly within the morphemes of words. We exploit these morphological properties during tokenization to retain semantic information and to reduce the vocabulary sparsity introduced by the agglutinative nature of these languages. Using the compiled datasets, we propose linguistically oriented tokenization methods (SeperateSuffix and CombinedSuffix) and evaluate them with state-of-the-art abstractive summarization models. The SeperateSuffix method achieves the highest ROUGE-1 score on the TR-News dataset and yields promising results on the HU-News dataset. In a further experiment, we show that the multilingual cased BERT model outperforms monolingual BERT models for both languages and reaches the highest ROUGE-1 score on the HU-News dataset. Lastly, we provide a qualitative analysis of the summaries generated on the TR-News dataset.
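The idea of suffix-aware tokenization described above can be illustrated with a minimal sketch. This is not the paper's implementation: it assumes that a morphological segmentation (stem plus suffixes) has already been produced by an analyzer such as Zemberek for Turkish, and the exact behavior of SeperateSuffix and CombinedSuffix is inferred from the abstract. The `##` boundary marker is an assumption borrowed from WordPiece-style conventions, chosen only for illustration.

```python
MARKER = "##"  # assumed suffix-boundary marker (WordPiece-style convention)

def separate_suffix(stem: str, suffixes: list[str]) -> list[str]:
    """Emit the stem, then each suffix as its own marked token.

    Keeping every morpheme separate preserves the semantic role of each
    suffix and keeps the vocabulary small, since suffixes are reused
    across many stems in an agglutinative language.
    """
    return [stem] + [MARKER + s for s in suffixes]

def combined_suffix(stem: str, suffixes: list[str]) -> list[str]:
    """Emit the stem, then all suffixes merged into one marked token."""
    return [stem] + ([MARKER + "".join(suffixes)] if suffixes else [])

# Turkish "evlerimizden" ("from our houses") segments roughly as
# ev (house) + ler (plural) + imiz (our) + den (from):
print(separate_suffix("ev", ["ler", "imiz", "den"]))
# ['ev', '##ler', '##imiz', '##den']
print(combined_suffix("ev", ["ler", "imiz", "den"]))
# ['ev', '##lerimizden']
```

A summarization model trained on such tokens sees the same plural or case suffix across thousands of words, rather than a separate vocabulary entry for every fully inflected surface form.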




Baykara, B., & Güngör, T. (2022). Abstractive text summarization and new large-scale datasets for agglutinative languages Turkish and Hungarian. Language Resources & Evaluation, 56, 973–1007. https://doi.org/10.1007/s10579-021-09568-y