Abstract
Text summarization remains a significant challenge that has to be addressed separately for each language, because a language’s unique characteristics reflect its country’s identity, culture, and nuances. This paper introduces extractive text summarization models for Vietnamese documents. Our approach concentrates on finding suitable and plausible models by combining machine learning algorithms. Specifically, we investigate three candidate models: “G-global-hard-cluster” (with GloVe), “probability-cluster” (with latent Dirichlet allocation, LDA), and “soft-specific”, a combination of stochastic gradient descent (SGD) and k-means. We also report experimental results that evaluate summary quality and running time. In particular, our approaches obtain the expected results with \(51.49\%\) ROUGE-1, \(17.99\%\) ROUGE-2, and \(29.25\%\) ROUGE-L. Finally, we discuss the promising results of the proposed models.
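For readers unfamiliar with the ROUGE-1 and ROUGE-2 scores quoted above, the following is a minimal sketch of how an n-gram overlap score of this kind can be computed. It is not the authors' evaluation code: the function names `ngrams` and `rouge_n` are hypothetical, tokens are obtained by simple whitespace splitting (Vietnamese text would normally need a proper word segmenter), and the abstract does not say whether the reported figures are recall, precision, or F1, so all three are returned.

```python
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams of a token list (hypothetical helper)."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n(candidate, reference, n=1):
    """Illustrative ROUGE-N between a candidate and a reference summary.

    Assumption: lowercased whitespace tokenization, a single reference,
    and clipped n-gram counts (each reference n-gram is matched at most
    as often as it occurs in the reference).
    """
    cand = ngrams(candidate.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)

    # Counter intersection keeps the minimum count of each shared n-gram.
    overlap = sum((cand & ref).values())
    cand_total = sum(cand.values())
    ref_total = sum(ref.values())

    precision = overlap / cand_total if cand_total else 0.0
    recall = overlap / ref_total if ref_total else 0.0
    if precision + recall == 0:
        f1 = 0.0
    else:
        f1 = 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "f1": f1}


if __name__ == "__main__":
    candidate = "the cat sat on the mat"
    reference = "the cat lay on the mat"
    print(rouge_n(candidate, reference, n=1))
    print(rouge_n(candidate, reference, n=2))
```

ROUGE-L, the third score in the abstract, is based on the longest common subsequence rather than fixed-length n-grams and is not covered by this sketch.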
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
About this paper
Cite this paper
Nguyen, TH., Ma, T., Do, TN. (2023). LAVETTES: Large-scAle-dataset Vietnamese ExTractive TExt Summarization Models. In: Dang, T.K., Küng, J., Chung, T.M. (eds) Future Data and Security Engineering. Big Data, Security and Privacy, Smart City and Industry 4.0 Applications. FDSE 2023. Communications in Computer and Information Science, vol 1925. Springer, Singapore. https://doi.org/10.1007/978-981-99-8296-7_19
DOI: https://doi.org/10.1007/978-981-99-8296-7_19
Publisher Name: Springer, Singapore
Print ISBN: 978-981-99-8295-0
Online ISBN: 978-981-99-8296-7
eBook Packages: Computer Science, Computer Science (R0)