LAVETTES: Large-scAle-dataset Vietnamese ExTractive TExt Summarization Models

Nguyen, Ti-Hon; Ma, Thanh; Do, Thanh-Nghi

doi:10.1007/978-981-99-8296-7_19

Part of the book series: Communications in Computer and Information Science (CCIS, volume 1925)

Included in the following conference series: International Conference on Future Data and Security Engineering

Abstract

Text summarization has consistently been a significant challenge for any particular language. Each language has unique characteristics that reflect its country's identity, culture, and nuances. This paper introduces extractive text summarization models for Vietnamese documents. Our approach focuses on finding effective and plausible models by combining machine learning (ML) algorithms. Specifically, we investigate three candidate models: a “G-global-hard-cluster” model (with GloVe), a “probability-cluster” model (with LDA, Latent Dirichlet Allocation), and a “soft-specific” model that combines SGD (stochastic gradient descent) with k-means. We also report experimental results that evaluate summary quality and running time. In particular, our approaches achieve 51.49% ROUGE-1, 17.99% ROUGE-2, and 29.25% ROUGE-L. Finally, we discuss the promising results of the proposed models.
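
The ROUGE scores quoted in the abstract measure n-gram and longest-common-subsequence overlap between a generated summary and a reference summary. The following is a minimal sketch of how such scores can be computed, not the authors' evaluation code. It assumes whitespace tokenization (which splits Vietnamese text into syllables rather than words) and reports the F1 variant; the abstract does not state which tokenizer or which variant (recall, precision, or F1) was used. The function names `rouge_n` and `rouge_l` are illustrative.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return a multiset (Counter) of the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def f1(overlap, cand_total, ref_total):
    """Harmonic mean of precision and recall; 0.0 when nothing overlaps."""
    if overlap == 0 or cand_total == 0 or ref_total == 0:
        return 0.0
    precision = overlap / cand_total
    recall = overlap / ref_total
    return 2 * precision * recall / (precision + recall)


def rouge_n(candidate, reference, n=1):
    """ROUGE-N F1 with clipped n-gram counts (assumption: whitespace tokens)."""
    cand = ngrams(candidate.split(), n)
    ref = ngrams(reference.split(), n)
    overlap = sum((cand & ref).values())  # Counter intersection clips counts
    return f1(overlap, sum(cand.values()), sum(ref.values()))


def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(candidate, reference):
    """ROUGE-L F1 based on the longest common subsequence of tokens."""
    c, r = candidate.split(), reference.split()
    return f1(lcs_length(c, r), len(c), len(r))


if __name__ == "__main__":
    cand = "the cat sat on the mat"
    ref = "the cat lay on the mat"
    print(rouge_n(cand, ref, 1), rouge_n(cand, ref, 2), rouge_l(cand, ref))
```

ROUGE-2 is typically much lower than ROUGE-1 for the same pair of summaries because matching two consecutive tokens is a stricter condition than matching single tokens, which is consistent with the gap between the reported scores.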

Author information

Authors and Affiliations

Can Tho University, Can Tho, Vietnam
Ti-Hon Nguyen, Thanh Ma & Thanh-Nghi Do
UMI UMMISCO 209, IRD/UPMC, Bondy, France
Thanh-Nghi Do

Corresponding authors

Correspondence to Ti-Hon Nguyen or Thanh Ma.

Editor information

Editors and Affiliations

Ho Chi Minh City University of Industry and Trade, Ho Chi Minh City, Vietnam
Tran Khanh Dang
Johannes Kepler University of Linz, Linz, Austria
Josef Küng
Sungkyunkwan University, Suwon-si, Korea (Republic of)
Tai M. Chung

About this paper

Cite this paper

Nguyen, TH., Ma, T., Do, TN. (2023). LAVETTES: Large-scAle-dataset Vietnamese ExTractive TExt Summarization Models. In: Dang, T.K., Küng, J., Chung, T.M. (eds) Future Data and Security Engineering. Big Data, Security and Privacy, Smart City and Industry 4.0 Applications. FDSE 2023. Communications in Computer and Information Science, vol 1925. Springer, Singapore. https://doi.org/10.1007/978-981-99-8296-7_19

DOI: https://doi.org/10.1007/978-981-99-8296-7_19
Published: 17 November 2023
Publisher Name: Springer, Singapore
Print ISBN: 978-981-99-8295-0
Online ISBN: 978-981-99-8296-7
eBook Packages: Computer Science, Computer Science (R0)
