
LAVETTES: Large-scAle-dataset Vietnamese ExTractive TExt Summarization Models

  • Conference paper
Future Data and Security Engineering. Big Data, Security and Privacy, Smart City and Industry 4.0 Applications (FDSE 2023)

Part of the book series: Communications in Computer and Information Science ((CCIS,volume 1925))

Abstract

Text summarization has long been a significant challenge, and it must be addressed for each language individually. Each language has unique characteristics that reflect the identity, culture, and nuances of the country where it is spoken. This paper introduces extractive text summarization models for Vietnamese documents. Our approach concentrates on finding suitable and plausible models by combining machine learning (ML) algorithms. Specifically, we investigate three candidate models: a “G-global-hard-cluster” model (with GloVe), a “probability-cluster” model (with LDA, Latent Dirichlet Allocation), and a “soft-specific” model that combines SGD (stochastic gradient descent) with k-means. We also report experimental results that evaluate summary quality and running time. In particular, our approaches obtain the expected results, with 51.49% ROUGE-1, 17.99% ROUGE-2, and 29.25% ROUGE-L. Finally, we discuss the promising results of the proposed models.
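
As an illustration of the general idea behind cluster-based extractive summarization and ROUGE-N scoring, here is a minimal sketch. It is not the authors' implementation. It assumes bag-of-words count vectors in place of GloVe or LDA features, plain Lloyd-style k-means in place of the SGD-based variant, and whitespace tokenization in place of a proper Vietnamese word segmenter. All function names (tokenize, vectorize, kmeans, summarize, rouge_n) are hypothetical and chosen only for this example.

```python
from collections import Counter


def tokenize(text):
    """Lowercase whitespace tokenization (an assumption for brevity;
    Vietnamese text normally needs a dedicated word segmenter)."""
    return text.lower().split()


def vectorize(sentences):
    """Build bag-of-words count vectors over a shared vocabulary."""
    vocab = sorted({tok for s in sentences for tok in tokenize(s)})
    index = {tok: i for i, tok in enumerate(vocab)}
    vectors = []
    for s in sentences:
        vec = [0.0] * len(vocab)
        for tok in tokenize(s):
            vec[index[tok]] += 1.0
        vectors.append(vec)
    return vectors


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(vectors, k, iters=20):
    """Plain Lloyd iterations, seeded with the first k vectors
    (a simplification; requires k <= len(vectors))."""
    centroids = [list(v) for v in vectors[:k]]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for v in vectors:
            nearest = min(range(k), key=lambda c: sq_dist(v, centroids[c]))
            clusters[nearest].append(v)
        for c, members in enumerate(clusters):
            if members:  # keep the old centroid if a cluster empties
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return centroids


def summarize(sentences, k):
    """Pick the sentence nearest to each centroid, in document order."""
    vectors = vectorize(sentences)
    centroids = kmeans(vectors, k)
    chosen = {min(range(len(vectors)), key=lambda i: sq_dist(vectors[i], c))
              for c in centroids}
    return [sentences[i] for i in sorted(chosen)]


def rouge_n(candidate, reference, n=1):
    """Return ROUGE-N (recall, precision, F1) from clipped n-gram overlap."""
    def ngrams(text):
        toks = tokenize(text)
        return Counter(tuple(toks[i:i + n]) for i in range(len(toks) - n + 1))

    cand, ref = ngrams(candidate), ngrams(reference)
    overlap = sum((cand & ref).values())  # per-n-gram minimum counts
    recall = overlap / max(sum(ref.values()), 1)
    precision = overlap / max(sum(cand.values()), 1)
    f1 = 0.0 if overlap == 0 else 2 * precision * recall / (precision + recall)
    return recall, precision, f1
```

A call such as summarize(sentences, k=3) returns up to three sentences, one per cluster, and rouge_n(" ".join(summary), reference, n=1) scores them against a reference summary. ROUGE-L, which relies on the longest common subsequence, is not covered by this sketch.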

Author information

Corresponding authors

Correspondence to Ti-Hon Nguyen or Thanh Ma.

Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.

About this paper

Cite this paper

Nguyen, TH., Ma, T., Do, TN. (2023). LAVETTES: Large-scAle-dataset Vietnamese ExTractive TExt Summarization Models. In: Dang, T.K., Küng, J., Chung, T.M. (eds) Future Data and Security Engineering. Big Data, Security and Privacy, Smart City and Industry 4.0 Applications. FDSE 2023. Communications in Computer and Information Science, vol 1925. Springer, Singapore. https://doi.org/10.1007/978-981-99-8296-7_19

  • DOI: https://doi.org/10.1007/978-981-99-8296-7_19

  • Publisher Name: Springer, Singapore

  • Print ISBN: 978-981-99-8295-0

  • Online ISBN: 978-981-99-8296-7

  • eBook Packages: Computer Science, Computer Science (R0)
