ABSTRACT
Automatic text summarization is the task of reducing the length of a document without losing its primary ideas. The flood of digital text-based information has created great demand for summarization systems. In this paper, we investigate a number of word-embedding-based approaches to sentence representation, which we combine with the PageRank algorithm to select sentences for summary construction. We compare these new methods with a range of other current approaches to summarization. While the same summarization approaches can generally be applied across different languages, we target Vietnamese because of the relative lack of previous work in this space and because it is a good example of a language that generally requires word segmentation. Our experiments find that a word-embedding and graph-based approach is an effective strategy for Vietnamese summarization and that word segmentation is not necessary for achieving good summarization results.
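The pipeline outlined in the abstract (embed sentences, link them in a similarity graph, rank with PageRank, keep the top sentences) can be illustrated with a minimal sketch. The abstract does not say how sentence vectors are composed or how edges are weighted, so the choices here are assumptions: sentence vectors are plain averages of word vectors, edge weights are non-negative cosine similarities, and tokens come from whitespace splitting with no word segmentation. The function names, the damping factor of 0.85, and the toy embedding table are all hypothetical.

```python
import math


def sentence_vector(tokens, embeddings, dim):
    """Average the embeddings of known tokens (assumed composition method)."""
    vecs = [embeddings[t] for t in tokens if t in embeddings]
    if not vecs:
        return [0.0] * dim
    return [sum(v[i] for v in vecs) / len(vecs) for i in range(dim)]


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(x * x for x in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return sum(a * b for a, b in zip(u, v)) / (norm_u * norm_v)


def pagerank(weights, damping=0.85, iterations=50):
    """Weighted PageRank by power iteration over an adjacency matrix."""
    n = len(weights)
    scores = [1.0 / n] * n
    out_sums = [sum(row) for row in weights]
    for _ in range(iterations):
        new_scores = []
        for i in range(n):
            rank = 0.0
            for j in range(n):
                if out_sums[j] > 0.0:
                    rank += weights[j][i] / out_sums[j] * scores[j]
                else:
                    # A node with no outgoing weight spreads its score evenly.
                    rank += scores[j] / n
            new_scores.append((1.0 - damping) / n + damping * rank)
        scores = new_scores
    return scores


def summarize(sentences, embeddings, dim, k=1):
    """Return the k highest-ranked sentences in their original order."""
    # Whitespace tokens only: no word segmentation step (an assumption here).
    vectors = [sentence_vector(s.split(), embeddings, dim) for s in sentences]
    n = len(sentences)
    weights = [
        [max(cosine(vectors[i], vectors[j]), 0.0) if i != j else 0.0
         for j in range(n)]
        for i in range(n)
    ]
    scores = pagerank(weights)
    top = sorted(range(n), key=lambda i: scores[i], reverse=True)[:k]
    return [sentences[i] for i in sorted(top)]


# Toy, made-up 2-dimensional embeddings for illustration only.
toy_embeddings = {"w1": [1.0, 0.0], "w2": [0.9, 0.1], "w3": [0.0, 1.0]}
toy_sentences = ["w1 w2", "w2 w1 w1", "w3"]
summary = summarize(toy_sentences, toy_embeddings, dim=2, k=2)
```

In this toy input the first two sentences are close in embedding space and the third is nearly orthogonal to both, so the third receives the lowest rank and is left out of a two-sentence summary. A real system would load embeddings trained on a Vietnamese corpus in place of the toy table.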