A Study on the Use of Word Embeddings and PageRank for Vietnamese Text Summarization

Authors:
Viet Phung, Queensland University of Technology, Brisbane, Queensland
Lance De Vine, Queensland University of Technology, Brisbane, Queensland

ADCS '15: Proceedings of the 20th Australasian Document Computing Symposium, December 2015, Article No. 7, Pages 1–8
https://doi.org/10.1145/2838931.2838935

Published: 08 December 2015


ABSTRACT

Automatic text summarization reduces the length of a document without losing its primary ideas. The flood of digital text-based information has created great demand for summarization systems. In this paper, we investigate a number of word-embedding-based approaches to sentence representation, which are combined with the PageRank algorithm to select sentences for summary construction. We compare these new methods with a range of other current approaches to summarization. While the same summarization approaches can generally be applied across different languages, we target Vietnamese because of the relative lack of previous work in this space and because it is a good example of a language that generally requires word segmentation. Our experiments find that a word-embedding and graph-based approach is an effective strategy for Vietnamese summarization and that word segmentation is not necessary for achieving good summarization results.
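The general pipeline the abstract describes (embed sentences, build a similarity graph, rank with PageRank, pick the top sentences) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: averaging word vectors is only one possible sentence representation, the damping factor of 0.85 is the conventional default rather than a value taken from the paper, and the function names and toy embeddings are hypothetical. Tokens here are whatever units the caller supplies, so whitespace-separated syllables work without a word segmenter.

```python
import math

def sentence_vector(tokens, embeddings, dim):
    """Average the vectors of known tokens (assumed representation; zero vector if none are known)."""
    vecs = [embeddings[t] for t in tokens if t in embeddings]
    if not vecs:
        return [0.0] * dim
    return [sum(col) / len(vecs) for col in zip(*vecs)]

def cosine(u, v):
    """Cosine similarity, defined as 0.0 when either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def pagerank(weights, damping=0.85, iterations=100, tol=1e-8):
    """Weighted PageRank by power iteration over a square matrix of edge weights."""
    n = len(weights)
    out_sums = [sum(row) for row in weights]
    scores = [1.0 / n] * n
    for _ in range(iterations):
        new_scores = []
        for j in range(n):
            rank = 0.0
            for i in range(n):
                if out_sums[i] > 0:
                    rank += scores[i] * weights[i][j] / out_sums[i]
                else:
                    rank += scores[i] / n  # node with no edges: spread its score uniformly
            new_scores.append((1.0 - damping) / n + damping * rank)
        delta = sum(abs(a - b) for a, b in zip(new_scores, scores))
        scores = new_scores
        if delta < tol:
            break
    return scores

def summarize(sentences, embeddings, dim, k=1):
    """Return the indices of the k highest-ranked sentences, in document order."""
    vectors = [sentence_vector(s, embeddings, dim) for s in sentences]
    n = len(vectors)
    # Edge weight = cosine similarity, clipped at zero; no self-loops.
    weights = [
        [max(cosine(vectors[i], vectors[j]), 0.0) if i != j else 0.0 for j in range(n)]
        for i in range(n)
    ]
    scores = pagerank(weights)
    top = sorted(range(n), key=lambda i: scores[i], reverse=True)[:k]
    return sorted(top)

# Toy 2-dimensional embeddings, invented purely for illustration.
toy_embeddings = {"a": [1.0, 0.0], "b": [0.9, 0.1], "c": [0.0, 1.0]}
toy_sentences = [["a", "b"], ["a"], ["c"]]
print(summarize(toy_sentences, toy_embeddings, dim=2, k=2))
```

In practice the embeddings would come from a model trained on a Vietnamese corpus, and the similarity graph could be thresholded or sparsified; those choices are left open here.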

Published in
ADCS '15: Proceedings of the 20th Australasian Document Computing Symposium
December 2015
72 pages
ISBN: 9781450340403
DOI: 10.1145/2838931
Editors: Laurence A. F. Park (Western Sydney University), Sarvnaz Karimi (CSIRO)
Copyright © 2015 ACM
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
PageRank
Vietnamese
graph-based model
single-document summarization
word embeddings
Qualifiers
- research-article
- Research
- Refereed limited
Acceptance Rates
ADCS '15 Paper Acceptance Rate: 5 of 14 submissions, 36%
Overall Acceptance Rate: 30 of 57 submissions, 53%