research-article

An EDU-Based Approach for Thai Multi-Document Summarization and Its Application

Published: 30 January 2015

Abstract

Because Thai text lacks explicit word, phrase, and sentence boundaries, multi-document summarization in Thai faces challenges in unit segmentation, unit selection, duplication elimination, and evaluation dataset construction. In this article, we introduce Thai Elementary Discourse Units (TEDUs) and their derivatives, called Combined TEDUs (CTEDUs), and then present our three-stage method for Thai multi-document summarization: unit segmentation, unit-graph formulation, and unit selection and summary generation. To examine the performance of the proposed method, we conduct experiments on 50 sets of Thai news articles with manually constructed reference summaries. Measured by ROUGE-1, ROUGE-2, and ROUGE-SU4, the experimental results show that (1) TEDU-based summarization outperforms paragraph-based summarization; (2) our proposed graph-based TEDU weighting with importance-based selection achieves the best performance; and (3) considering unit duplication and recalculating weights helps improve summary quality.
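
The abstract names the stages but not their formulas, so the following is only a minimal sketch of what a unit-graph pipeline of this general kind might look like, not the authors' method. It assumes units are already segmented into token lists (the segmentation stage is skipped), uses Jaccard similarity for edge weights, a PageRank-style power iteration for unit weighting, and a simple overlap threshold in place of the weight recalculation mentioned above. All function names and parameters (`damping`, `budget`, `max_overlap`) are hypothetical.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity between two token lists (assumed edge weight)."""
    sa, sb = set(a), set(b)
    if not sa or not sb:
        return 0.0
    return len(sa & sb) / len(sa | sb)


def build_unit_graph(units):
    """Build a symmetric similarity matrix over pre-segmented units."""
    n = len(units)
    sim = [[0.0] * n for _ in range(n)]
    for i, j in combinations(range(n), 2):
        sim[i][j] = sim[j][i] = jaccard(units[i], units[j])
    return sim


def weight_units(sim, damping=0.85, iterations=50):
    """PageRank-style power iteration on the weighted, undirected graph."""
    n = len(sim)
    if n == 0:
        return []
    out = [sum(row) for row in sim]  # total edge weight per unit
    w = [1.0 / n] * n
    for _ in range(iterations):
        w = [
            (1 - damping) / n
            + damping * sum(
                sim[j][i] / out[j] * w[j] for j in range(n) if out[j] > 0
            )
            for i in range(n)
        ]
    return w


def select_units(weights, sim, budget=2, max_overlap=0.5):
    """Greedy selection by weight, skipping units too similar to chosen ones.

    This threshold is a stand-in for duplication handling; it does not
    recalculate weights after each pick.
    """
    chosen = []
    for i in sorted(range(len(weights)), key=lambda k: -weights[k]):
        if len(chosen) == budget:
            break
        if all(sim[i][j] < max_overlap for j in chosen):
            chosen.append(i)
    return sorted(chosen)  # restore original unit order for the summary
```

As a usage illustration, given units such as `[["flood", "hits", "bangkok"], ["flood", "hits", "bangkok", "today"], ["government", "sends", "aid"], ["aid", "reaches", "bangkok"]]`, calling `build_unit_graph`, then `weight_units`, then `select_units` returns two unit indices, and the threshold prevents the first two near-duplicate units from both being selected.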

Cited By

  • (2024) Construction of Text Summarization Corpus in Economics Domain and Baseline Models. Journal of information and communication convergence engineering 22:1 (33-43). DOI: 10.56977/jicce.2024.22.1.33. Online publication date: 31-Mar-2024
  • (2020) StyloThai: ACM Transactions on Asian and Low-Resource Language Information Processing 19:3 (1-15). DOI: 10.1145/3365832. Online publication date: 9-Jan-2020
  • (2017) A method to generate text summary by accounting pronoun frequency for keywords weightage computation. 2017 International Conference on Engineering and Technology (ICET) (1-4). DOI: 10.1109/ICEngTechnol.2017.8308170. Online publication date: Aug-2017

    Published In

    ACM Transactions on Asian and Low-Resource Language Information Processing, Volume 14, Issue 1
    January 2015
    83 pages
    ISSN:2375-4699
    EISSN:2375-4702
    DOI:10.1145/2730923
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 30 January 2015
    Accepted: 01 July 2014
    Revised: 01 July 2014
    Received: 01 February 2014
    Published in TALLIP Volume 14, Issue 1

    Author Tags

    1. EDU-based approach
    2. Multi-document summarization
    3. Thai text summarization
    4. unit selection

    Qualifiers

    • Research-article
    • Research
    • Refereed

    Funding Sources

    • National Electronics and Computer Technology Center (NECTEC)
    • Bangchak Petroleum Public Company Limited (BCP), Thailand
    • National Research University Project of Thailand Office of Higher Education Commission
