Using topic themes for multi-document summarization

Authors:

Sanda Harabagiu,

Finley Lacatusu

ACM Transactions on Information Systems (TOIS), Volume 28, Issue 3

Article No.: 13, Pages 1 - 47

https://doi.org/10.1145/1777432.1777436

Published: 02 July 2010

Abstract

The problem of using topic representations for multidocument summarization (MDS) has received considerable attention recently. Several topic representations have been employed for producing informative and coherent summaries. In this article, we describe five previously known topic representations and introduce two novel representations of topics based on topic themes. We present eight different methods of generating multidocument summaries and evaluate each of these methods on a large set of topics used in past DUC workshops. Our evaluation results show a significant improvement in the quality of summaries based on topic themes over MDS methods that use alternative topic representations.

Index Terms

Using topic themes for multi-document summarization
1. Computing methodologies
  1. Artificial intelligence
    1. Natural language processing
      1. Language resources
2. Information systems
  1. Information retrieval
    1. Document representation
      1. Content analysis and feature selection

Reviews

Reviewer: Quinsulon Israel

With the increase of digital text and the rise of related metadata, there is a growing interest in finding ways to reduce information overload while still maintaining the most important and useful content. Focused multi-document summarization (MDS) is a process that seeks to condense collections of documents that are related by a query, question, topic, or category down to a passage of only several sentences. Harabagiu and Lacatusu present research based on topic themes, a new method of topic representation. Topic themes not only improve all aspects of the MDS process, but they also improve one's understanding of the performance of the various focus-based techniques and their various combinations, with different extraction, compression, and selection methods. These topic themes are basically simple predicate-argument structures. In short, the research compares two of their own novel representations with five state-of-the-art topic representations that use eight theme selection methods (in all, 40 MDS system combinations). Because of the many explanations of the various MDS topic representation techniques, the fundamental MDS and evaluation measures, and the authors' methodology, the paper is a bit verbose and information dense. That being said, the paper's clear writing style makes it accessible to new computational linguistics and natural language processing students, who should read this paper in its entirety. However, experts (readers who are already very familiar with information retrieval and MDS) should use this source as a reference. This elucidation of the MDS field is a great example of thorough experimentation.

Online Computing Reviews Service

Information & Contributors

Information

Published In

ACM Transactions on Information Systems Volume 28, Issue 3

June 2010

231 pages

ISSN:1046-8188

EISSN:1558-2868

DOI:10.1145/1777432

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 02 July 2010

Accepted: 01 August 2009

Revised: 01 June 2009

Received: 01 October 2008

Published in TOIS Volume 28, Issue 3

Qualifiers

Research-article
Research
Refereed

Bibliometrics & Citations

Bibliometrics

Article Metrics

Total Citations: 22
Total Downloads: 902
Downloads (Last 12 months): 5
Downloads (Last 6 weeks): 0

Reflects downloads up to 15 Feb 2025

Cited By

Abdelqader K, Mohamed A, Shaalan K (2023). Systematic Review of Automatic Arabic Text Summarization Techniques. Business Intelligence and Information Technology (783-796). Online publication date: 4-Jul-2023
https://doi.org/10.1007/978-981-99-3416-4_63
Etaiwi W, Awajan A (2022). SemG-TS: Abstractive Arabic Text Summarization Using Semantic Graph Embedding. Mathematics 10:18 (3225). Online publication date: 6-Sep-2022
https://doi.org/10.3390/math10183225
Dong L, Satpute M, Wu W, Du D (2021). Two-Phase Multidocument Summarization Through Content-Attention-Based Subtopic Detection. IEEE Transactions on Computational Social Systems 8:6 (1379-1392). Online publication date: Dec-2021
https://doi.org/10.1109/TCSS.2021.3079206
Borkakoty H, Sharma U (2021). A Naive Extractive Text Summarizer for Assamese Language. 2021 Third International Conference on Inventive Research in Computing Applications (ICIRCA) (1712-1717). Online publication date: 2-Sep-2021
https://doi.org/10.1109/ICIRCA51532.2021.9544769
Kumar A, Katiyar V, Kumar P (2021). RETRACTED: A Comparative Analysis of Pre-Processing Time in Summary of Hindi Language using Stanza and Spacy. IOP Conference Series: Materials Science and Engineering 1110:1 (012019). Online publication date: 1-Mar-2021
https://doi.org/10.1088/1757-899X/1110/1/012019
Dar R, Dileep A (2021). Small, narrow, and parallel recurrent neural networks for sentence representation in extractive text summarization. Journal of Ambient Intelligence and Humanized Computing 13:9 (4151-4157). Online publication date: 6-Nov-2021
https://doi.org/10.1007/s12652-021-03583-1
Satpute M, Dong L, Wu W, Du D (2019). Multi-Document Extractive Summarization as a Non-linear Combinatorial Optimization Problem. Nonlinear Combinatorial Optimization (295-308). Online publication date: 1-Jun-2019
https://doi.org/10.1007/978-3-030-16194-1_15
Meng L, Tan A, Wunsch II D (2019). Clustering and Its Extensions in the Social Media Domain. Adaptive Resonance Theory in Social Media Data Clustering (15-44). Online publication date: 1-May-2019
https://doi.org/10.1007/978-3-030-02985-2_2
Timakum T, Kim G, Song M (2018). A data-driven analysis of the knowledge structure of library science with full-text journal articles. Journal of Librarianship and Information Science 52:2 (345-365). Online publication date: 9-Oct-2018
https://doi.org/10.1177/0961000618793977
Hao Z, Xu B, Zheng S, Gao Y (2018). Structured Text Summarization via Open Domain Information Extraction. 2018 IEEE 22nd International Conference on Computer Supported Cooperative Work in Design (CSCWD) (701-706). Online publication date: May-2018
https://doi.org/10.1109/CSCWD.2018.8465372
