Abstract
This article presents a unified framework for extracting standard and update summaries from a set of documents. A topic modeling approach determines salience, and a dynamic modeling approach controls redundancy. For salience determination, we represent text units of every granularity (word, sentence, document, document set, and summary) in a single vector space model, each as a probability distribution over the inherent topics of the given documents or of a related corpus. This lets us calculate the similarity between any two text units from their topic probability distributions. For redundancy control in standard summarization, we consider not only the similarity between a sentence and the given documents but also the similarity between the summary and the given documents and the similarity between the sentence and the summary. For update summarization, we additionally consider the similarity between the sentence and the history documents or summary. Evaluation on the English TAC 2008 and 2009 data shows encouraging results, particularly for the dynamic modeling approach in removing redundancy in the given documents. Finally, we extend the framework to Chinese multi-document summarization, where experiments show that the framework is effective as well.
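The idea of scoring sentences through topic distributions can be illustrated with a minimal sketch. It is not the article's actual formulation, which the abstract does not spell out. The sketch makes several assumptions: per-word topic distributions (the hypothetical `word_topics` table) come from some already trained topic model; larger text units are represented by averaging their words' distributions; similarity is one minus the Jensen-Shannon divergence; and the four similarity terms are combined with equal, unweighted signs. The function names `unit_distribution`, `js_similarity`, and `select_sentences` are illustrative only.

```python
import math
from typing import Dict, List, Optional, Sequence

def unit_distribution(words: Sequence[str],
                      word_topics: Dict[str, List[float]],
                      num_topics: int) -> List[float]:
    """Topic distribution of a text unit: the mean of its known words'
    distributions (an assumption; the article may aggregate differently)."""
    acc = [0.0] * num_topics
    count = 0
    for w in words:
        dist = word_topics.get(w)
        if dist is None:
            continue  # skip words the topic model has not seen
        for k, p in enumerate(dist):
            acc[k] += p
        count += 1
    if count == 0:
        return [1.0 / num_topics] * num_topics  # uniform fallback
    return [a / count for a in acc]

def js_similarity(p: Sequence[float], q: Sequence[float]) -> float:
    """1 minus the base-2 Jensen-Shannon divergence, a value in [0, 1]."""
    m = [(a + b) / 2 for a, b in zip(p, q)]
    def kl(x, y):
        return sum(a * math.log2(a / b) for a, b in zip(x, y) if a > 0)
    return 1.0 - 0.5 * kl(p, m) - 0.5 * kl(q, m)

def select_sentences(sentences: List[List[str]],
                     word_topics: Dict[str, List[float]],
                     num_topics: int,
                     max_sentences: int,
                     history: Optional[List[List[str]]] = None) -> List[int]:
    """Greedy selection. Each candidate is rewarded for resembling the
    documents (alone and as part of the growing summary) and penalised
    for resembling the current summary or the history documents."""
    doc_words = [w for s in sentences for w in s]
    doc_dist = unit_distribution(doc_words, word_topics, num_topics)
    hist_dist = None
    if history:
        hist_words = [w for s in history for w in s]
        hist_dist = unit_distribution(hist_words, word_topics, num_topics)

    chosen: List[int] = []
    summary_words: List[str] = []
    while len(chosen) < min(max_sentences, len(sentences)):
        best_idx, best_score = -1, float("-inf")
        for i, sent in enumerate(sentences):
            if i in chosen:
                continue
            sent_dist = unit_distribution(sent, word_topics, num_topics)
            cand_dist = unit_distribution(summary_words + list(sent),
                                          word_topics, num_topics)
            score = js_similarity(sent_dist, doc_dist)    # sentence vs. documents
            score += js_similarity(cand_dist, doc_dist)   # summary vs. documents
            if summary_words:                             # sentence vs. summary
                summ_dist = unit_distribution(summary_words, word_topics, num_topics)
                score -= js_similarity(sent_dist, summ_dist)
            if hist_dist is not None:                     # sentence vs. history
                score -= js_similarity(sent_dist, hist_dist)
            if score > best_score:
                best_idx, best_score = i, score
        chosen.append(best_idx)
        summary_words.extend(sentences[best_idx])
    return chosen

# Toy two-topic example with made-up word distributions.
word_topics = {"budget": [0.9, 0.1], "tax": [0.8, 0.2],
               "storm": [0.1, 0.9], "flood": [0.2, 0.8]}
sentences = [["budget", "tax"], ["budget", "tax"], ["storm", "flood"]]
standard = select_sentences(sentences, word_topics, 2, max_sentences=2)
update = select_sentences(sentences, word_topics, 2, max_sentences=1,
                          history=[["budget", "tax"]])
print(standard, update)
```

In the toy data, the second sentence duplicates the first, so the standard run skips it in favour of the sentence on the other topic, and the update run avoids the topic already covered by the history documents. A real system would presumably weight the four terms and tune them on held-out data.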
Index Terms
- Toward a Unified Framework for Standard and Update Multi-Document Summarization