Multi-document summarization via Archetypal Analysis of the content-graph joint model

Canhasi, Ercan; Kononenko, Igor

doi:10.1007/s10115-013-0689-8

Regular Paper
Published: 22 September 2013

Volume 41, pages 821–842, (2014)
Knowledge and Information Systems

Ercan Canhasi¹ &
Igor Kononenko¹

Abstract

In recent years, algebraic methods, more precisely matrix decomposition approaches, have become a key tool for tackling the document summarization problem. Typical algebraic methods used in multi-document summarization (MDS) range from soft and hard clustering approaches to low-rank approximations. In this paper, we present a novel summarization method, AASum, which employs archetypal analysis for generic MDS. Archetypal analysis (AA) is a promising unsupervised learning tool that combines the advantages of clustering with the flexibility of matrix factorization. In document summarization, given a content-graph data matrix representation of a set of documents, positively and/or negatively salient sentences are values on the data set boundary. These extreme values, the archetypes, can be computed using AA. While each sentence in a data set is estimated as a mixture of archetypal sentences, the archetypes themselves are restricted to being sparse mixtures, i.e., convex combinations, of the original sentences. Since AA in this way readily offers soft clustering, we suggest considering it as a method for simultaneous sentence clustering and ranking. Another important argument in favor of using AA in MDS is that, in contrast to other factorization methods, which extract prototypical, characteristic, or even basic sentences, AA selects distinct (archetypal) sentences and thus induces variability and diversity in the produced summaries. Experimental results on the DUC generic summarization data sets show that the proposed approach improves on other closely related methods.
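
The factorization described in the abstract can be illustrated with a minimal sketch. It approximates a data matrix X (one row per sentence) as S (C X), where each row of C is a convex combination of sentences that defines one archetype, and each row of S expresses a sentence as a convex mixture of the archetypes. This is not the AASum implementation: the alternating projected-gradient solver, the function names `archetypal_analysis` and `project_rows_to_simplex`, and the toy matrix standing in for the content-graph representation are all assumptions made for illustration.

```python
import numpy as np


def project_rows_to_simplex(V):
    """Euclidean projection of each row of V onto the probability simplex."""
    n, k = V.shape
    U = np.sort(V, axis=1)[:, ::-1]           # each row sorted in descending order
    css = np.cumsum(U, axis=1) - 1.0
    idx = np.arange(1, k + 1)
    rho = (U - css / idx > 0).sum(axis=1)     # size of the support of each projection
    theta = css[np.arange(n), rho - 1] / rho  # per-row shift
    return np.maximum(V - theta[:, None], 0.0)


def archetypal_analysis(X, k, n_iter=200, seed=0):
    """Approximate X (n x d) by S @ (C @ X).

    S (n x k) and C (k x n) have non-negative rows that sum to one, so the
    archetypes Z = C @ X lie in the convex hull of the rows of X.
    Illustrative solver only: alternating projected gradient steps with
    step size 1/L, where L is the Lipschitz constant of each block gradient.
    """
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    S = project_rows_to_simplex(rng.random((n, k)))
    C = project_rows_to_simplex(rng.random((k, n)))
    gram_norm = np.linalg.norm(X @ X.T, 2)    # spectral norm of the Gram matrix
    for _ in range(n_iter):
        Z = C @ X
        # Update the mixture weights S with the archetypes held fixed.
        step_S = 1.0 / (2.0 * np.linalg.norm(Z @ Z.T, 2) + 1e-12)
        S = project_rows_to_simplex(S - step_S * 2.0 * (S @ Z - X) @ Z.T)
        # Update the archetype weights C with the mixtures held fixed.
        step_C = 1.0 / (2.0 * np.linalg.norm(S.T @ S, 2) * gram_norm + 1e-12)
        grad_C = 2.0 * S.T @ (S @ C @ X - X) @ X.T
        C = project_rows_to_simplex(C - step_C * grad_C)
    return S, C


# Toy data (hypothetical): six "sentences" described by two features each.
X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0],
              [0.3, 0.3], [0.5, 0.2], [0.2, 0.5]])
S, C = archetypal_analysis(X, k=3)
Z = C @ X  # archetypes, one per row
```

In this sketch the rows of S act as soft cluster memberships, which is the property the abstract points to when it proposes AA for simultaneous sentence clustering and ranking. How those memberships are turned into sentence scores is not specified here and is left out of the example.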

Acknowledgments

We thank the anonymous reviewers for their very useful comments and suggestions.

Author information

Authors and Affiliations

Faculty of Computer and Information Science, University of Ljubljana, Ljubljana, Slovenia
Ercan Canhasi & Igor Kononenko

Corresponding author

Correspondence to Ercan Canhasi.

About this article

Cite this article

Canhasi, E., Kononenko, I. Multi-document summarization via Archetypal Analysis of the content-graph joint model. Knowl Inf Syst 41, 821–842 (2014). https://doi.org/10.1007/s10115-013-0689-8

Received: 15 December 2012
Revised: 03 September 2013
Accepted: 14 September 2013
Published: 22 September 2013
Issue Date: December 2014
DOI: https://doi.org/10.1007/s10115-013-0689-8
