Abstract
In recent years, algebraic methods, and matrix decomposition approaches in particular, have become a key tool for tackling the document summarization problem. Typical algebraic methods used in multi-document summarization (MDS) range from soft and hard clustering approaches to low-rank approximations. In this paper, we present a novel summarization method, AASum, which employs archetypal analysis for generic MDS. Archetypal analysis (AA) is a promising unsupervised learning tool that combines the advantages of clustering with the flexibility of matrix factorization. In document summarization, given a content-graph data matrix representation of a set of documents, positively and/or negatively salient sentences are values on the data set boundary. These extreme values, the archetypes, can be computed using AA. While each sentence in a data set is estimated as a mixture of archetypal sentences, the archetypes themselves are restricted to being sparse mixtures, i.e., convex combinations, of the original sentences. Since AA in this way readily offers soft clustering, we suggest considering it as a method for simultaneous sentence clustering and ranking. Another important argument in favor of using AA in MDS is that, in contrast to other factorization methods, which extract prototypical, characteristic, or even basic sentences, AA selects distinct (archetypal) sentences and thus induces variability and diversity in the produced summaries. Experimental results on the DUC generic summarization data sets show that the proposed approach improves on other closely related methods.
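The factorization described in the abstract can be illustrated with a minimal sketch. It assumes the generic AA formulation: a data matrix X with one row per sentence is approximated as S C X, where the rows of C express each archetype as a convex combination of the sentences and the rows of S express each sentence as a mixture of archetypes. The solver here is plain alternating projected gradient descent, swapped in for simplicity; it is not the AASum implementation, and the function names, the toy matrix, and all parameters are hypothetical.

```python
import numpy as np


def project_rows_to_simplex(M):
    """Project each row of M onto the probability simplex (entries >= 0, sum 1)."""
    n, k = M.shape
    U = np.sort(M, axis=1)[:, ::-1]            # each row sorted in descending order
    css = np.cumsum(U, axis=1) - 1.0
    positive = U - css / np.arange(1, k + 1) > 0
    rho = positive.sum(axis=1)                 # support size of each projected row
    theta = css[np.arange(n), rho - 1] / rho   # per-row shift
    return np.maximum(M - theta[:, None], 0.0)


def archetypal_analysis(X, n_archetypes, n_iter=200, seed=0):
    """Approximate X (n x m) by S @ C @ X with row-stochastic S (n x z), C (z x n)."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    S = rng.dirichlet(np.ones(n_archetypes), size=n)   # sentence-to-archetype mixtures
    C = rng.dirichlet(np.ones(n), size=n_archetypes)   # archetype-to-sentence mixtures
    xxt_norm = np.linalg.norm(X @ X.T, 2)
    losses = []
    for _ in range(n_iter):
        Z = C @ X                                      # current archetypes (z x m)
        # Gradient step on S with step 1/L, then projection back to the simplex.
        step_S = 1.0 / (np.linalg.norm(Z @ Z.T, 2) + 1e-12)
        S = project_rows_to_simplex(S - step_S * (S @ Z - X) @ Z.T)
        # Same for C, using an upper bound on the Lipschitz constant.
        step_C = 1.0 / (np.linalg.norm(S.T @ S, 2) * xxt_norm + 1e-12)
        C = project_rows_to_simplex(C - step_C * S.T @ (S @ C @ X - X) @ X.T)
        losses.append(0.5 * np.linalg.norm(X - S @ C @ X) ** 2)
    return S, C, losses


# Toy stand-in for a sentence matrix: 8 "sentences", 5 features (random, illustrative only).
X = np.random.default_rng(1).random((8, 5))
S, C, losses = archetypal_analysis(X, n_archetypes=3)
soft_clusters = S.argmax(axis=1)   # dominant archetype of each sentence
```

Because each row of S is a probability vector, it can be read as a soft cluster assignment, which is the property the abstract appeals to when proposing AA for simultaneous clustering and ranking. How the rows of S and C are turned into a sentence ranking is specific to the paper and is not reproduced here.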
Acknowledgments
We thank the anonymous reviewers for their very useful comments and suggestions.
Cite this article
Canhasi, E., Kononenko, I. Multi-document summarization via Archetypal Analysis of the content-graph joint model. Knowl Inf Syst 41, 821–842 (2014). https://doi.org/10.1007/s10115-013-0689-8
DOI: https://doi.org/10.1007/s10115-013-0689-8