
Multi-document summarization via Archetypal Analysis of the content-graph joint model

  • Regular Paper
Knowledge and Information Systems

Abstract

In recent years, algebraic methods, more precisely matrix decomposition approaches, have become a key tool for tackling the document summarization problem. Typical algebraic methods used in multi-document summarization (MDS) range from soft and hard clustering approaches to low-rank approximations. In this paper, we present a novel summarization method, AASum, which employs archetypal analysis for generic MDS. Archetypal analysis (AA) is a promising unsupervised learning tool that combines the advantages of clustering with the flexibility of matrix factorization. In document summarization, given a content-graph data matrix representation of a set of documents, positively and/or negatively salient sentences are values on the data set boundary. These extreme values, the archetypes, can be computed using AA. While each sentence in a data set is estimated as a mixture of archetypal sentences, the archetypes themselves are restricted to being sparse mixtures, i.e., convex combinations of the original sentences. Since AA in this way readily offers soft clustering, we suggest considering it as a method for simultaneous sentence clustering and ranking. Another important argument in favor of using AA in MDS is that, in contrast to other factorization methods, which extract prototypical, characteristic, or even basic sentences, AA selects distinct (archetypal) sentences and thus induces variability and diversity in the produced summaries. Experimental results on the DUC generic summarization data sets demonstrate the improvement of the proposed approach over other closely related methods.
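
To make the two constraints described above concrete (each sentence is a mixture of archetypes, and each archetype is a convex combination of sentences), the following is a minimal sketch of generic archetypal analysis in Python with NumPy. It is not the AASum implementation: the solver (alternating projected gradient steps), the function names `project_rows_to_simplex` and `archetypal_analysis`, and the toy input matrix are all assumptions made for illustration. The rows of `X` stand in for sentence vectors; in the paper they would come from the content-graph joint model, which is not reproduced here.

```python
import numpy as np


def project_rows_to_simplex(M):
    """Euclidean projection of every row of M onto the probability simplex."""
    n, k = M.shape
    U = np.sort(M, axis=1)[:, ::-1]            # each row sorted in descending order
    css = np.cumsum(U, axis=1) - 1.0
    idx = np.arange(1, k + 1)
    rho = (U - css / idx > 0).sum(axis=1)      # size of the support, always >= 1
    theta = css[np.arange(n), rho - 1] / rho   # per-row shift
    return np.maximum(M - theta[:, None], 0.0)


def archetypal_analysis(X, k, n_iter=200, seed=0):
    """Approximately minimise ||X - A B X||_F^2 with rows of A and B on the simplex.

    X : (n, d) data matrix, one row per sentence (hypothetical input).
    A : (n, k) mixture weights, sentence i as a convex combination of archetypes.
    B : (k, n) weights, archetype j as a convex combination of sentences.
    """
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    A = rng.dirichlet(np.ones(k), size=n)
    B = rng.dirichlet(np.ones(n), size=k)
    xx_norm = np.linalg.norm(X @ X.T, 2)       # spectral norm, reused for the B step
    losses = []
    for _ in range(n_iter):
        Z = B @ X                              # current archetypes, shape (k, d)
        R = A @ Z - X
        # Step size 1/L for the A block, L = 2 * ||Z Z^T||_2.
        step_a = 1.0 / (2.0 * np.linalg.norm(Z @ Z.T, 2) + 1e-12)
        A = project_rows_to_simplex(A - step_a * 2.0 * R @ Z.T)
        R = A @ Z - X
        # Step size 1/L for the B block, L = 2 * ||A^T A||_2 * ||X X^T||_2.
        step_b = 1.0 / (2.0 * np.linalg.norm(A.T @ A, 2) * xx_norm + 1e-12)
        B = project_rows_to_simplex(B - step_b * 2.0 * A.T @ R @ X.T)
        losses.append(float(np.sum((A @ B @ X - X) ** 2)))
    return A, B, losses


if __name__ == "__main__":
    X_toy = np.random.default_rng(1).random((12, 5))   # 12 toy "sentences", 5 features
    A, B, losses = archetypal_analysis(X_toy, k=3)
    print("soft cluster of each sentence:", A.argmax(axis=1))
    print("sentence weighted most in each archetype:", B.argmax(axis=1))
```

In this sketch the rows of `A` give a soft clustering of the sentences, and the rows of `B` show which original sentences make up each archetype. How these weights are turned into a sentence ranking is specific to the paper and is not assumed here.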

Acknowledgments

We thank anonymous reviewers for their very useful comments and suggestions.

Author information

Corresponding author

Correspondence to Ercan Canhasi.

About this article

Cite this article

Canhasi, E., Kononenko, I. Multi-document summarization via Archetypal Analysis of the content-graph joint model. Knowl Inf Syst 41, 821–842 (2014). https://doi.org/10.1007/s10115-013-0689-8

  • DOI: https://doi.org/10.1007/s10115-013-0689-8
