SumCR: A new subtopic-based extractive approach for text summarization

  • Regular Paper
Knowledge and Information Systems

Abstract

In text summarization, relevance and coverage are two main criteria that determine the quality of a summary. In this paper, we propose SumCR, a new multi-document summarization approach based on sentence extraction. A novel feature called Exemplar is introduced to address these two concerns simultaneously during sentence ranking. Unlike conventional approaches, in which the relevance value of each sentence is calculated over the whole collection of sentences, the Exemplar value of each sentence in SumCR is obtained within a subset of similar sentences. A fuzzy medoid-based clustering approach is used to produce sentence clusters, or subsets, each of which corresponds to a subtopic of the related topic. This subtopic-based feature captures the relevance of each sentence within different subtopics and thus improves the chance that SumCR produces a summary with wider coverage and less redundancy. Another feature incorporated in SumCR is Position, i.e., the position at which each sentence appears in its document. The final score of each sentence is a combination of the subtopic-level feature Exemplar and the document-level feature Position. Experimental studies on DUC benchmark data show the good performance of SumCR and its potential in summarization tasks.
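The pipeline outlined in the abstract (cluster sentences into subtopics, score each sentence within its subtopic, then combine that score with a position score) can be illustrated with a minimal Python sketch. The abstract gives no formulas, so everything concrete here is an assumption and not the authors' method: a plain fuzzy c-medoids loop stands in for the fuzzy medoid-based clustering, the Exemplar value is approximated by a membership-weighted average cosine similarity within a sentence's dominant cluster, Position is taken as `1 / (index + 1)`, and the two are mixed linearly with a hypothetical weight `alpha`. The names `fuzzy_medoids` and `rank_sentences` and the parameters `k`, `m`, and `alpha` are likewise illustrative.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def memberships(dist, medoids, m=2.0):
    """Fuzzy membership of every sentence in every cluster; each row sums to 1."""
    rows = []
    for i in range(len(dist)):
        if i in medoids:  # a medoid belongs fully to its own cluster
            rows.append([1.0 if i == c else 0.0 for c in medoids])
            continue
        inv = [(1.0 / max(dist[i][c], 1e-9)) ** (1.0 / (m - 1.0)) for c in medoids]
        total = sum(inv)
        rows.append([v / total for v in inv])
    return rows


def fuzzy_medoids(dist, k, m=2.0, iters=20):
    """Plain fuzzy c-medoids (an assumed stand-in for the paper's clustering step)."""
    n = len(dist)
    medoids = list(range(k))  # naive initialisation: the first k sentences
    for _ in range(iters):
        u = memberships(dist, medoids, m)
        new = []
        for c in range(k):
            cost = lambda j: sum((u[i][c] ** m) * dist[i][j] for i in range(n))
            # keep medoids distinct by skipping indices already chosen
            new.append(min((j for j in range(n) if j not in new), key=cost))
        if new == medoids:
            break
        medoids = new
    return medoids, memberships(dist, medoids, m)


def rank_sentences(docs, k=2, alpha=0.7):
    """Rank sentences by alpha * exemplar + (1 - alpha) * position (hypothetical formulas)."""
    sents = [(p, s) for doc in docs for p, s in enumerate(doc)]
    bows = [Counter(s.lower().split()) for _, s in sents]
    n = len(sents)
    sim = [[cosine(bows[i], bows[j]) for j in range(n)] for i in range(n)]
    dist = [[1.0 - sim[i][j] for j in range(n)] for i in range(n)]
    _, u = fuzzy_medoids(dist, k)
    scored = []
    for i, (p, s) in enumerate(sents):
        c = max(range(k), key=lambda c: u[i][c])  # dominant subtopic of sentence i
        weight = sum(u[j][c] for j in range(n) if j != i)
        exemplar = sum(u[j][c] * sim[i][j] for j in range(n) if j != i) / weight if weight else 0.0
        position = 1.0 / (p + 1)  # earlier sentences in a document score higher
        scored.append((alpha * exemplar + (1 - alpha) * position, s))
    return sorted(scored, key=lambda t: t[0], reverse=True)
```

Calling `rank_sentences(docs)` with `docs` as a list of documents, each a list of sentence strings, returns `(score, sentence)` pairs in descending score order. Restricting the similarity average to one cluster is what makes the score subtopic-level: a sentence is rewarded for being representative of its own subtopic, not of the whole collection.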

Author information

Correspondence to Lihui Chen.

About this article

Cite this article

Mei, JP., Chen, L. SumCR: A new subtopic-based extractive approach for text summarization. Knowl Inf Syst 31, 527–545 (2012). https://doi.org/10.1007/s10115-011-0437-x

  • DOI: https://doi.org/10.1007/s10115-011-0437-x
