SumCR: A new subtopic-based extractive approach for text summarization

Mei, Jian-Ping; Chen, Lihui

doi:10.1007/s10115-011-0437-x

Regular Paper
Published: 06 August 2011

Volume 31, pages 527–545, (2012)
Knowledge and Information Systems

Abstract

In text summarization, relevance and coverage are the two main criteria that determine the quality of a summary. In this paper, we propose SumCR, a new multi-document summarization approach based on sentence extraction. A novel feature called Exemplar is introduced to address both criteria simultaneously during sentence ranking. Unlike conventional approaches, in which the relevance value of each sentence is calculated over the whole collection of sentences, the Exemplar value of each sentence in SumCR is obtained within a subset of similar sentences. A fuzzy medoid-based clustering approach is used to produce sentence clusters, or subsets, each of which corresponds to a subtopic of the related topic. This subtopic-based feature captures the relevance of each sentence within different subtopics and thus improves the chance that SumCR produces a summary with wider coverage and less redundancy. Another feature incorporated in SumCR is Position, i.e., the position at which each sentence appears in its document. The final score of each sentence is a combination of the subtopic-level feature Exemplar and the document-level feature Position. Experimental studies on DUC benchmark data show the good performance of SumCR and its potential in summarization tasks.
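
The abstract does not give the formulas behind Exemplar or Position, so the following is only a rough sketch of the general idea of scoring a sentence within its subtopic and then mixing in its position. Everything specific here is an assumption made for illustration: the function names `cosine` and `rank_sentences`, the bag-of-words cosine similarity, the crisp (non-fuzzy) assignment to pre-chosen medoids, the mean-similarity stand-in for Exemplar, the reciprocal stand-in for Position, and the mixing parameter `weight`. It should not be read as the authors' method.

```python
from collections import Counter
from math import sqrt


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def rank_sentences(docs, medoid_ids, weight=0.7):
    """Score sentences by a subtopic-level and a document-level feature.

    docs       : list of documents, each a list of sentence strings
    medoid_ids : indices (into the flattened sentence list) of one
                 representative sentence per subtopic; assumed to be given
                 here, whereas the paper obtains subtopics by fuzzy clustering
    weight     : hypothetical mixing weight between the two features
    """
    # Flatten to (sentence, position-in-document) pairs.
    sentences = [(s, pos) for doc in docs for pos, s in enumerate(doc)]
    bows = [Counter(s.lower().split()) for s, _ in sentences]

    # Assign every sentence to its most similar medoid. This is a crisp
    # simplification (assumption) of fuzzy membership.
    clusters = {m: [] for m in medoid_ids}
    for i, bow in enumerate(bows):
        best = max(medoid_ids, key=lambda m: cosine(bow, bows[m]))
        clusters[best].append(i)

    scores = [0.0] * len(sentences)
    for members in clusters.values():
        for i in members:
            # Stand-in "Exemplar" value (assumption): mean similarity to the
            # other sentences of the same subtopic, not the whole collection.
            others = [j for j in members if j != i]
            exemplar = (
                sum(cosine(bows[i], bows[j]) for j in others) / len(others)
                if others
                else 0.0
            )
            # Stand-in "Position" value (assumption): earlier is better.
            position = 1.0 / (1 + sentences[i][1])
            scores[i] = weight * exemplar + (1 - weight) * position

    # Highest combined score first; ties keep their original order.
    order = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    return [(sentences[i][0], scores[i]) for i in order]
```

Calling `rank_sentences(docs, medoid_ids=[0, 1])` on two short documents returns every sentence paired with its combined score, highest first. Because each sentence is compared only with members of its own cluster, a sentence that is central to a small subtopic is not penalized for being dissimilar to the rest of the collection, which is the intuition the abstract gives for wider coverage.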

Author information

Authors and Affiliations

Division of Information Engineering, School of Electrical and Electronic Engineering, Nanyang Technological University, 50 Nanyang Avenue, Singapore, Republic of Singapore
Jian-Ping Mei & Lihui Chen

Corresponding author

Correspondence to Lihui Chen.

About this article

Cite this article

Mei, JP., Chen, L. SumCR: A new subtopic-based extractive approach for text summarization. Knowl Inf Syst 31, 527–545 (2012). https://doi.org/10.1007/s10115-011-0437-x

Received: 01 September 2010
Revised: 01 June 2011
Accepted: 18 July 2011
Published: 06 August 2011
Issue Date: June 2012
DOI: https://doi.org/10.1007/s10115-011-0437-x
