A hybrid machine learning model for multi-document summarization

Applied Intelligence

Abstract

This work proposes an approach that uses statistical tools to improve content selection in multi-document automatic text summarization. The method uses a trainable summarizer, which takes into account several features: the similarity of words among sentences, the similarity of words among paragraphs, the text format, cue-phrases, a score related to the frequency of terms in the whole document, the title, sentence location and the occurrence of non-essential information. The effect of each of these sentence features on the summarization task is investigated. These features are then used in combination to construct text summarizer models based on a maximum entropy model, a naive-Bayes classifier, and a support vector machine. To produce the final summary, the three models are combined into a hybrid model that ranks the sentences in order of importance. The performance of this new method has been tested using the DUC 2002 data corpus. The effectiveness of this technique is measured using the ROUGE score, and the results are promising when compared with some existing techniques.
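The ranking and evaluation steps described above can be illustrated with a minimal sketch. The abstract does not state how the three models are combined, so the unweighted mean used here is an assumption made for illustration only, not the method of the paper. The function names `hybrid_rank` and `rouge1_recall`, the model labels, and the example scores are likewise hypothetical. The second function computes only ROUGE-1 recall on whitespace tokens, a simplified stand-in for the full ROUGE toolkit.

```python
from collections import Counter
from typing import Dict, List, Sequence


def hybrid_rank(model_scores: Dict[str, Sequence[float]]) -> List[int]:
    """Return sentence indices ordered from most to least important.

    ASSUMPTION: the hybrid score is the unweighted mean of the per-model
    scores. The source does not specify the combination rule.
    """
    lengths = {len(scores) for scores in model_scores.values()}
    if len(lengths) != 1:
        raise ValueError("every model must score the same number of sentences")
    n_sentences = lengths.pop()
    n_models = len(model_scores)
    combined = [
        sum(scores[i] for scores in model_scores.values()) / n_models
        for i in range(n_sentences)
    ]
    # Highest combined score first; ties keep their original sentence order.
    return sorted(range(n_sentences), key=lambda i: combined[i], reverse=True)


def rouge1_recall(candidate: str, reference: str) -> float:
    """Clipped unigram overlap divided by the number of reference unigrams."""
    cand_counts = Counter(candidate.lower().split())
    ref_counts = Counter(reference.lower().split())
    total = sum(ref_counts.values())
    if total == 0:
        return 0.0
    overlap = sum(min(count, cand_counts[tok]) for tok, count in ref_counts.items())
    return overlap / total


if __name__ == "__main__":
    # Hypothetical per-sentence scores from the three classifiers.
    scores = {
        "maxent": [0.2, 0.9, 0.5],
        "naive_bayes": [0.1, 0.8, 0.6],
        "svm": [0.3, 0.7, 0.4],
    }
    print(hybrid_rank(scores))
    print(rouge1_recall("the cat sat", "the cat sat on the mat"))
```

A weighted mean or a voting scheme would fit the same interface; only the line that builds `combined` would change.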

Acknowledgement

This work is supported by the Deanship of Scientific Research, Taibah University, KSA.

Author information

Corresponding author

Correspondence to Mohamed Abdel Fattah.

About this article

Cite this article

Fattah, M.A. A hybrid machine learning model for multi-document summarization. Appl Intell 40, 592–600 (2014). https://doi.org/10.1007/s10489-013-0490-0

  • DOI: https://doi.org/10.1007/s10489-013-0490-0
