research-article

A Comparative Analysis on Hindi and English Extractive Text Summarization

Published: 09 May 2019
Abstract

Text summarization is the process of condensing a large document into a clear and concise form. In this article, we present a detailed comparative study of extractive methods for automatic text summarization on Hindi and English datasets of news articles. We consider 13 summarization techniques, namely TextRank, LexRank, Luhn, LSA, Edmundson, ChunkRank, TGraph, UniRank, NN-ED, NN-SE, FE-SE, SummaRuNNer, and MMR-SE, and evaluate their performance using metrics such as precision, recall, F1, cohesion, non-redundancy, readability, and significance. The analysis is carried out in eight parts, which show the strengths and limitations of these methods, the effect of summary length on performance, the impact of the document's language, and other factors. A standard summary evaluation tool (ROUGE) and extensive programmatic evaluation using Python 3.5 in an Anaconda environment are used to evaluate the outcomes.
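
To make the precision, recall, and F1 metrics named in the abstract concrete, the following is a minimal sketch of a unigram-overlap score between a candidate summary and a reference summary, in the spirit of ROUGE-1. It is not the authors' evaluation code or the ROUGE package itself: the function name `unigram_overlap_scores` is hypothetical, and plain whitespace tokenization with lowercasing is a simplifying assumption (no stemming or stop-word removal).

```python
from collections import Counter


def unigram_overlap_scores(candidate, reference):
    """Return (precision, recall, f1) from clipped unigram overlap.

    Assumption: tokens are whitespace-separated and lowercased; this is an
    illustrative simplification, not the tokenization used in the article.
    """
    cand_counts = Counter(candidate.lower().split())
    ref_counts = Counter(reference.lower().split())
    # Counter intersection keeps the minimum count of each shared token,
    # so a repeated word is only credited as often as the reference has it.
    overlap = sum((cand_counts & ref_counts).values())
    cand_total = sum(cand_counts.values())
    ref_total = sum(ref_counts.values())
    precision = overlap / cand_total if cand_total else 0.0
    recall = overlap / ref_total if ref_total else 0.0
    if precision + recall == 0:
        return 0.0, 0.0, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


if __name__ == "__main__":
    p, r, f = unigram_overlap_scores("the cat sat", "the cat sat on the mat")
    print(p, r, f)
```

Because the sketch only splits on whitespace, it applies equally to space-delimited Hindi text, although a real evaluation would likely need language-specific tokenization and normalization.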


      • Published in

        ACM Transactions on Asian and Low-Resource Language Information Processing, Volume 18, Issue 3
        September 2019
        386 pages
        ISSN: 2375-4699
        EISSN: 2375-4702
        DOI: 10.1145/3305347

        Copyright © 2019 ACM

        Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

        Publisher

        Association for Computing Machinery

        New York, NY, United States

        Publication History

        • Published: 9 May 2019
        • Accepted: 1 January 2019
        • Revised: 1 October 2018
        • Received: 1 September 2017

        Qualifiers

        • research-article
        • Research
        • Refereed
