
MWI-Sum: A Multilingual Summarizer Based on Frequent Weighted Itemsets

Published: 21 September 2015

Abstract

Multidocument summarization selects a compact subset of highly informative sentences, i.e., the summary, from a collection of textual documents. Two strategies for sentence selection have been proposed, which can be used separately or together: (a) apply general-purpose techniques from data mining or information retrieval, and/or (b) perform advanced linguistic analysis relying on semantics-based models (e.g., ontologies) to capture the actual sentence meaning. Since there is an increasing need to process documents written in different languages, the research community has recently focused on summarizers based on strategy (a).
This article presents a novel multilingual summarizer, MWI-Sum (Multilingual Weighted Itemset-based Summarizer), which uses an itemset-based model to summarize collections of documents on the same topic. Unlike previous approaches, it extracts frequent weighted itemsets tailored to the analyzed collection and uses them to drive sentence selection. Weighted itemsets capture correlations among multiple highly relevant terms, which previous approaches neglect. Because the proposed approach makes minimal use of language-dependent analyses, it is easily applicable to document collections written in different languages.
Experiments on benchmark and real-life collections, both English and non-English, show that the proposed approach performs better than state-of-the-art multilingual document summarizers.
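
To make the idea of itemset-driven sentence selection concrete, the following is a minimal sketch and not the MWI-Sum algorithm itself. It rests on several assumptions that the abstract does not state: each sentence is treated as a transaction mapping terms to relevance weights in [0, 1]; the weighted support of an itemset is taken to be the mean, over all sentences, of the minimum weight of its terms (zero when a term is missing); itemsets are enumerated by brute force rather than by a pattern-growth miner; and sentences are picked greedily by how much not-yet-covered itemset support they contain. The function names, the toy weights, and the threshold are all hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def weighted_itemsets(transactions, min_support, max_size=2):
    """Return {itemset: weighted support} for itemsets above min_support.

    Assumption: an itemset's weight in one sentence is the minimum weight
    of its terms, and its support is the mean of those weights over all
    sentences (sentences lacking a term contribute zero).
    """
    totals = defaultdict(float)
    for weights in transactions:
        terms = sorted(weights)
        for size in range(1, max_size + 1):
            for itemset in combinations(terms, size):
                totals[itemset] += min(weights[t] for t in itemset)
    n = len(transactions)
    return {i: s / n for i, s in totals.items() if s / n >= min_support}


def select_sentences(transactions, itemsets, budget):
    """Greedily pick sentence indices that cover the itemsets.

    Each step takes the sentence whose terms contain the largest total
    support of itemsets not covered by earlier picks.
    """
    uncovered = dict(itemsets)
    chosen = []

    def covers(idx, itemset):
        return all(term in transactions[idx] for term in itemset)

    while uncovered and len(chosen) < budget:
        candidates = [i for i in range(len(transactions)) if i not in chosen]
        if not candidates:
            break
        gains = {
            i: sum(s for it, s in uncovered.items() if covers(i, it))
            for i in candidates
        }
        best = max(candidates, key=lambda i: gains[i])
        if gains[best] == 0:
            break  # no remaining sentence covers any uncovered itemset
        chosen.append(best)
        uncovered = {it: s for it, s in uncovered.items() if not covers(best, it)}
    return chosen


# Toy collection: one dict of term weights per sentence (hypothetical values).
sentences = [
    {"flood": 0.9, "city": 0.8},
    {"flood": 0.7, "city": 0.6, "rescue": 0.5},
    {"weather": 0.2},
    {"rescue": 0.9, "team": 0.8},
]
frequent = weighted_itemsets(sentences, min_support=0.3)
summary = select_sentences(sentences, frequent, budget=2)
print(sorted(frequent), summary)
```

In this toy run the second sentence alone contains every frequent itemset, so the greedy loop stops before the budget is used. Because the sketch only needs tokenized terms and weights, it involves no language-specific resources, which is in the spirit of the language-independence claim above.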



      Published In

      ACM Transactions on Information Systems, Volume 34, Issue 1
      October 2015
      172 pages
      ISSN:1046-8188
      EISSN:1558-2868
      DOI:10.1145/2806674
      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      Published: 21 September 2015
      Accepted: 01 July 2015
      Revised: 01 June 2015
      Received: 01 June 2014
      Published in TOIS Volume 34, Issue 1


      Author Tags

      1. Multilingual summarization
      2. frequent weighted itemset mining
      3. text mining

      Qualifiers

      • Research-article
      • Research
      • Refereed

      Cited By

      • (2024) A Survey on Transformer-Based Extractive Summarization Methods. 2024 19th Iranian Conference on Intelligent Systems (ICIS), 263-271. DOI: 10.1109/ICIS64839.2024.10887480. Online publication date: 23-Oct-2024
      • (2023) Accelerating large-scale weighted similarity queries based on external storage. Information Systems 117:C. DOI: 10.1016/j.is.2023.102213. Online publication date: 1-Jul-2023
      • (2023) Enhancing metaheuristic based extractive text summarization with fuzzy logic. Neural Computing and Applications 35:13, 9711-9723. DOI: 10.1007/s00521-023-08209-5. Online publication date: 2-Feb-2023
      • (2022) A Review of the Trends and Challenges in Adopting Natural Language Processing Methods for Education Feedback Analysis. IEEE Access 10, 56720-56739. DOI: 10.1109/ACCESS.2022.3177752. Online publication date: 2022
      • (2022) Extractive single-document summarization using adaptive binary constrained multi-objective differential evaluation. Innovations in Systems and Software Engineering. DOI: 10.1007/s11334-022-00474-2. Online publication date: 9-Aug-2022
      • (2022) Frequent item-set mining and clustering based ranked biomedical text summarization. The Journal of Supercomputing 79:1, 139-159. DOI: 10.1007/s11227-022-04578-1. Online publication date: 4-Jul-2022
      • (2021) MultiGBS. Journal of Biomedical Informatics 116:C. DOI: 10.1016/j.jbi.2021.103706. Online publication date: 1-Apr-2021
      • (2021) Extractive multi-document text summarization using dolphin swarm optimization approach. Multimedia Tools and Applications. DOI: 10.1007/s11042-020-10176-1. Online publication date: 6-Jan-2021
      • (2020) Combining Machine Learning and Natural Language Processing for Language-Specific, Multi-Lingual, and Cross-Lingual Text Summarization. Trends and Applications of Text Summarization Techniques, 1-31. DOI: 10.4018/978-1-5225-9373-7.ch001. Online publication date: 2020
      • (2020) The combination of term relations analysis and weighted frequent itemset model for multidocument summarization. Computational Intelligence 36:2, 783-812. DOI: 10.1111/coin.12270. Online publication date: 29-Jan-2020
