An Effective Scheme for Generating An Overview Report over A Very Large Corpus of Documents

Authors:

Jie Wang

DocEng '19: Proceedings of the ACM Symposium on Document Engineering 2019

Article No.: 18, Pages 1 - 11

https://doi.org/10.1145/3342558.3345394

Published: 23 September 2019

Abstract

Efficiently generating an accurate, well-structured overview report (ORPT) over thousands of documents is challenging. A well-structured ORPT is divided into sections at multiple levels (e.g., a two-level structure consists of sections and subsections). None of the existing multi-document summarization (MDS) algorithms is suitable for this task. To overcome this obstacle, we devise NDORGS (Numerous Documents' Overview Report Generation Scheme), which integrates text filtering, keyword scoring, single-document summarization (SDS), topic modeling, MDS, and title generation to generate a coherent, well-structured ORPT. We then present a multi-criteria evaluation method that applies text mining and multi-attribute decision making to a combination of human judgments, running time, information coverage, and topic diversity. We evaluate ORPTs generated by NDORGS on two large corpora of documents, one classified and the other unclassified. Using Saaty's pairwise comparison 9-point scale and TOPSIS, we show that the ORPTs generated from single-document summaries whose length is 20% of the original documents are the best overall on both datasets.
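
The abstract names TOPSIS as the method used to combine the evaluation criteria into one ranking. As a rough illustration of how such a ranking works in general, the following is a minimal sketch of the textbook TOPSIS procedure. It is not the authors' implementation: the function name `topsis`, the alternatives A, B, and C, the criterion values, and the weights are all hypothetical and are not taken from the paper. In the paper, the weights would presumably come from Saaty's pairwise comparisons, which this sketch does not reproduce.

```python
import math


def topsis(matrix, weights, benefit):
    """Score alternatives (rows) over criteria (columns) with textbook TOPSIS.

    matrix:  one row per alternative, one value per criterion
    weights: criterion weights summing to 1 (hypothetical in this sketch)
    benefit: True where larger is better, False where smaller is better
    Returns closeness scores in [0, 1]; a higher score is better.
    Assumes no criterion column is all zeros and the rows are not all identical.
    """
    n_crit = len(weights)
    # Vector-normalize each column, then scale it by its criterion weight.
    norms = [math.sqrt(sum(row[j] ** 2 for row in matrix)) for j in range(n_crit)]
    v = [[weights[j] * row[j] / norms[j] for j in range(n_crit)] for row in matrix]
    # The ideal and anti-ideal points depend on each criterion's direction.
    cols = list(zip(*v))
    ideal = [max(c) if b else min(c) for c, b in zip(cols, benefit)]
    worst = [min(c) if b else max(c) for c, b in zip(cols, benefit)]
    scores = []
    for row in v:
        d_best = math.dist(row, ideal)
        d_worst = math.dist(row, worst)
        scores.append(d_worst / (d_best + d_worst))
    return scores


# Hypothetical values only. Columns mirror the four criteria named in the
# abstract: human judgment, running time (a cost), coverage, diversity.
labels = ["A", "B", "C"]
matrix = [
    [7, 120, 0.60, 0.50],
    [9, 100, 0.80, 0.70],
    [5, 150, 0.40, 0.30],
]
weights = [0.4, 0.1, 0.3, 0.2]        # assumed, not derived from Saaty's scale
benefit = [True, False, True, True]   # running time is better when smaller

scores = topsis(matrix, weights, benefit)
best = labels[max(range(len(scores)), key=lambda i: scores[i])]
print(best, scores)
```

Running time is treated as a cost criterion, so its ideal value is the column minimum, while the other three criteria are treated as benefits.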

Cited By

Shah A, Zanzmera D, Mehta K (2022). Deep Learning based Automatic Hindi Text Summarization. 2022 6th International Conference on Computing Methodologies and Communication (ICCMC), 1455-1461. Online publication date: 29-Mar-2022. https://doi.org/10.1109/ICCMC53470.2022.9753735

Zhang H, Wang J (2020). An unsupervised semantic sentence ranking scheme for text documents. Integrated Computer-Aided Engineering 28:1, 17-33. Online publication date: 21-Dec-2020. https://doi.org/10.3233/ICA-200626

Wang J (2020). AutoOverview: A Framework for Generating Structured Overviews over Many Documents. Complexity and Approximation, 113-150. Online publication date: 21-Feb-2020. https://doi.org/10.1007/978-3-030-41672-0_8

Index Terms

An Effective Scheme for Generating An Overview Report over A Very Large Corpus of Documents
1. Applied computing
  1. Document management and text processing
    1. Document management
      1. Document metadata

Published In

DocEng '19: Proceedings of the ACM Symposium on Document Engineering 2019

September 2019

254 pages

ISBN:9781450368872

DOI:10.1145/3342558

General Chairs:
Uwe Borghoff,
Sonja Schimmler

Copyright © 2019 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.

Sponsors

SIGWEB: ACM Special Interest Group on Hypertext, Hypermedia, and Web

In-Cooperation

SIGDOC: ACM Special Interest Group for Design of Communications

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article
Research
Refereed limited

Conference

DocEng '19

Sponsor:

SIGWEB

DocEng '19: ACM Symposium on Document Engineering 2019

September 23 - 26, 2019

Berlin, Germany

Acceptance Rates

DocEng '19 Paper Acceptance Rate 30 of 77 submissions, 39%;

Overall Acceptance Rate 194 of 564 submissions, 34%
