DOI: 10.1145/3342558.3345394
research-article

An Effective Scheme for Generating An Overview Report over A Very Large Corpus of Documents

Published: 23 September 2019

Abstract

Efficiently generating an accurate, well-structured overview report (ORPT) over thousands of documents is challenging. A well-structured ORPT is divided into sections at multiple levels (e.g., a two-level structure consists of sections and subsections). No existing multi-document summarization (MDS) algorithm is suitable for this task. To overcome this obstacle, we devise NDORGS (Numerous Documents' Overview Report Generation Scheme), which integrates text filtering, keyword scoring, single-document summarization (SDS), topic modeling, MDS, and title generation to generate a coherent, well-structured ORPT. We then present a multi-criteria evaluation method that applies text mining and multi-attribute decision making to a combination of human judgments, running time, information coverage, and topic diversity. We evaluate ORPTs generated by NDORGS on two large corpora of documents, one classified and the other unclassified. Using Saaty's 9-point pairwise comparison scale and TOPSIS, we show that the ORPTs generated from single-document summaries whose length is 20% of the original documents are the best overall on both datasets.
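
The abstract names Saaty's pairwise comparison scale and TOPSIS as the tools for ranking ORPTs across several criteria. The sketch below illustrates how those two techniques are commonly combined; it is not taken from the paper. The function names `saaty_weights` and `topsis`, the geometric-mean approximation of the Saaty priority vector, and every number in the demo are assumptions made for illustration only.

```python
import math


def saaty_weights(pairwise):
    """Approximate criterion weights from a Saaty pairwise comparison matrix.

    pairwise[i][j] states how much more important criterion i is than j on
    the 1-9 scale, with pairwise[j][i] == 1 / pairwise[i][j].
    Assumption: the row geometric mean is used as an approximation of the
    principal eigenvector; the paper may compute the weights differently.
    """
    n = len(pairwise)
    geo_means = [math.prod(row) ** (1.0 / n) for row in pairwise]
    total = sum(geo_means)
    return [g / total for g in geo_means]


def topsis(matrix, weights, benefit):
    """Score alternatives with TOPSIS; higher closeness means better.

    matrix:  rows are alternatives, columns are criteria.
    weights: one non-negative weight per criterion.
    benefit: True where larger is better, False where smaller is better.
    Assumes every criterion column has at least one nonzero value.
    """
    n_crit = len(weights)
    total_w = sum(weights)
    w = [x / total_w for x in weights]

    # Vector-normalize each column, then apply the criterion weight.
    norms = [math.sqrt(sum(row[j] ** 2 for row in matrix)) for j in range(n_crit)]
    v = [[w[j] * row[j] / norms[j] for j in range(n_crit)] for row in matrix]

    # Ideal and anti-ideal points, taken per criterion.
    cols = list(zip(*v))
    ideal = [max(c) if benefit[j] else min(c) for j, c in enumerate(cols)]
    anti = [min(c) if benefit[j] else max(c) for j, c in enumerate(cols)]

    scores = []
    for row in v:
        d_pos = math.dist(row, ideal)  # distance to the ideal point
        d_neg = math.dist(row, anti)   # distance to the anti-ideal point
        denom = d_pos + d_neg
        # If all alternatives coincide, both distances are zero.
        scores.append(d_neg / denom if denom > 0 else 0.5)
    return scores


if __name__ == "__main__":
    # Hypothetical criteria order: human judgment, running time,
    # information coverage, topic diversity. All values are made up.
    pairwise = [
        [1, 3, 2, 2],
        [1 / 3, 1, 1 / 2, 1 / 2],
        [1 / 2, 2, 1, 1],
        [1 / 2, 2, 1, 1],
    ]
    weights = saaty_weights(pairwise)
    reports = [
        [4.0, 120.0, 0.60, 0.50],
        [4.5, 200.0, 0.70, 0.55],
        [4.2, 350.0, 0.75, 0.60],
    ]
    # Running time is a cost criterion, so smaller is better.
    scores = topsis(reports, weights, [True, False, True, True])
    for i, s in enumerate(scores):
        print(f"report {i}: closeness = {s:.3f}")
```

Treating running time as a cost criterion and the other three as benefit criteria is the design choice that lets a single closeness score trade speed against quality.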


Cited By

  • (2022) Deep Learning based Automatic Hindi Text Summarization. 2022 6th International Conference on Computing Methodologies and Communication (ICCMC), 1455-1461. DOI: 10.1109/ICCMC53470.2022.9753735. Online publication date: 29-Mar-2022
  • (2020) An unsupervised semantic sentence ranking scheme for text documents. Integrated Computer-Aided Engineering 28:1, 17-33. DOI: 10.3233/ICA-200626. Online publication date: 21-Dec-2020
  • (2020) AutoOverview: A Framework for Generating Structured Overviews over Many Documents. Complexity and Approximation, 113-150. DOI: 10.1007/978-3-030-41672-0_8. Online publication date: 21-Feb-2020


    Published In

    DocEng '19: Proceedings of the ACM Symposium on Document Engineering 2019
    September 2019
    254 pages
    ISBN:9781450368872
    DOI:10.1145/3342558

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. TOPSIS
    2. document generation
    3. multi-document summarization
    4. topic clustering

    Qualifiers

    • Research-article
    • Research
    • Refereed limited

    Conference

    DocEng '19
    DocEng '19: ACM Symposium on Document Engineering 2019
    September 23 - 26, 2019
    Berlin, Germany

    Acceptance Rates

    DocEng '19 Paper Acceptance Rate 30 of 77 submissions, 39%;
    Overall Acceptance Rate 194 of 564 submissions, 34%

    Bibliometrics

    Article Metrics

    • Downloads (Last 12 months): 5
    • Downloads (Last 6 weeks): 0
    Reflects downloads up to 03 Mar 2025
