research-article

Exploiting User Posts for Web Document Summarization

Authors:

Minh-Tien Nguyen,

Le-Minh Nguyen,

Xuan-Hieu Phan

ACM Transactions on Knowledge Discovery from Data (TKDD), Volume 12, Issue 4

Article No.: 49, Pages 1 - 28

https://doi.org/10.1145/3186566

Published: 08 June 2018

Abstract

Relevant user posts on a Web document, such as comments or tweets, provide valuable additional information that enriches the document's content. When writing such posts, readers tend to borrow salient words or phrases from the document's sentences, a phenomenon that can be regarded as word variation. This article proposes a framework that models word variation to improve the quality of Web document summarization. The framework consists of two steps: scoring and selection. In the first step, the social information of a Web document, such as user posts, is exploited to model intra-relations and inter-relations at the lexical and semantic levels. These relations are represented by a mutual reinforcement similarity graph, which is used to score each sentence and user post. In the second step, summaries are extracted either by a ranking approach or by a concept-based method formulated as an Integer Linear Programming problem. To confirm the effectiveness of the framework, sentence extraction and story highlight extraction were taken as case studies on three datasets in two languages, English and Vietnamese. Experimental results show that (i) the framework can improve ROUGE scores compared to state-of-the-art baselines for social context summarization and (ii) combining the two relations benefits sentence extraction for single Web documents.
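The two-step idea in the abstract (score, then select) can be illustrated with a minimal sketch. It is not the article's formulation: it assumes that intra-relation means similarity among the document's own sentences and inter-relation means similarity between sentences and user posts, it uses plain bag-of-words cosine similarity at the lexical level only, and the mixing weight `alpha`, the function names, and the toy data are all hypothetical. Selection uses simple top-k ranking rather than Integer Linear Programming.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def score_sentences(sentences, posts, alpha=0.5):
    """Score each sentence by mixing two similarity signals.

    intra: mean similarity to the other sentences of the document
    inter: mean similarity to the user posts
    alpha: hypothetical mixing weight (an assumption, not from the article)
    """
    sent_vecs = [Counter(tokenize(s)) for s in sentences]
    post_vecs = [Counter(tokenize(p)) for p in posts]
    scores = []
    for i, vec in enumerate(sent_vecs):
        others = [v for j, v in enumerate(sent_vecs) if j != i]
        intra = sum(cosine(vec, v) for v in others) / len(others) if others else 0.0
        inter = sum(cosine(vec, v) for v in post_vecs) / len(post_vecs) if post_vecs else 0.0
        scores.append(alpha * intra + (1 - alpha) * inter)
    return scores


def select_top(sentences, scores, k=1):
    """Ranking-based selection: keep the k best sentences, in document order."""
    best = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)[:k]
    return [sentences[i] for i in sorted(best)]


# Toy data, invented for illustration only.
sentences = [
    "The city council approved the new transit budget on Monday.",
    "Officials thanked residents for attending.",
    "The weather was mild.",
]
posts = [
    "Finally the transit budget got approved!",
    "Great news about the new transit budget",
]

scores = score_sentences(sentences, posts)
summary = select_top(sentences, scores, k=1)
print(summary)
```

In this toy case the first sentence is selected because the posts reuse several of its words, which is the borrowing behaviour the abstract calls word variation. A semantic-level variant would swap the bag-of-words vectors for embeddings; that choice is left open here.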

Cited By

Le Ngoc Thang, Nguyen Minh Tien, Do Nhat Minh, Nguyen Chi Thanh, Le Quang Minh (2024). A method to utilize prior knowledge for extractive summarization based on pre-trained language models. Vietnam Journal of Science and Technology. Online publication date: 5-Dec-2024.
https://doi.org/10.15625/2525-2518/20241
Dai Y, Zhang R, Qi J (2021). Automatic Webpage Briefing. 2021 IEEE 37th International Conference on Data Engineering (ICDE), 1727-1738. Online publication date: Apr-2021.
https://doi.org/10.1109/ICDE51399.2021.00152
Liu H, Wang L, Zhao P, Wu X (2019). Document Specific Supervised Keyphrase Extraction With Strong Semantic Relations. IEEE Access 7, 167507-167520. Online publication date: 2019.
https://doi.org/10.1109/ACCESS.2019.2948891

Index Terms

Exploiting User Posts for Web Document Summarization
1. Computing methodologies
  1. Artificial intelligence
    1. Natural language processing
2. Information systems
  1. Information retrieval
    1. Retrieval tasks and goals
      1. Summarization
  2. Information systems applications
    1. Data mining

Published In

ACM Transactions on Knowledge Discovery from Data Volume 12, Issue 4

August 2018

354 pages

ISSN:1556-4681

EISSN:1556-472X

DOI:10.1145/3208362

Editors:
Charu Aggarwal
IBM T. J. Watson Research, USA
,
Xindong Wu
University of Louisiana at Lafayette, USA

Copyright © 2018 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 08 June 2018

Accepted: 01 February 2018

Revised: 01 January 2018

Received: 01 September 2017

Published in TKDD Volume 12, Issue 4

Qualifiers

Research-article
Research
Refereed

Funding Sources

JSPS KAKENHI
Hung Yen University of Technology and Education; and QG.15.29
Vietnam National University, Hanoi (VNU)

Bibliometrics

Total Citations: 3
Total Downloads: 232
Downloads (Last 12 months): 7
Downloads (Last 6 weeks): 0

Reflects downloads up to 13 Feb 2025

