Summarizing Web Documents Using Sequence Labeling with User-Generated Content and Third-Party Sources

Nguyen, Minh-Tien; Tran, Duc-Vu; Tran, Chien-Xuan; Nguyen, Minh-Le

doi:10.1007/978-3-319-59569-6_54

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 10260)

Included in the following conference series:

International Conference on Applications of Natural Language to Information Systems

Abstract

This paper presents SoCRFSum, a summarization model that integrates user-generated content (comments) and third-party sources (relevant articles) of a Web document to generate a high-quality summary. Summarization is formulated as a sequence labeling problem in which external information supports the modeling of sentences and comments. After modeling, Conditional Random Fields are adopted for sentence selection. SoCRFSum was validated on a dataset collected from Yahoo News. Promising results indicate that, by integrating user-generated and third-party information, our method improves ROUGE scores over state-of-the-art baselines.
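To make the sequence labeling formulation concrete, the following is a minimal sketch of how a linear-chain model could label each sentence as summary (1) or non-summary (0) using Viterbi decoding. It is not the authors' implementation: the two feature names (overlap with comments, overlap with related articles), the bias term, and all weights and transition scores are hypothetical values set by hand for illustration, whereas a real Conditional Random Field would learn them from labeled data.

```python
from typing import List, Sequence

LABELS = (0, 1)  # 0 = sentence not in summary, 1 = sentence in summary


def emission_scores(features: Sequence[Sequence[float]],
                    weights: Sequence[Sequence[float]]) -> List[List[float]]:
    """Score each label for each sentence as a dot product of weights and features."""
    return [[sum(w * f for w, f in zip(weights[y], feats)) for y in LABELS]
            for feats in features]


def viterbi(emissions: Sequence[Sequence[float]],
            transitions: Sequence[Sequence[float]]) -> List[int]:
    """Return the highest-scoring label sequence for a linear-chain model.

    emissions[t][y]   : score of label y for sentence t
    transitions[p][y] : score of label p being followed by label y
    """
    if not emissions:
        return []
    n_labels = len(emissions[0])
    score = list(emissions[0])
    backptr = []
    for t in range(1, len(emissions)):
        new_score, ptr = [], []
        for y in range(n_labels):
            # Best previous label for ending at label y in position t.
            best_prev = max(range(n_labels),
                            key=lambda p: score[p] + transitions[p][y])
            new_score.append(score[best_prev] + transitions[best_prev][y]
                             + emissions[t][y])
            ptr.append(best_prev)
        score = new_score
        backptr.append(ptr)
    # Backtrack from the best final label.
    y = max(range(n_labels), key=lambda k: score[k])
    path = [y]
    for ptr in reversed(backptr):
        y = ptr[y]
        path.append(y)
    path.reverse()
    return path


# Hypothetical features per sentence:
# [overlap with comments, overlap with related articles, bias]
features = [
    [0.8, 0.6, 1.0],
    [0.1, 0.2, 1.0],
    [0.5, 0.4, 1.0],
    [0.0, 0.1, 1.0],
]
# Hand-set weights (assumption): label 0 scores zero, label 1 is rewarded
# for overlap with external information and penalized by the bias.
weights = [[0.0, 0.0, 0.0], [2.0, 2.0, -1.5]]
# Hand-set transitions (assumption): mild penalty for two summary sentences in a row.
transitions = [[0.0, 0.0], [0.0, -0.5]]

labels = viterbi(emission_scores(features, weights), transitions)
print(labels)  # prints [1, 0, 1, 0]
```

In this toy run the first and third sentences are selected because their overlap with the external sources outweighs the bias, which mirrors the idea that comments and related articles help decide which sentences matter.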

Notes

1. http://news.yahoo.com.
2. http://duc.nist.gov/data.html.
3. https://www.yahoo.com/news/.
4. https://www.google.com.
5. We remove stopwords when modeling all features.
6. http://mallet.cs.umass.edu.
7. https://en.wikipedia.org/wiki/Viterbi_algorithm.
8. We do this because baselines also pick up top m sentences.
9. http://snowball.tartarus.org/algorithms/porter/stemmer.html.
10. https://code.google.com/p/word2vec/.
11. https://kheafield.com/code/kenlm/.
12. https://code.google.com/p/louie-nlp/source/browse/trunk/louie-ml/src/main/java/org/louie/ml/lexrank/?r=10.
13. http://nlp.stanford.edu/software/corenlp.shtml.
14. http://www.csie.ntu.edu.tw/~cjlin/libsvm/.
15. http://kavita-ganesan.com/content/rouge-2.0-documentation.
16. http://people.cs.umass.edu/~vdang/ranklib.html.
17. https://www.cs.cornell.edu/people/tj/svm_light/svm_rank.html.

Acknowledgements

This work was supported by JSPS KAKENHI Grant Number JP15K16048, JSPS KAKENHI Grant Number JP15K12094, and JST CREST Grant Number JPMJCR1513, Japan.

Author information

Authors and Affiliations

Japan Advanced Institute of Science and Technology (JAIST), 1-1 Asahidai, Nomi, Ishikawa, 923-1292, Japan
Minh-Tien Nguyen, Duc-Vu Tran, Chien-Xuan Tran & Minh-Le Nguyen
Hung Yen University of Technology and Education (UTEHY), Hung Yen, Vietnam
Minh-Tien Nguyen

Corresponding author

Correspondence to Minh-Tien Nguyen.

Editor information

Editors and Affiliations

Erasmus University Rotterdam, Rotterdam, The Netherlands
Flavius Frasincar
University of Liège, Liège, Belgium
Ashwin Ittoo
Japan Advanced Institute of Science and Technology, Nomi, Japan
Le Minh Nguyen
Conservatoire National des Arts et Métiers, Paris, France
Elisabeth Métais

About this paper

Cite this paper

Nguyen, MT., Tran, DV., Tran, CX., Nguyen, ML. (2017). Summarizing Web Documents Using Sequence Labeling with User-Generated Content and Third-Party Sources. In: Frasincar, F., Ittoo, A., Nguyen, L., Métais, E. (eds) Natural Language Processing and Information Systems. NLDB 2017. Lecture Notes in Computer Science, vol 10260. Springer, Cham. https://doi.org/10.1007/978-3-319-59569-6_54

DOI: https://doi.org/10.1007/978-3-319-59569-6_54
Published: 02 June 2017
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-59568-9
Online ISBN: 978-3-319-59569-6
eBook Packages: Computer Science, Computer Science (R0)
