Summarizing Web Documents Using Sequence Labeling with User-Generated Content and Third-Party Sources

  • Conference paper
Natural Language Processing and Information Systems (NLDB 2017)

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 10260)

Abstract

This paper presents SoCRFSum, a summarization model that integrates user-generated content, such as comments, and third-party sources, such as relevant articles, to generate a high-quality summary of a Web document. Summarization is formulated as a sequence labeling problem that exploits external information to model sentences and comments. After modeling, Conditional Random Fields are used for sentence selection. SoCRFSum was validated on a dataset collected from Yahoo News. The results indicate that integrating user-generated and third-party information improves ROUGE scores over state-of-the-art baselines.
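
The sequence labeling view can be illustrated with a minimal sketch, which is not the authors' implementation. It assumes each sentence already has a score for the labels 0 (skip) and 1 (summary) and that label-to-label transition scores are given. In a trained CRF these would come from learned feature weights; here the function name `viterbi_select` and all scores are hypothetical. The best label sequence is then recovered with Viterbi decoding.

```python
from typing import List, Sequence


def viterbi_select(emissions: Sequence[Sequence[float]],
                   transitions: Sequence[Sequence[float]]) -> List[int]:
    """Return the highest-scoring label sequence (0 = skip, 1 = summary).

    emissions[t][y]   : hypothetical score of label y for sentence t
    transitions[p][y] : hypothetical score of moving from label p to label y
    """
    n = len(emissions)
    if n == 0:
        return []
    k = len(transitions)

    # score[t][y] = best total score of any path ending in label y at step t
    score = [list(emissions[0])]
    back = []  # back[t-1][y] = best previous label for label y at step t
    for t in range(1, n):
        row, ptr = [], []
        for cur in range(k):
            best_prev = max(range(k),
                            key=lambda p: score[-1][p] + transitions[p][cur])
            row.append(score[-1][best_prev]
                       + transitions[best_prev][cur]
                       + emissions[t][cur])
            ptr.append(best_prev)
        score.append(row)
        back.append(ptr)

    # Trace the best path backwards from the best final label.
    label = max(range(k), key=lambda c: score[-1][c])
    path = [label]
    for ptr in reversed(back):
        label = ptr[label]
        path.append(label)
    path.reverse()
    return path


# Assumed toy scores: two sentences, with a penalty for selecting both.
labels = viterbi_select([[0.0, 2.0], [0.0, 1.0]],
                        [[0.0, 0.0], [0.0, -5.0]])
summary_indices = [i for i, y in enumerate(labels) if y == 1]
```

With these toy scores, the penalty on adjacent summary labels means only the first sentence is selected, which shows how a sequence model can trade off per-sentence evidence against label context.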

Notes

  1. http://news.yahoo.com.

  2. http://duc.nist.gov/data.html.

  3. https://www.yahoo.com/news/.

  4. https://www.google.com.

  5. We remove stopwords when modeling all features.

  6. http://mallet.cs.umass.edu.

  7. https://en.wikipedia.org/wiki/Viterbi_algorithm.

  8. We do this because baselines also pick up top m sentences.

  9. http://snowball.tartarus.org/algorithms/porter/stemmer.html.

  10. https://code.google.com/p/word2vec/.

  11. https://kheafield.com/code/kenlm/.

  12. https://code.google.com/p/louie-nlp/source/browse/trunk/louie-ml/src/main/java/org/louie/ml/lexrank/?r=10.

  13. http://nlp.stanford.edu/software/corenlp.shtml.

  14. http://www.csie.ntu.edu.tw/~cjlin/libsvm/.

  15. http://kavita-ganesan.com/content/rouge-2.0-documentation.

  16. http://people.cs.umass.edu/~vdang/ranklib.html.

  17. https://www.cs.cornell.edu/people/tj/svm_light/svm_rank.html.

Acknowledgements

This work was supported by JSPS KAKENHI Grant Number JP15K16048, JSPS KAKENHI Grant Number JP15K12094, and JST CREST Grant Number JPMJCR1513, Japan.

Author information

Corresponding author

Correspondence to Minh-Tien Nguyen.

Copyright information

© 2017 Springer International Publishing AG

About this paper

Cite this paper

Nguyen, MT., Tran, DV., Tran, CX., Nguyen, ML. (2017). Summarizing Web Documents Using Sequence Labeling with User-Generated Content and Third-Party Sources. In: Frasincar, F., Ittoo, A., Nguyen, L., Métais, E. (eds) Natural Language Processing and Information Systems. NLDB 2017. Lecture Notes in Computer Science, vol 10260. Springer, Cham. https://doi.org/10.1007/978-3-319-59569-6_54

  • DOI: https://doi.org/10.1007/978-3-319-59569-6_54

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-59568-9

  • Online ISBN: 978-3-319-59569-6

  • eBook Packages: Computer Science (R0)
