Skip to main content

A Zipf-Like Distant Supervision Approach for Multi-document Summarization Using Wikinews Articles

  • Conference paper
String Processing and Information Retrieval (SPIRE 2012)

Part of the book series: Lecture Notes in Computer Science ((LNTCS,volume 7608))

Included in the following conference series:

  • 1232 Accesses

Abstract

This work presents a sentence ranking strategy based on distant supervision for the multi-document summarization problem. Due to the difficulty of obtaining large training datasets formed by document clusters and their respective human-made summaries, we propose building a training and a testing corpus from Wikinews. Wikinews articles are modeled as “distant” summaries of their cited sources, considering that first sentences of Wikinews articles tend to summarize the event covered in the news story. Sentences from cited sources are represented as tuples of numerical features and labeled according to a relationship with the given distant summary that is based on the Zipf law. Ranking functions are trained using linear regressions and ranking SVMs, which are also combined using Borda count. Top ranked sentences are concatenated and used to build summaries, which are compared with the first sentences of the distant summary using ROUGE evaluation measures. Experimental results obtained show the effectiveness of the proposed method and that the combination of different ranking techniques outperforms the quality of the generated summary.

This is a preview of subscription content, log in via an institution to check access.

Access this chapter

Chapter
USD 29.95
Price excludes VAT (USA)
  • Available as PDF
  • Read on any device
  • Instant download
  • Own it forever
eBook
USD 39.99
Price excludes VAT (USA)
  • Available as PDF
  • Read on any device
  • Instant download
  • Own it forever
Softcover Book
USD 54.99
Price excludes VAT (USA)
  • Compact, lightweight edition
  • Dispatched in 3 to 5 business days
  • Free shipping worldwide - see info

Tax calculation will be finalised at checkout

Purchases are for personal use only

Institutional subscriptions

Preview

Unable to display preview. Download preview PDF.

Unable to display preview. Download preview PDF.

References

  1. Aslam, J.A., Montague, M.: Models for metasearch. In: SIGIR 2001: Proceedings of the 24th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 276–284. ACM, New York (2001)

    Chapter  Google Scholar 

  2. Bravo-Marquez, F., L’Huillier, G., Ríos, S.A., Velásquez, J.D.: Hypergeometric Language Model and Zipf-Like Scoring Function for Web Document Similarity Retrieval. In: Chavez, E., Lonardi, S. (eds.) SPIRE 2010. LNCS, vol. 6393, pp. 303–308. Springer, Heidelberg (2010)

    Chapter  Google Scholar 

  3. Chali, Y., Hasan, S.A., Joty, S.R.: A SVM-Based Ensemble Approach to Multi-Document Summarization. In: Gao, Y., Japkowicz, N. (eds.) AI 2009. LNCS, vol. 5549, pp. 199–202. Springer, Heidelberg (2009)

    Chapter  Google Scholar 

  4. Erkan, G., Radev, D.R.: Lexrank: graph-based lexical centrality as salience in text summarization. J. Artif. Int. Res. 22(1), 457–479 (2004)

    Google Scholar 

  5. Geng, X., Liu, T.-Y., Qin, T., Li, H.: Feature selection for ranking. In: Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2007, pp. 407–414. ACM, New York (2007)

    Chapter  Google Scholar 

  6. Goldstein, J., Kantrowitz, M., Mittal, V., Carbonell, J.: Summarizing text documents: sentence selection and evaluation metrics. In: Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 1999, pp. 121–128. ACM, New York (1999)

    Chapter  Google Scholar 

  7. He, C., Wang, C., Zhong, Y.-X., Li, R.-F.: A survey on learning to rank. In: 2008 International Conference on Machine Learning and Cybernetics, pp. 1734–1739. IEEE (July 2008)

    Google Scholar 

  8. Herbrich, R., Graepel, T., Obermayer, K.: Large margin rank boundaries for ordinal regression. MIT Press, Cambridge (2000)

    Google Scholar 

  9. Joachims, T.: Training linear svms in linear time. In: Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD 2006, pp. 217–226. ACM, New York (2006)

    Chapter  Google Scholar 

  10. Kohlschütter, C., Fankhauser, P., Nejdl, W.: Boilerplate detection using shallow text features. In: Proceedings of the Third ACM International Conference on Web Search and Data Mining, WSDM 2010, pp. 441–450. ACM, New York (2010)

    Chapter  Google Scholar 

  11. Levenshtein, V.I.: Binary Codes Capable of Correcting Deletions, Insertions and Reversals. Soviet Physics Doklady 10, 707 (1966)

    MathSciNet  Google Scholar 

  12. Litvak, M., Last, M., Friedman, M.: A new approach to improving multilingual summarization using a genetic algorithm. In: Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, ACL 2010, pp. 927–936. Association for Computational Linguistics, Stroudsburg (2010)

    Google Scholar 

  13. Liu, F., Liu, Y.: Correlation between ROUGE and Human Evaluation of Extractive Meeting Summaries. In: Proceedings of ACL 2008: HLT, Short Papers, pp. 201–204. Association for Computational Linguistics, Columbus (2008)

    Google Scholar 

  14. Luhn, H.P.: The automatic creation of literature abstracts. IBM J. Res. Dev. 2, 159–165 (1958)

    Article  MathSciNet  Google Scholar 

  15. Mani, I.: Advances in Automatic Text Summarization. MIT Press, Cambridge (1999)

    Google Scholar 

  16. Manning, C.D., Raghavan, P., Schtze, H.: Introduction to Information Retrieval. Cambridge University Press, New York (2008)

    Book  MATH  Google Scholar 

  17. Mihalcea, R., Tarau, P.: TextRank: Bringing order into texts. In: Proceedings of EMNLP 2004 and the 2004 Conference on Empirical Methods in Natural Language Processing (July 2004)

    Google Scholar 

  18. Mintz, M., Bills, S., Snow, R., Jurafsky, D.: Distant supervision for relation extraction without labeled data. In: Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP: Volume 2, ACL 2009, pp. 1003–1011. Association for Computational Linguistics, Stroudsburg (2009)

    Google Scholar 

  19. Larocca Neto, J., Santos, A.D., Kaestner, C.A.A., Freitas, A.A.: Generating Text Summaries through the Relative Importance of Topics. In: Monard, M.C., Sichman, J.S. (eds.) SBIA 2000 and IBERAMIA 2000. LNCS (LNAI), vol. 1952, pp. 300–309. Springer, Heidelberg (2000)

    Chapter  Google Scholar 

  20. Radev, D.R., Jing, H., Budzikowska, M.: Centroid-based summarization of multiple documents: sentence extraction, utility-based evaluation, and user studies. In: Proceedings of the 2000 NAACL-ANLP Workshop on Automatic Summarization, NAACL-ANLP-AutoSum 2000, pp. 21–30. Association for Computational Linguistics, Stroudsburg (2000)

    Chapter  Google Scholar 

  21. Fisher, S., Roark, B.: Feature expansion for query-focused supervised sentence ranking. In: Document Understanding (DUC 2007) Workshop Papers and Agenda (2007)

    Google Scholar 

  22. Wan, X., Yang, J.: Multi-document summarization using cluster-based link analysis. In: Proceedings of the 31st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2008, pp. 299–306. ACM, New York (2008)

    Chapter  Google Scholar 

  23. Wang, C., Jing, F., Zhang, L., Zhang, H.-J.: Learning query-biased web page summarization. In: Proceedings of the Sixteenth ACM Conference on Information and Knowledge Management, CIKM 2007, pp. 555–562. ACM, New York (2007)

    Chapter  Google Scholar 

  24. Wang, D., Li, T.: Many are better than one: improving multi-document summarization via weighted consensus. In: SIGIR, pp. 809–810 (2010)

    Google Scholar 

  25. Lin, C.Y.: Rouge: a package for automatic evaluation of summaries. In: Workshop in Text Summarization, ACL, pp. 25–26 (2004)

    Google Scholar 

  26. Zipf, G.K.: Human Behavior and the Principle of Least Effort. Addison-Wesley (1949)

    Google Scholar 

Download references

Author information

Authors and Affiliations

Authors

Editor information

Editors and Affiliations

Rights and permissions

Reprints and permissions

Copyright information

© 2012 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Bravo-Marquez, F., Manriquez, M. (2012). A Zipf-Like Distant Supervision Approach for Multi-document Summarization Using Wikinews Articles. In: Calderón-Benavides, L., González-Caro, C., Chávez, E., Ziviani, N. (eds) String Processing and Information Retrieval. SPIRE 2012. Lecture Notes in Computer Science, vol 7608. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-34109-0_15

Download citation

  • DOI: https://doi.org/10.1007/978-3-642-34109-0_15

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-34108-3

  • Online ISBN: 978-3-642-34109-0

  • eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics