A Zipf-Like Distant Supervision Approach for Multi-document Summarization Using Wikinews Articles

Bravo-Marquez, Felipe; Manriquez, Manuel

doi:10.1007/978-3-642-34109-0_15

Felipe Bravo-Marquez²⁰ &
Manuel Manriquez²¹

Part of the book series: Lecture Notes in Computer Science ((LNTCS,volume 7608))

Included in the following conference series:

International Symposium on String Processing and Information Retrieval

1232 Accesses

Abstract

This work presents a sentence ranking strategy based on distant supervision for the multi-document summarization problem. Due to the difficulty of obtaining large training datasets formed by document clusters and their respective human-made summaries, we propose building a training and a testing corpus from Wikinews. Wikinews articles are modeled as “distant” summaries of their cited sources, considering that first sentences of Wikinews articles tend to summarize the event covered in the news story. Sentences from cited sources are represented as tuples of numerical features and labeled according to a relationship with the given distant summary that is based on the Zipf law. Ranking functions are trained using linear regressions and ranking SVMs, which are also combined using Borda count. Top ranked sentences are concatenated and used to build summaries, which are compared with the first sentences of the distant summary using ROUGE evaluation measures. Experimental results obtained show the effectiveness of the proposed method and that the combination of different ranking techniques outperforms the quality of the generated summary.

This is a preview of subscription content, log in via an institution to check access.

Access this chapter

Log in via an institution

Chapter: USD 29.95; Price excludes VAT (USA)

eBook: USD 39.99; Price excludes VAT (USA)

Softcover Book: USD 54.99; Price excludes VAT (USA)

Tax calculation will be finalised at checkout

Purchases are for personal use only

Institutional subscriptions

Preview

Unable to display preview. Download preview PDF.

References

Aslam, J.A., Montague, M.: Models for metasearch. In: SIGIR 2001: Proceedings of the 24th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 276–284. ACM, New York (2001)
Chapter Google Scholar
Bravo-Marquez, F., L’Huillier, G., Ríos, S.A., Velásquez, J.D.: Hypergeometric Language Model and Zipf-Like Scoring Function for Web Document Similarity Retrieval. In: Chavez, E., Lonardi, S. (eds.) SPIRE 2010. LNCS, vol. 6393, pp. 303–308. Springer, Heidelberg (2010)
Chapter Google Scholar
Chali, Y., Hasan, S.A., Joty, S.R.: A SVM-Based Ensemble Approach to Multi-Document Summarization. In: Gao, Y., Japkowicz, N. (eds.) AI 2009. LNCS, vol. 5549, pp. 199–202. Springer, Heidelberg (2009)
Chapter Google Scholar
Erkan, G., Radev, D.R.: Lexrank: graph-based lexical centrality as salience in text summarization. J. Artif. Int. Res. 22(1), 457–479 (2004)
Google Scholar
Geng, X., Liu, T.-Y., Qin, T., Li, H.: Feature selection for ranking. In: Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2007, pp. 407–414. ACM, New York (2007)
Chapter Google Scholar
Goldstein, J., Kantrowitz, M., Mittal, V., Carbonell, J.: Summarizing text documents: sentence selection and evaluation metrics. In: Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 1999, pp. 121–128. ACM, New York (1999)
Chapter Google Scholar
He, C., Wang, C., Zhong, Y.-X., Li, R.-F.: A survey on learning to rank. In: 2008 International Conference on Machine Learning and Cybernetics, pp. 1734–1739. IEEE (July 2008)
Google Scholar
Herbrich, R., Graepel, T., Obermayer, K.: Large margin rank boundaries for ordinal regression. MIT Press, Cambridge (2000)
Google Scholar
Joachims, T.: Training linear svms in linear time. In: Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD 2006, pp. 217–226. ACM, New York (2006)
Chapter Google Scholar
Kohlschütter, C., Fankhauser, P., Nejdl, W.: Boilerplate detection using shallow text features. In: Proceedings of the Third ACM International Conference on Web Search and Data Mining, WSDM 2010, pp. 441–450. ACM, New York (2010)
Chapter Google Scholar
Levenshtein, V.I.: Binary Codes Capable of Correcting Deletions, Insertions and Reversals. Soviet Physics Doklady 10, 707 (1966)
MathSciNet Google Scholar
Litvak, M., Last, M., Friedman, M.: A new approach to improving multilingual summarization using a genetic algorithm. In: Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, ACL 2010, pp. 927–936. Association for Computational Linguistics, Stroudsburg (2010)
Google Scholar
Liu, F., Liu, Y.: Correlation between ROUGE and Human Evaluation of Extractive Meeting Summaries. In: Proceedings of ACL 2008: HLT, Short Papers, pp. 201–204. Association for Computational Linguistics, Columbus (2008)
Google Scholar
Luhn, H.P.: The automatic creation of literature abstracts. IBM J. Res. Dev. 2, 159–165 (1958)
Article MathSciNet Google Scholar
Mani, I.: Advances in Automatic Text Summarization. MIT Press, Cambridge (1999)
Google Scholar
Manning, C.D., Raghavan, P., Schtze, H.: Introduction to Information Retrieval. Cambridge University Press, New York (2008)
Book MATH Google Scholar
Mihalcea, R., Tarau, P.: TextRank: Bringing order into texts. In: Proceedings of EMNLP 2004 and the 2004 Conference on Empirical Methods in Natural Language Processing (July 2004)
Google Scholar
Mintz, M., Bills, S., Snow, R., Jurafsky, D.: Distant supervision for relation extraction without labeled data. In: Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP: Volume 2, ACL 2009, pp. 1003–1011. Association for Computational Linguistics, Stroudsburg (2009)
Google Scholar
Larocca Neto, J., Santos, A.D., Kaestner, C.A.A., Freitas, A.A.: Generating Text Summaries through the Relative Importance of Topics. In: Monard, M.C., Sichman, J.S. (eds.) SBIA 2000 and IBERAMIA 2000. LNCS (LNAI), vol. 1952, pp. 300–309. Springer, Heidelberg (2000)
Chapter Google Scholar
Radev, D.R., Jing, H., Budzikowska, M.: Centroid-based summarization of multiple documents: sentence extraction, utility-based evaluation, and user studies. In: Proceedings of the 2000 NAACL-ANLP Workshop on Automatic Summarization, NAACL-ANLP-AutoSum 2000, pp. 21–30. Association for Computational Linguistics, Stroudsburg (2000)
Chapter Google Scholar
Fisher, S., Roark, B.: Feature expansion for query-focused supervised sentence ranking. In: Document Understanding (DUC 2007) Workshop Papers and Agenda (2007)
Google Scholar
Wan, X., Yang, J.: Multi-document summarization using cluster-based link analysis. In: Proceedings of the 31st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2008, pp. 299–306. ACM, New York (2008)
Chapter Google Scholar
Wang, C., Jing, F., Zhang, L., Zhang, H.-J.: Learning query-biased web page summarization. In: Proceedings of the Sixteenth ACM Conference on Information and Knowledge Management, CIKM 2007, pp. 555–562. ACM, New York (2007)
Chapter Google Scholar
Wang, D., Li, T.: Many are better than one: improving multi-document summarization via weighted consensus. In: SIGIR, pp. 809–810 (2010)
Google Scholar
Lin, C.Y.: Rouge: a package for automatic evaluation of summaries. In: Workshop in Text Summarization, ACL, pp. 25–26 (2004)
Google Scholar
Zipf, G.K.: Human Behavior and the Principle of Least Effort. Addison-Wesley (1949)
Google Scholar

Download references

Author information

Authors and Affiliations

Department of Computer Science, University of Chile, Chile
Felipe Bravo-Marquez
University of Santiago of Chile, Chile
Manuel Manriquez

Authors

Felipe Bravo-Marquez
View author publications
You can also search for this author in PubMed Google Scholar
Manuel Manriquez
View author publications
You can also search for this author in PubMed Google Scholar

Editor information

Editors and Affiliations

Information Technologies Research Group, Universidad Autónoma de Bucaramanga, Bucaramanga, Colombia
Liliana Calderón-Benavides
Information Technologies and Research Group, Universidad Autónoma de Bucaramanga, Bucaramanga, Colombia
Cristina González-Caro
School of Physics and Mathematics, Universidad Michoacana, Edificio ”B”, Ciudad Universitaria,, 58000, Morelia, Mexico
Edgar Chávez
Department of Computer Science, Universidade Federal de Minas Gerais, Av. Antonio Carlos 6627, Pampulha, 31270-010, Belo Horizonte, Brazil
Nivio Ziviani

Rights and permissions

Reprints and permissions

Copyright information

About this paper

Cite this paper

Bravo-Marquez, F., Manriquez, M. (2012). A Zipf-Like Distant Supervision Approach for Multi-document Summarization Using Wikinews Articles. In: Calderón-Benavides, L., González-Caro, C., Chávez, E., Ziviani, N. (eds) String Processing and Information Retrieval. SPIRE 2012. Lecture Notes in Computer Science, vol 7608. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-34109-0_15

Download citation

DOI: https://doi.org/10.1007/978-3-642-34109-0_15
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-34108-3
Online ISBN: 978-3-642-34109-0
eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics