Multilingual Statistical News Summarization

Kabadjov, Mijail; Steinberger, Josef; Steinberger, Ralf

doi:10.1007/978-3-642-28569-1_11

Part of the book series: Theory and Applications of Natural Language Processing (NLP)

Abstract

In this chapter we present a generic approach for summarizing clusters of multilingual news articles such as the ones produced by the Europe Media Monitor (EMM) system. Our approach uses robust statistical techniques as well as multilingual tools for named entity recognition and disambiguation to produce entity-centered summaries. We ran experiments with the TAC 2008 and 2009 data sets (English corpora for summarization research) and obtained promising results: at TAC 2009 our runs attained top rank for linguistic quality and second best for overall responsiveness. We also ran a small-scale evaluation on languages other than English, which demonstrates the multilinguality of our approach and also provides evidence contradicting the pervasive assumption "if it works for English, it works for any language". Finally, we present an online system, currently under development, which will eventually incorporate all the elements of the summarization approach discussed here, and we show sample output summaries in various languages.


Notes

1. http://www.nist.gov/tac
2. http://www.project-syndicate.org/
3. http://www.project-syndicate.org/
4. The multilingual MeSH term recognition software was developed by Health-on-the-Net (HON); see http://www.hon.ch/
5. The use of the multilingual tools in higher-level applications can be seen at http://emm.newsexplorer.eu/
6. The Medical Subject Headings (MeSH) thesaurus is prepared by the US National Library of Medicine for indexing, cataloging, and searching for biomedical and health-related information and documents. Although it was initially meant for biomedical and health-related documents, it represents a large IS-A taxonomy and can therefore be used in more general tasks (http://www.nlm.nih.gov/mesh/meshhome.html).
7. Steinberger et al. [33] worked on monolingual single-document summarization.
8. The multilingual named entity disambiguator and geo-tagger developed at the JRC have already been used for cross-lingual linking of multilingual news clusters produced by the EMM system [34].
9. Note that the statistical test we used to attest significance was run against the improved version of the lexical-only summarizer and not the official TAC-08 scores, since we considered it the fairer comparison.
10. The purpose of the update summarization task is to produce a summary of only the novel information contained in a newer set of news articles with respect to an older set, both covering the same news story.
11. For more details on EMM see [1].
12. An online demo of the system is available at http://emm-labs.jrc.it/EMMLabs/NewsGist.html
13. http://tomcat.apache.org/
14. http://code.google.com/p/matrix-toolkits-java/
15. http://math.nist.gov/javanumerics/jama/

References

1. Atkinson, M., Van der Goot, E.: Near real time information mining in multilingual news. In: Proceedings of the 18th International World Wide Web Conference (WWW 2009), Madrid, pp. 1153–1154 (2009)
2. Barzilay, R., Elhadad, M.: Using lexical chains for text summarization. In: Mani, I. (ed.) Proceedings of the Workshop on Intelligent and Scalable Text Summarization at the Annual Joint Meeting of the ACL/EACL, Madrid (1997)
3. Barzilay, R., Lapata, M.: Modeling local coherence: an entity-based approach. In: Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics, Ann Arbor (2005)
4. Boguraev, B., Kennedy, C.: Salience-based content characterisation of text documents. In: Mani, I. (ed.) Proceedings of the Workshop on Intelligent and Scalable Text Summarization at the Annual Joint Meeting of the ACL/EACL, Madrid (1997)
5. Ding, C.H.Q.: A probabilistic model for latent semantic indexing. J. Am. Soc. Inf. Sci. Technol. 56(6), 597–608 (2005)
6. Edmundson, H.: New methods in automatic extracting. J. Assoc. Comput. Mach. 16(2), 264–285 (1969)
7. Gong, Y., Liu, X.: Generic text summarization using relevance measure and latent semantic analysis. In: Proceedings of ACM SIGIR, New Orleans (2002)
8. Grosz, B., Joshi, A., Weinstein, S.: Centering: a framework for modelling the local coherence of discourse. Comput. Linguist. 21(2), 203–225 (1995)
9. Hirschman, L.: MUC-7 coreference task definition, version 3.0. In: Chinchor, N. (ed.) Proceedings of the 7th Message Understanding Conference, Virginia. NIST (1998). Available online at http://www-nlpir.nist.gov/related_projects/muc/proceedings/muc_7_toc.html
10. Hovy, E., Lin, C.: Automated text summarization in SUMMARIST. In: Mani, I. (ed.) Proceedings of the Workshop on Intelligent and Scalable Text Summarization at the Annual Joint Meeting of the ACL/EACL, Madrid (1997)
11. Jones, K.S.: Automatic summarising: factors and directions. In: Mani, I., Maybury, M. (eds.) Advances in Automatic Text Summarization. MIT, Cambridge (1999)
12. Kabadjov, M.A.: A comprehensive evaluation of anaphora resolution and discourse-new recognition. Ph.D. thesis, Department of Computer Science, University of Essex (2007)
13. Kabadjov, M.A., Steinberger, J., Pouliquen, B., Steinberger, R., Poesio, M.: Multilingual statistical news summarisation: preliminary experiments with English. In: Proceedings of the Workshop on Intelligent Analysis and Processing of Web News Content at the IEEE/WIC/ACM International Conferences on Web Intelligence and Intelligent Agent Technology (WI-IAT), Milan (2009)
14. Kupiec, J., Pedersen, J., Chen, F.: A trainable document summarizer. In: Proceedings of the 18th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Seattle, pp. 68–73 (1995)
15. Lin, C.Y.: ROUGE: a package for automatic evaluation of summaries. In: Proceedings of the Workshop on Text Summarization Branches Out, Barcelona (2004)
16. Litvak, M., Last, M., Friedman, M.: A new approach to improving multilingual summarization using a genetic algorithm. In: Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, Uppsala, pp. 927–936. Association for Computational Linguistics (2010)
17. Luhn, H.: The automatic creation of literature abstracts. IBM J. Res. Dev. 2(2), 159–165 (1958)
18. Mani, I. (ed.): Proceedings of the Workshop on Intelligent and Scalable Text Summarization at the Annual Joint Meeting of the ACL/EACL, Madrid (1997)
19. Mani, I., Maybury, M. (eds.): Advances in Automatic Text Summarization. MIT, Cambridge (1999)
20. Marcu, D.: From discourse structures to text summaries. In: Mani, I. (ed.) Proceedings of the Workshop on Intelligent and Scalable Text Summarization at the Annual Joint Meeting of the ACL/EACL, Madrid (1997)
21. Maybury, M.: Generating summaries from event data. In: Mani, I., Maybury, M. (eds.) Advances in Automatic Text Summarization. MIT, Cambridge (1999)
22. McKeown, K., Radev, D.: Generating summaries of multiple news articles. In: Proceedings of the 18th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Seattle, pp. 74–82 (1995)
23. Nenkova, A., Louis, A.: Can you summarize this? Identifying correlates of input difficulty for generic multi-document summarization. In: Proceedings of the 46th Annual Meeting of the Association for Computational Linguistics, Columbus, pp. 825–833. Association for Computational Linguistics (2008)
24. Nenkova, A., Passonneau, R.: Evaluating content selection in summarization: the pyramid method. In: Proceedings of the Meeting of the North American Chapter of the Association for Computational Linguistics (NAACL), Boston (2004)
25. Nenkova, A., Passonneau, R., McKeown, K.: The pyramid method: incorporating human content selection variation in summarization evaluation. ACM Trans. Speech Lang. Process. 4(2), 4 (2007)
26. Over, P., Dang, H., Harman, D.: DUC in context. Inf. Process. Manag. 43(6), 1506–1520 (2007). Special Issue on Text Summarisation (Donna Harman, ed.)
27. Piskorski, J.: CORLEONE – core linguistic entity online extraction. Tech. Rep. EN 23393, Joint Research Centre of the European Commission (2008)
28. Pouliquen, B., Kimler, M., Steinberger, R., Ignat, C., Oellinger, T., Blackler, K., Fuart, F., Zaghouani, W., Widiger, A., Forslund, A.C., Best, C.: Geocoding multilingual texts: recognition, disambiguation and visualisation. In: Proceedings of the 5th International Conference on Language Resources and Evaluation (LREC 2006), Genoa, pp. 53–58 (2006)
29. Pouliquen, B., Steinberger, R.: Automatic construction of multilingual name dictionaries. In: Goutte, C., Cancedda, N., Dymetman, M., Foster, G. (eds.) Learning Machine Translation. NIPS series. MIT, Cambridge (2009)
30. Saggion, H., Torres-Moreno, J.M., da Cunha, I., SanJuan, E., Velazquez-Morales, P.: Multilingual summarization evaluation without human models. In: Proceedings of the International Conference on Computational Linguistics, Beijing, pp. 1059–1067 (2010)
31. Steinberger, J., Ježek, K.: Update summarization based on novel topic distribution. In: Proceedings of the 9th ACM DocEng, Munich (2009)
32. Steinberger, J., Kabadjov, M.A., Poesio, M., Sanchez-Graillet, O.: Improving LSA-based summarization with anaphora resolution. In: Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), Vancouver (2005)
33. Steinberger, J., Poesio, M., Kabadjov, M.A., Ježek, K.: Two uses of anaphora resolution in summarization. Inf. Process. Manag. 43(6), 1663–1680 (2007). Special Issue on Text Summarisation (Donna Harman, ed.)
34. Steinberger, R., Pouliquen, B., Ignat, C.: Using language-independent rules to achieve high multilinguality in text mining. In: Fogelman-Soulié, F., Perrotta, D., Piskorski, J., Steinberger, R. (eds.) Mining Massive Data Sets for Security. IOS-Press, Amsterdam/Holland (2009)
35. Stewart, J.G.: Genre oriented summarization. Ph.D. thesis, Language Technologies Institute, School of Computer Science, Carnegie Mellon University (2008)
36. Teufel, S., Moens, M.: Sentence extraction as a classification task. In: Mani, I. (ed.) Proceedings of the Workshop on Intelligent and Scalable Text Summarization at the Annual Joint Meeting of the ACL/EACL, Madrid (1997)
37. Turchi, M., Steinberger, J., Kabadjov, M., Steinberger, R.: Using parallel corpora for multilingual (multi-document) summarisation evaluation. In: Proceedings of CLEF-10, Padua, pp. 52–63. Springer, Berlin (2010)
38. Wan, X., Li, H., Xiao, J.: Cross-language document summarization based on machine translation quality prediction. In: Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, Uppsala, pp. 917–926. Association for Computational Linguistics (2010)

Acknowledgements

We would like to thank the EMM team for providing a stable and robust news gathering infrastructure.

Author information

Authors and Affiliations

EC Joint Research Centre, 21027, Ispra (VA), Italy
Mijail Kabadjov, Josef Steinberger & Ralf Steinberger

Corresponding author

Correspondence to Mijail Kabadjov.

Editor information

Editors and Affiliations

Universite Sorbonne Nouvelle, LATTICE-CNRS, Ecole Normale Superieure, rue d'Ulm 45, Paris, 75005, France
Thierry Poibeau
Information & Communication Technologies, Universitat Pompeu Fabra, C/ Tanger 122-140, Barcelona, 08018, Spain
Horacio Saggion
Institute for Computer Science, Polish Academy of Sciences, ul. Jana Kazimierza 5, Warsaw, 01-248, Poland
Jakub Piskorski
Department of Computer Science, University of Helsinki, Gustaf Hällströmin katu 2, Helsinki, 00014, Finland
Roman Yangarber

About this chapter

Cite this chapter

Kabadjov, M., Steinberger, J., Steinberger, R. (2013). Multilingual Statistical News Summarization. In: Poibeau, T., Saggion, H., Piskorski, J., Yangarber, R. (eds) Multi-source, Multilingual Information Extraction and Summarization. Theory and Applications of Natural Language Processing. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-28569-1_11

DOI: https://doi.org/10.1007/978-3-642-28569-1_11
Published: 12 July 2012
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-28568-4
Online ISBN: 978-3-642-28569-1
eBook Packages: Computer Science, Computer Science (R0)
