DUC in context
Introduction
Recent years have seen increased interest in text summarization, with emphasis on the evaluation of prototype systems. Many factors can affect the design of such evaluations, requiring choices among competing alternatives. The realization of such designs seldom goes entirely as planned, and the evaluations have complex effects on the researchers and their work.
What issues have the major evaluations addressed, what choices have they made and why, and what have been the consequences? This paper examines several major themes running through the Document Understanding Conference (DUC) evaluations (2001–2006) but also present in the Summarization Evaluation Conference (SUMMAC) and the National Institute of Informatics Test Collection for IR Systems (NTCIR) workshops.
SUMMAC (Mani et al., 1999) was a large-scale evaluation of text summarization systems that took place in 1998 as part of the Defense Advanced Research Projects Agency (DARPA) TIPSTER program. Sixteen systems took part, and two major summarization tasks were evaluated. The Japanese NTCIR evaluations included summarization tasks in 2000, 2002, and 2004, with about 10 systems working on two different summarization tasks each year.
In 2000 a new summarization evaluation program began, again initially sponsored by DARPA. A group of expert summarization researchers contributed to a roadmap (Baldwin et al., 2000) that provided guidance for DUC, which had a pilot run in 2000 and its first major evaluation in the fall of 2001. The roadmap called for evaluation of summaries of both single documents and sets of multiple documents, at specified levels of text compression. It suggested that the initial evaluations be intrinsic (direct evaluation of the summary), with extrinsic evaluation (measuring how the summary affects performance on a task) phased in over time, along with requirements for deeper text-understanding techniques that could lead to more complex summaries.
Over the course of its first six years DUC has examined automatic single- and multi-document summarization of newspaper/wire articles, with both generic tasks and various focused tasks. The results have been evaluated in terms of linguistic quality as well as their completeness with respect to content chosen by human summarizers (or in comparison with very simple automatic systems run at NIST to serve as baselines). Participation has grown from 15 research groups to over three dozen.
Table 1 gives a quick summary of the various tasks and evaluation methodologies used in DUC from 2001 to 2006 and provides a chronological view of the DUC evaluations. This paper, however, examines DUC not chronologically but in the context of evaluation issues and of the state of the art in automatic summarization. Seven different but interconnected themes are explored:
1. Intrinsic versus extrinsic evaluation.
2. Generic versus focused summaries.
3. Single- and multi-document summaries.
4. Length and compression issues.
5. Extracts versus abstracts.
6. Issues with genre.
7. The evolution of specific DUC evaluation procedures and methods.
Intrinsic versus extrinsic evaluation
Two major types of evaluation have been used for testing summaries: intrinsic evaluation where the emphasis is on measuring the quality of the created summary directly, and extrinsic evaluation where the emphasis is on measuring how well the summary aids performance on a given task.
Extrinsic evaluation requires selecting a task that could make use of summarization and then measuring the effect of using automatic summaries instead of the original text. Critical issues here are the task selection
Generic versus focused summaries
The history of summarization has concentrated on the production of generic summaries, that is, summaries that are produced with only minimal specification regarding their intended situation, audience, and use. The idea of producing automatic abstracts of single documents was the initial driver of research, and generic summarization has formed the bulk of research up until recently. It was therefore natural that the DUC roadmap called for generic summary evaluation rather than using focused
Single- and multi-document summaries
DUC has addressed both single-document summarization and summarization of a set of documents on the same topic. The roadmap called for summarization of single documents – the traditional target of summarization systems. But the task of creating generic summaries of news articles (often by largely extractive means) turned out to be much less interesting than expected. Simple “take the lead sentence/paragraph” baselines could achieve very good results in news – the challenges of single-document
Length and compression issues
The length of the output summary was initially felt to be an important characteristic for users to be able to control and a key factor in system effectiveness to be investigated. In 2001 and 2002, target multi-document summary lengths of 50, 100, 200, and 400 words were set. While scores generally dropped as the target size decreased, results showed little difference in the relative performance of systems based on target size. Table 1 shows the various lengths that have been used for single-
Extracts and abstracts
The DUC organizers expected that participants, coming mostly from the natural language processing community, would quickly move beyond extraction to address the problems of deeper analysis of material to be summarized and to emphasize the synthesis of summaries. This has not generally happened except in the creation of very short summaries. It can be instructive to see what occurred and consider in retrospect why.
The approaches used in DUC have been largely extractive, i.e., they have been
Issues with genre
Newspaper articles are part of the vast open source literature of interest to many people including the US intelligence community. Such material has been the basis for research in information retrieval (TREC), information extraction (MUC), topic detection and tracking (TDT), and summarization (SUMMAC). In large part the choice of newspaper articles followed from their availability and the fact that research groups had already worked on this genre.
The use of newspaper and newswire text as
The evolution of DUC evaluation procedures and metrics
From the beginning, the DUC evaluations have tried to evaluate automatically produced summaries along two dimensions: their linguistic well-formedness and the degree to which their content agrees with human-created summaries of the same material (coverage). These two dimensions are not independent since extreme lack of well-formedness can affect the ability to judge content overlap. There has been significant evolution in the evaluation of both dimensions, especially in coverage.
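To make the content-coverage dimension concrete, the following is a minimal sketch of a unigram-recall score for a system ("peer") summary against human ("model") summaries, in the spirit of the n-gram overlap measures (e.g., ROUGE) that DUC later adopted. It is an illustration only, not NIST's actual scoring procedure; the tokenizer, the function names, and the simple averaging over models are assumptions made for this sketch.

```python
# Illustrative only: a unigram-recall coverage score in the spirit of
# ROUGE-style overlap measures, not the exact DUC/NIST procedure.
from collections import Counter
import re


def tokens(text):
    """Crude lowercase word tokenizer (an assumption made for this sketch)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def unigram_recall(peer_summary, model_summaries):
    """Average, over the model summaries, of the fraction of each model's
    unigrams (with multiplicity) that also appear in the peer summary."""
    peer_counts = Counter(tokens(peer_summary))
    scores = []
    for model in model_summaries:
        model_counts = Counter(tokens(model))
        overlap = sum(min(n, peer_counts[tok]) for tok, n in model_counts.items())
        total = sum(model_counts.values())
        scores.append(overlap / total if total else 0.0)
    return sum(scores) / len(scores) if scores else 0.0


if __name__ == "__main__":
    peer = "The storm hit the coast on Monday, forcing thousands to evacuate."
    models = ["A storm struck the coast Monday and thousands were evacuated.",
              "Thousands evacuated as the storm reached the coast."]
    print(round(unigram_recall(peer, models), 3))
```

Higher scores indicate that more of the human-chosen content has been recovered; as noted above, such overlap scores become less meaningful when a summary is severely ill-formed.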
Conclusions and prospects
Over the years, datasets, tasks, and systems have changed, as well as metrics and evaluation procedures. Nevertheless, DUC coverage results have been similar in the following ways:
- Most manual summaries are clearly better than most automatic summaries.
- Most automatic summaries do not differ significantly from one another.
- Automatic summaries at the extremes usually differ significantly.
- Automatic summaries have seldom performed better than simple baselines based on the structure of news articles; a sketch of such a lead baseline follows this list.
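As a concrete illustration of the structural baseline referred to in the last point, the sketch below simply returns the leading sentences of a news article up to a word budget. This is a hypothetical reconstruction of a "lead" baseline for illustration, not NIST's actual baseline implementation; the function name, the naive sentence splitter, and the 100-word default (one of the DUC target lengths) are assumptions.

```python
# Hypothetical sketch of a "lead" baseline: return the opening sentences of
# a news article up to a word budget (not NIST's actual baseline code).
import re


def lead_baseline(document, max_words=100):
    """Return leading whole sentences of `document` without exceeding max_words."""
    # Naive sentence split on terminal punctuation; adequate for a sketch.
    sentences = re.split(r"(?<=[.!?])\s+", document.strip())
    summary_words = []
    for sentence in sentences:
        words = sentence.split()
        if len(summary_words) + len(words) > max_words:
            break
        summary_words.extend(words)
    return " ".join(summary_words)
```

Because news articles front-load their most important content, even this trivial strategy has proved difficult for automatic systems to beat on coverage measures.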
References
- Baldwin, B., Donaway, R., Hovy, E., Liddy, E., Mani, I., Marcu, D., et al. (2000). An Evaluation Road Map for...
- Blair-Goldensohn, S. (2005). From Definitions to Complex Topics: Columbia University at DUC 2005. Available from...
- Blair-Goldensohn, S., Evans, D., Hatzivassiloglou, V., McKeown, K., Nenkova, A., Passonneau, B., et al. (2004)....
- Copeck, T., & Szpakowicz, S. (2004). Vocabulary Agreement Among Model Summaries and Source Documents. Available from...
- Dang, H.T. (2005). Overview of DUC 2005. Available from...
- D’Avanzo, E., & Magnini, B. (2005). A Keyphrase-Based Approach to Summarization: the LAKE System at DUC-2005. Available...
- et al. (2004). Extrinsic Evaluation of Automatic Metrics for Summarization...
- Farzindar, A., Rozon, F., & Lapalme, G. (2005). CATS A Topic-Oriented Multi-Document Summarization System at DUC 2005....
- Harman, D., & Over, P. (2004). The Effects of Human Variation in DUC Summarization Evaluation. In Proceedings of the...
- Harnly, A., Nenkova, A., Passonneau, R., & Rambow, O. (2005). Automation of summary evaluation by the pyramid method....