DUC in context
Introduction
Recent years have seen increased interest in text summarization, with emphasis on the evaluation of prototype systems. Many factors can affect the design of such evaluations, requiring choices among competing alternatives. The realization of such designs seldom goes entirely as planned, and the evaluations have complex effects on the researchers and their work.
What issues have the major evaluations addressed, what choices have they made and why, and what have been the consequences? This paper examines several major themes running through the Document Understanding Conference (DUC) evaluations (2001–2006) but also present in the Summarization Evaluation Conference (SUMMAC) and the National Institute of Informatics Test Collection for IR Systems (NTCIR) workshops.
SUMMAC (Mani et al., 1999) was a large-scale evaluation of text summarization systems that took place in 1998 as part of the Defense Advanced Research Projects Agency (DARPA) TIPSTER program. Sixteen systems took part, and two major summarization tasks were evaluated. The Japanese NTCIR evaluations included summarization tasks in 2000, 2002, and 2004, with about 10 systems working on two different summarization tasks each year.
In 2000 a new summarization evaluation program began, again initially sponsored by DARPA. A group of expert summarization researchers contributed to a roadmap (Baldwin et al., 2000) that provided guidance for DUC, which had a pilot run in 2000 and its first major evaluation in the fall of 2001. The roadmap called for evaluation of summaries of both single documents and sets of multiple documents, at specified levels of text compression. It suggested that the initial evaluations be intrinsic (direct evaluation of the summary), with extrinsic evaluation (measuring how the summary affects performance on a task) phased in over time, along with requirements for deeper text-understanding techniques that could lead to more complex summaries.
Over the course of its first six years DUC has examined automatic single- and multi-document summarization of newspaper/wire articles, with both generic tasks and various focused tasks. The results have been evaluated in terms of linguistic quality as well as their completeness with respect to content chosen by human summarizers (or in comparison with very simple automatic systems run at NIST to serve as baselines). Participation has grown from 15 research groups to over three dozen.
Table 1 gives a quick summary of the various tasks and evaluation methodologies used in DUC from 2001 to 2006 and provides a chronological view of the DUC evaluations. This paper, however, examines DUC not chronologically but in the context of evaluation issues and of the state of the art in automatic summarization. Seven different but interconnected themes are explored:
1. Intrinsic versus extrinsic evaluation.
2. Generic versus focused summaries.
3. Single- and multi-document summaries.
4. Length and compression issues.
5. Extracts versus abstracts.
6. Issues with genre.
7. The evolution of specific DUC evaluation procedures and methods.
Intrinsic versus extrinsic evaluation
Two major types of evaluation have been used for testing summaries: intrinsic evaluation where the emphasis is on measuring the quality of the created summary directly, and extrinsic evaluation where the emphasis is on measuring how well the summary aids performance on a given task.
Extrinsic evaluation requires selecting a task that could make use of summarization and then measuring the effect of using automatic summaries instead of the original text. Critical issues here are the task selection
Generic versus focused summaries
The history of summarization has concentrated on the production of generic summaries, that is, summaries that are produced with only minimal specification regarding their intended situation, audience, and use. The idea of producing automatic abstracts of single documents was the initial driver of research, and generic summarization has formed the bulk of research up until recently. It was therefore natural that the DUC roadmap called for generic summary evaluation rather than using focused
Single- and multi-document summaries
DUC has addressed both single-document summarization and summarization of a set of documents on the same topic. The roadmap called for summarization of single documents – the traditional target of summarization systems. But the task of creating generic summaries of news articles (often by largely extractive means) turned out to be much less interesting than expected. Simple “take the lead sentence/paragraph” baselines could achieve very good results in news – the challenges of single-document
Length and compression issues
The length of the output summary was initially felt to be an important characteristic for users to be able to control and a key factor in system effectiveness to be investigated. In 2001 and 2002, target multi-document summary lengths of 50, 100, 200, and 400 words were set. While scores generally dropped as the target size decreased, results showed little difference in the relative performance of systems based on target size. Table 1 shows the various lengths that have been used for single-
Extracts and abstracts
The DUC organizers expected that participants, coming mostly from the natural language processing community, would quickly move beyond extraction to address the problems of deeper analysis of material to be summarized and to emphasize the synthesis of summaries. This has not generally happened except in the creation of very short summaries. It can be instructive to see what occurred and consider in retrospect why.
The approaches used in DUC have been largely extractive, i.e., they have been
Issues with genre
Newspaper articles are part of the vast open source literature of interest to many people including the US intelligence community. Such material has been the basis for research in information retrieval (TREC), information extraction (MUC), topic detection and tracking (TDT), and summarization (SUMMAC). In large part the choice of newspaper articles followed from their availability and the fact that research groups had already worked on this genre.
The use of newspaper and newswire text as
The evolution of DUC evaluation procedures and metrics
From the beginning, the DUC evaluations have tried to evaluate automatically produced summaries along two dimensions: their linguistic well-formedness and the degree to which their content agrees with human-created summaries of the same material (coverage). These two dimensions are not independent since extreme lack of well-formedness can affect the ability to judge content overlap. There has been significant evolution in the evaluation of both dimensions, especially in coverage.
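To make the content-coverage dimension concrete, the following is a minimal sketch of a unigram-recall score for a system ("peer") summary against human ("model") summaries, in the spirit of the n-gram overlap measures (e.g., ROUGE) that DUC later adopted. It is an illustration only, not NIST's actual scoring procedure; the tokenizer, the function names, and the simple averaging over models are assumptions made for this sketch.

```python
# Illustrative only: a unigram-recall coverage score in the spirit of
# ROUGE-style overlap measures, not the exact DUC/NIST procedure.
from collections import Counter
import re


def tokens(text):
    """Crude lowercase word tokenizer (an assumption made for this sketch)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def unigram_recall(peer_summary, model_summaries):
    """Average, over the model summaries, of the fraction of each model's
    unigrams (with multiplicity) that also appear in the peer summary."""
    peer_counts = Counter(tokens(peer_summary))
    scores = []
    for model in model_summaries:
        model_counts = Counter(tokens(model))
        overlap = sum(min(n, peer_counts[tok]) for tok, n in model_counts.items())
        total = sum(model_counts.values())
        scores.append(overlap / total if total else 0.0)
    return sum(scores) / len(scores) if scores else 0.0


if __name__ == "__main__":
    peer = "The storm hit the coast on Monday, forcing thousands to evacuate."
    models = ["A storm struck the coast Monday and thousands were evacuated.",
              "Thousands evacuated as the storm reached the coast."]
    print(round(unigram_recall(peer, models), 3))
```

Higher scores indicate that more of the human-chosen content has been recovered; as noted above, such overlap scores become less meaningful when a summary is severely ill-formed.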
Conclusions and prospects
Over the years, datasets, tasks, and systems have changed, as well as metrics and evaluation procedures. Nevertheless, DUC coverage results have been similar in the following ways:
- Most manual summaries are clearly better than most automatic summaries.
- Most automatic summaries do not differ significantly from one another.
- Automatic summaries at the extremes usually differ significantly.
- Automatic summaries have seldom performed better than simple baselines based on the structure of news articles; a sketch of such a lead baseline follows this list.
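As a concrete illustration of the structural baseline referred to in the last point, the sketch below simply returns the leading sentences of a news article up to a word budget. This is a hypothetical reconstruction of a "lead" baseline for illustration, not NIST's actual baseline implementation; the function name, the naive sentence splitter, and the 100-word default (one of the DUC target lengths) are assumptions.

```python
# Hypothetical sketch of a "lead" baseline: return the opening sentences of
# a news article up to a word budget (not NIST's actual baseline code).
import re


def lead_baseline(document, max_words=100):
    """Return leading whole sentences of `document` without exceeding max_words."""
    # Naive sentence split on terminal punctuation; adequate for a sketch.
    sentences = re.split(r"(?<=[.!?])\s+", document.strip())
    summary_words = []
    for sentence in sentences:
        words = sentence.split()
        if len(summary_words) + len(words) > max_words:
            break
        summary_words.extend(words)
    return " ".join(summary_words)
```

Because news articles front-load their most important content, even this trivial strategy has proved difficult for automatic systems to beat on coverage measures.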
References
- Baldwin, B., Donaway, R., Hovy, E., Liddy, E., Mani, I., Marcu, D., et al. (2000). An Evaluation Road Map for...
- Blair-Goldensohn, S. (2005). From Definitions to Complex Topics: Columbia University at DUC 2005. Available from...
- Blair-Goldensohn, S., Evans, D., Hatzivassiloglou, V., McKeown, K., Nenkova, A., Passonneau, B., et al. (2004)....
- Copeck, T., & Szpakowicz, S. (2004). Vocabulary Agreement Among Model Summaries and Source Documents. Available from...
- Dang, H.T. (2005). Overview of DUC 2005. Available from...
- D’Avanzo, E., & Magnini, B. (2005). A Keyphrase-Based Approach to Summarization: the LAKE System at DUC-2005. Available...
- et al. (2004). Extrinsic Evaluation of Automatic Metrics for Summarization...
- Farzindar, A., Rozon, F., & Lapalme, G. (2005). CATS A Topic-Oriented Multi-Document Summarization System at DUC 2005....
- Harman, D., & Over, P. (2004). The Effects of Human Variation in DUC Summarization Evaluation. In Proceedings of the...
- Harnly, A., Nenkova, A., Passonneau, R., & Rambow, O. (2005). Automation of summary evaluation by the pyramid method....