Abstract
While Web or newspaper archives store large amounts of articles, they also contain a lot of near-duplicate information. Examples include articles about the same event published by multiple news agencies or articles about evolving events that lead to copies of paragraphs to provide background information. To support journalists, who attempt to read all information on a given topic at once, we propose an approach that, given a topic and a text collection, extracts a set of articles with broad coverage of the topic and minimum amount of duplicates.
We start by extracting articles related to the input topic and detecting duplicate paragraphs. We keep only one instance from each group of duplicates by using a weighted quadratic optimization problem. It finds the best position for all paragraphs, such that some articles consist mainly of distinct paragraphs and others consist mainly of duplicates. Finally, we present to the reader the articles with more distinct paragraphs. Our experiments show the high precision and recall of our approach.
Access this chapter
Tax calculation will be finalised at checkout
Purchases are for personal use only
Preview
Unable to display preview. Download preview PDF.
Similar content being viewed by others
References
Agrawal, R., Gollapudi, S., Halverson, A., Ieong, S.: Diversifying search results. In: WSDM, pp. 5–14 (2009)
Carbonell, J., et al.: The use of MMR, diversity-based reranking for reordering documents and producing summaries. In: SIGIR, pp. 335–336 (1998)
Drosou, M., Pitoura, E.: Search result diversification. SIGMOD Rec. 39(1), 41–47 (2010)
Gollapudi, S., Sharma, A.: An axiomatic approach for result diversification. In: WWW, pp. 381–390 (2009)
Nenkova, A., McKeown, K.: Automatic summarization. Foundations and Trends in Information Retrieval 5(2-3), 103–233 (2011)
Parker, R., et al.: English Gigaword, 5th edn., Linguistic Data Consortium (2011)
Ravi, S.S., Rosenkrantz, D.J., Tayi, G.K.: Heuristic and special case algorithms for dispersion problems. Operations Research 42(2), 299–310 (1994)
Schlaefer, N., Chu-Carroll, J., Nyberg, E., Fan, J., Zadrozny, W., Ferrucci, D.: Statistical source expansion for question answering. In: CIKM, pp. 345–354 (2011)
Takamura, H., Okumura, M.: Text summarization model based on maximum coverage problem and its variant. In: EACL, pp. 781–789 (2009)
Taneva, B., Weikum, G.: Gem-based entity-knowledge maintenance. In: CIKM, pp. 149–158 (2013)
Author information
Authors and Affiliations
Editor information
Editors and Affiliations
Rights and permissions
Copyright information
© 2015 Springer International Publishing Switzerland
About this paper
Cite this paper
Schulte, C., Taneva, B., Weikum, G. (2015). On-topic Cover Stories from News Archives. In: Hanbury, A., Kazai, G., Rauber, A., Fuhr, N. (eds) Advances in Information Retrieval. ECIR 2015. Lecture Notes in Computer Science, vol 9022. Springer, Cham. https://doi.org/10.1007/978-3-319-16354-3_4
Download citation
DOI: https://doi.org/10.1007/978-3-319-16354-3_4
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-16353-6
Online ISBN: 978-3-319-16354-3
eBook Packages: Computer ScienceComputer Science (R0)