Duplicate News Story Detection Revisited

Alonso, Omar; Fetterly, Dennis; Manasse, Mark

doi:10.1007/978-3-642-45068-6_18

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 8281)

Included in the following conference series:

Asia Information Retrieval Symposium

Abstract

In this paper, we investigate near-duplicate detection, focusing on evolving news stories. These stories often consist primarily of syndicated content, with locally replaced headlines and captions and added locally relevant material. By detecting near-duplicates, we can offer users only those stories whose content differs materially from versions they have already viewed. We expand on previous work and improve the performance of near-duplicate document detection by weighting each phrase in a sliding window by the within-document term frequency of the terms in that window and by the inverse document frequency of the phrase. We experiment on a subset of a publicly available web collection composed solely of documents from news web sites. News articles are particularly challenging because syndicated articles are prevalent: very similar articles run under different headlines, surrounded by different HTML markup and site templates. We evaluate these algorithmic weightings using human judgments of similarity. We find that our techniques outperform the state of the art with statistical significance and are more discriminating when faced with a diverse collection of documents.
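
The weighting idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not give the exact formula, so the window length k, the smoothed IDF, the choice to multiply the mean term frequency of a window's terms by the phrase's IDF, and every function name below are assumptions made for illustration. The comparison uses an exact weighted Jaccard similarity rather than any sampling or sketching scheme.

```python
import math
from collections import Counter


def shingles(tokens, k=3):
    """Return every k-term phrase produced by a sliding window."""
    return [tuple(tokens[i:i + k]) for i in range(len(tokens) - k + 1)]


def phrase_idf(docs, k=3):
    """Smoothed inverse document frequency of each phrase (assumed form)."""
    df = Counter()
    for tokens in docs:
        df.update(set(shingles(tokens, k)))  # count each phrase once per doc
    n = len(docs)
    return {p: math.log((n + 1) / (c + 1)) + 1.0 for p, c in df.items()}


def weighted_phrases(tokens, idf, k=3):
    """Assumed weighting: mean in-document TF of the window's terms
    multiplied by the IDF of the phrase itself."""
    tf = Counter(tokens)
    weights = {}
    for p in shingles(tokens, k):
        mean_tf = sum(tf[t] for t in p) / k
        weights[p] = mean_tf * idf.get(p, 1.0)
    return weights


def weighted_jaccard(a, b):
    """Sum of per-phrase minima divided by the sum of per-phrase maxima."""
    keys = a.keys() | b.keys()
    num = sum(min(a.get(x, 0.0), b.get(x, 0.0)) for x in keys)
    den = sum(max(a.get(x, 0.0), b.get(x, 0.0)) for x in keys)
    return num / den if den else 0.0


if __name__ == "__main__":
    # Two versions of one syndicated story with different headlines,
    # plus an unrelated story (toy data, invented for this sketch).
    d1 = "local headline the storm hit the coast on monday night".split()
    d2 = "different title the storm hit the coast on monday night".split()
    d3 = "stocks rose sharply after the central bank cut rates".split()
    idf = phrase_idf([d1, d2, d3])
    w1, w2, w3 = (weighted_phrases(d, idf) for d in (d1, d2, d3))
    print("syndicated pair:", weighted_jaccard(w1, w2))
    print("unrelated pair: ", weighted_jaccard(w1, w3))
```

In this toy collection the two syndicated versions share most of their three-term phrases and so score well above the unrelated pair, which shares none. Any threshold for calling two stories near-duplicates would have to be tuned against judged data, as the abstract describes.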

Author information

Authors and Affiliations

Microsoft Corporation, USA
Omar Alonso
Microsoft Research, Silicon Valley Lab, USA
Dennis Fetterly & Mark Manasse

Editor information

Editors and Affiliations

Institute for Infocomm Research, Human Language Technology, 1 Fusionopolis Way #21-01, Connexis South, 138632, Singapore
Rafael E. Banchs, Min Zhang & Sheng Gao
Yahoo Labs, Avinguda Diagonal 177, 08018, Barcelona, Spain
Fabrizio Silvestri
Microsoft Research Asia, No. 5, Danling Street, Haidian District, 100080, Beijing, China
Tie-Yan Liu
Institute for Infocomm Research, Human Language Technology, 1 Fusionopolis Way #21-01, Connexis South, 138632, Singapore
Jun Lang

About this paper

Cite this paper

Alonso, O., Fetterly, D., Manasse, M. (2013). Duplicate News Story Detection Revisited. In: Banchs, R.E., Silvestri, F., Liu, TY., Zhang, M., Gao, S., Lang, J. (eds) Information Retrieval Technology. AIRS 2013. Lecture Notes in Computer Science, vol 8281. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-45068-6_18

DOI: https://doi.org/10.1007/978-3-642-45068-6_18
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-45067-9
Online ISBN: 978-3-642-45068-6
eBook Packages: Computer Science, Computer Science (R0)
