Duplicate News Story Detection Revisited

  • Conference paper
Information Retrieval Technology (AIRS 2013)

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 8281)

Abstract

In this paper, we investigate near-duplicate detection, with a particular focus on evolving news stories. These stories often consist primarily of syndicated information, with locally replaced headlines and captions and added locally relevant content. By detecting near-duplicates, we can offer users only those stories whose content differs materially from previously viewed versions of the story. We expand on previous work and improve the performance of near-duplicate document detection by weighting the phrases in a sliding window according to two quantities: the term frequency, within the document, of the terms in that window, and the inverse document frequency of those phrases. We experiment on a subset of a publicly available web collection composed solely of documents from news web sites. News articles are particularly challenging because of the prevalence of syndicated articles, where very similar articles are run with different headlines and surrounded by different HTML markup and site templates. We evaluate these algorithmic weightings using human judgments of similarity. We find that our techniques outperform the state of the art with statistical significance and are more discriminating when faced with a diverse collection of documents.
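
To make the weighting idea in the abstract concrete, the following is a minimal sketch of one possible reading of it, not the authors' implementation. The abstract does not give the exact formula, so several choices here are assumptions: the function names, the window length `k=3`, the way the two factors are combined (summed in-document term frequency of the window's terms, multiplied by a smoothed phrase-level IDF), and the use of weighted Jaccard similarity to compare two documents.

```python
import math
import re
from collections import Counter


def tokens(text):
    """Lowercase word tokens; a deliberately simple tokenizer (assumption)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def phrases(toks, k=3):
    """All k-word phrases produced by sliding a window over the tokens."""
    return [tuple(toks[i:i + k]) for i in range(len(toks) - k + 1)]


def phrase_weights(docs, k=3):
    """Return one {phrase: weight} dict per document.

    Assumed weighting (not taken from the paper): the summed in-document
    term frequency of the window's terms, times a smoothed IDF of the phrase.
    """
    tokenized = [tokens(d) for d in docs]
    doc_phrases = [phrases(t, k) for t in tokenized]
    # Document frequency: number of documents that contain each phrase.
    df = Counter(p for ps in doc_phrases for p in set(ps))
    n = len(docs)
    weighted = []
    for toks, ps in zip(tokenized, doc_phrases):
        tf = Counter(toks)
        weights = {}
        for p in set(ps):
            # The +1 keeps the IDF positive even for phrases in every document.
            idf = math.log((n + 1) / df[p])
            weights[p] = sum(tf[t] for t in p) * idf
        weighted.append(weights)
    return weighted


def weighted_jaccard(a, b):
    """Sum of per-phrase minima divided by sum of per-phrase maxima."""
    keys = set(a) | set(b)
    num = sum(min(a.get(p, 0.0), b.get(p, 0.0)) for p in keys)
    den = sum(max(a.get(p, 0.0), b.get(p, 0.0)) for p in keys)
    return num / den if den else 0.0


# Toy collection: two versions of one syndicated story plus an unrelated one.
docs = [
    "Storm hits coast: officials urge residents to evacuate low-lying areas",
    "Local update: officials urge residents to evacuate low-lying areas today",
    "Stock markets rally as tech shares climb to record highs",
]
w = phrase_weights(docs)
sim_syndicated = weighted_jaccard(w[0], w[1])
sim_unrelated = weighted_jaccard(w[0], w[2])
```

In this toy collection the two versions of the storm story share several three-word phrases and so receive a positive similarity, while the unrelated story shares none and scores zero. A real system would also need to strip site templates and markup before tokenizing, and would likely estimate the similarity with sketches rather than exact phrase sets; neither is shown here.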

Copyright information

© 2013 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Alonso, O., Fetterly, D., Manasse, M. (2013). Duplicate News Story Detection Revisited. In: Banchs, R.E., Silvestri, F., Liu, TY., Zhang, M., Gao, S., Lang, J. (eds) Information Retrieval Technology. AIRS 2013. Lecture Notes in Computer Science, vol 8281. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-45068-6_18

  • DOI: https://doi.org/10.1007/978-3-642-45068-6_18

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-45067-9

  • Online ISBN: 978-3-642-45068-6

  • eBook Packages: Computer Science, Computer Science (R0)
