Abstract
In this paper, we investigate near-duplicate detection, focusing on the detection of evolving news stories. These stories often consist primarily of syndicated information, with locally replaced headlines and captions and added locally relevant content. By detecting near-duplicates, we can offer users only those stories whose content differs materially from previously viewed versions of the story. We expand on previous work and improve the performance of near-duplicate document detection by weighting each phrase in a sliding window according to the within-document term frequency of the terms in that window and the inverse document frequency of the phrase. We experiment on a subset of a publicly available web collection that consists solely of documents from news web sites. News articles are particularly challenging due to the prevalence of syndicated articles, where very similar articles are run with different headlines and surrounded by different HTML markup and site templates. We evaluate these algorithmic weightings using human judgments of similarity. We find that our techniques outperform the state of the art with statistical significance and are more discriminating when faced with a diverse collection of documents.
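To make the weighting idea concrete, the following is a minimal sketch, not the method evaluated in the paper. The abstract does not give the exact formula, so everything specific here is an assumption made for illustration: the window length `k=3`, whitespace tokenization, the smoothed IDF, the product `term_weight * idf`, the use of weighted Jaccard as the similarity measure, and all function names.

```python
import math
from collections import Counter


def shingles(text, k=3):
    """Return the k-word phrases produced by a sliding window over the tokens."""
    tokens = text.lower().split()
    return [tuple(tokens[i:i + k]) for i in range(len(tokens) - k + 1)]


def phrase_doc_freq(docs, k=3):
    """Count, for each phrase, the number of documents that contain it."""
    df = Counter()
    for doc in docs:
        df.update(set(shingles(doc, k)))
    return df


def weighted_shingles(text, doc_freq, n_docs, k=3):
    """Weight each phrase by in-document TF of its terms times the phrase IDF.

    Assumed formula (hypothetical, for illustration only):
    weight = (sum of term counts in the window / document length) * smoothed IDF.
    """
    tokens = text.lower().split()
    tf = Counter(tokens)
    weights = {}
    for phrase in shingles(text, k):
        term_weight = sum(tf[t] for t in phrase) / len(tokens)
        idf = math.log((1 + n_docs) / (1 + doc_freq.get(phrase, 0))) + 1.0
        weights[phrase] = term_weight * idf
    return weights


def weighted_jaccard(a, b):
    """Weighted Jaccard similarity: sum of minima over sum of maxima."""
    keys = set(a) | set(b)
    num = sum(min(a.get(p, 0.0), b.get(p, 0.0)) for p in keys)
    den = sum(max(a.get(p, 0.0), b.get(p, 0.0)) for p in keys)
    return num / den if den else 0.0


# Toy collection: two versions of one syndicated story and one unrelated story.
docs = [
    "local headline one the storm hit the coast on monday and thousands lost power",
    "different title here the storm hit the coast on monday and thousands lost power",
    "the council approved a new budget for schools and parks this week",
]
df = phrase_doc_freq(docs)
vectors = [weighted_shingles(d, df, len(docs)) for d in docs]
sim_syndicated = weighted_jaccard(vectors[0], vectors[1])
sim_unrelated = weighted_jaccard(vectors[0], vectors[2])
```

In this toy collection the two versions of the storm story share most of their phrases, so `sim_syndicated` comes out higher than `sim_unrelated`. A production system would more likely estimate this similarity with a sketching scheme than compute it exactly over all phrase pairs.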
Copyright information
© 2013 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Alonso, O., Fetterly, D., Manasse, M. (2013). Duplicate News Story Detection Revisited. In: Banchs, R.E., Silvestri, F., Liu, TY., Zhang, M., Gao, S., Lang, J. (eds) Information Retrieval Technology. AIRS 2013. Lecture Notes in Computer Science, vol 8281. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-45068-6_18
DOI: https://doi.org/10.1007/978-3-642-45068-6_18
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-45067-9
Online ISBN: 978-3-642-45068-6
eBook Packages: Computer Science, Computer Science (R0)