Finding text reuse on the web

Authors: Michael Bendersky, W. Bruce Croft

WSDM '09: Proceedings of the Second ACM International Conference on Web Search and Data Mining

Pages 262 - 271

https://doi.org/10.1145/1498759.1498835

Published: 09 February 2009

Abstract

With the overwhelming number of reports on similar events originating from different sources on the web, it is often hard, using existing web search paradigms, to find the original source of "facts", statements, rumors, and opinions, and to track their development. Several techniques have previously been proposed for detecting such text reuse between different sources; however, these techniques have been tested against relatively small and homogeneous TREC collections. In this work, we test the feasibility of text reuse detection techniques in the setting of web search. In addition to text reuse detection, we develop a novel technique that addresses the unique challenges of finding original sources on the web, such as defining a timeline. We also explore the use of link analysis for identifying reliable and relevant reports. Our experimental results show that the proposed techniques can operate at the scale of the web, are significantly more accurate than standard web search for finding text reuse, and provide a richer representation for tracking information flow.
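To make the idea of text reuse detection with a timeline concrete, the following is a minimal sketch and not the method evaluated in the paper. It assumes word-trigram overlap (containment) as the reuse measure and assumes each document already carries a known publication date, which the abstract notes is itself a challenge on the web. The names `shingles`, `containment`, `earliest_reuse`, and the `threshold` parameter are hypothetical.

```python
from datetime import date


def shingles(text, k=3):
    """Return the set of word k-grams (shingles) in text, lowercased."""
    words = text.lower().split()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def containment(statement, document, k=3):
    """Fraction of the statement's shingles that also occur in the document."""
    query = shingles(statement, k)
    if not query:
        # Statement is shorter than k words, so there is nothing to compare.
        return 0.0
    return len(query & shingles(document, k)) / len(query)


def earliest_reuse(statement, dated_docs, threshold=0.5, k=3):
    """Return the earliest (date, text) pair that reuses the statement, or None.

    dated_docs is an iterable of (date, text) pairs; the dates are assumed
    to be known and comparable.
    """
    matches = [
        (published, text)
        for published, text in dated_docs
        if containment(statement, text, k) >= threshold
    ]
    return min(matches, key=lambda match: match[0]) if matches else None


if __name__ == "__main__":
    statement = "the quick brown fox jumps"
    docs = [
        (date(2009, 2, 9), "reports say the quick brown fox jumps over the fence"),
        (date(2009, 1, 5), "the quick brown fox jumps"),
        (date(2008, 12, 1), "an unrelated story about the weather today"),
    ]
    print(earliest_reuse(statement, docs))
```

Containment is used here rather than symmetric Jaccard similarity because a short statement embedded in a long report should still count as reused. A web-scale system would need an index or fingerprinting scheme instead of this pairwise comparison.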

Index Terms

Finding text reuse on the web
1. Information systems
  1. Information retrieval

Published In

WSDM '09: Proceedings of the Second ACM International Conference on Web Search and Data Mining

February 2009

314 pages

ISBN: 9781605583907

DOI: 10.1145/1498759

Editors:
Ricardo Baeza-Yates (Yahoo! Research, Spain),
Paolo Boldi (Universita degli Studi di Milano, Italy),
Berthier Ribeiro-Neto (Google Engineering, Brazil & CS Dept., Univ. Fed. de Minas Gerais, Brazil),
B. Barla Cambazoglu (Yahoo! Research)

Copyright © 2009 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

SIGMOD: ACM Special Interest Group on Management of Data
SIGWEB: ACM Special Interest Group on Hypertext, Hypermedia, and Web
Yahoo! Research
SIGKDD: ACM Special Interest Group on Knowledge Discovery in Data
Nokia
Google Inc.
SIGIR: ACM Special Interest Group on Information Retrieval
Microsoft

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article

Funding Sources

Division of Information and Intelligent Systems

Conference

WSDM'09

WSDM'09: Second ACM International Conference on Web Search and Web Data Mining

February 9 - 12, 2009

Barcelona, Spain

Acceptance Rates

Overall Acceptance Rate 498 of 2,863 submissions, 17%
