Research article. DOI: 10.1145/1498759.1498835

Finding text reuse on the web

Published: 09 February 2009

Abstract

With the overwhelming number of reports on similar events originating from different sources on the web, it is often hard, using existing web search paradigms, to find the original source of "facts", statements, rumors, and opinions, and to track their development. Several techniques have previously been proposed for detecting such text reuse between different sources; however, these techniques have been tested only against relatively small and homogeneous TREC collections. In this work, we test the feasibility of text reuse detection techniques in the setting of web search. In addition to text reuse detection, we develop a novel technique that addresses the unique challenges of finding original sources on the web, such as defining a timeline. We also explore the use of link analysis for identifying reliable and relevant reports. Our experimental results show that the proposed techniques can operate on the scale of the web, are significantly more accurate than standard web search for finding text reuse, and provide a richer representation for tracking the information flow.

Published In

WSDM '09: Proceedings of the Second ACM International Conference on Web Search and Data Mining
February 2009
314 pages
ISBN: 9781605583907
DOI: 10.1145/1498759

Publisher

Association for Computing Machinery, New York, NY, United States

Author Tags

1. information flow
2. text reuse
3. web search

Acceptance Rates

Overall acceptance rate: 498 of 2,863 submissions (17%)
