Exploiting Sentence-Level Features for Near-Duplicate Document Detection

Wang, Jenq-Haur; Chang, Hung-Chi

doi:10.1007/978-3-642-04769-5_18

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 5839)

Included in the following conference series:

Asia Information Retrieval Symposium

Abstract

Digital documents are easy to copy, so detecting possible near-duplicate copies effectively is critical in Web search. Conventional copy detection approaches, such as document fingerprinting and bag-of-words similarity, target different levels of granularity in document features, from word n-grams to whole documents. In this paper, we focus on the mutual-inclusive type of near-duplicates, where only partial overlap among documents makes them similar. We propose using a simple and compact sentence-level feature, the sequence of sentence lengths, for near-duplicate copy detection. Various configurations of sentence-level and word-level algorithms are evaluated. The experimental results show that sentence-level algorithms achieved higher efficiency with comparable precision and recall rates.
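The sentence-length feature named in the abstract can be illustrated with a minimal sketch. The abstract does not say how sentences are delimited, how length is measured, or how two sequences are compared, so everything below is an assumption made for illustration: sentences are split on `.`, `!` and `?`, length is counted in words, and two documents are flagged when their length sequences share a sufficiently long contiguous run. The names `sentence_lengths`, `longest_common_run`, `is_near_duplicate` and the `min_run` threshold are hypothetical and are not taken from the paper.

```python
import re


def sentence_lengths(text):
    """Return the word count of each sentence in text.

    Assumption: sentences end at '.', '!' or '?'; the paper may
    use a different delimiter set or measure length differently.
    """
    sentences = [s.strip() for s in re.split(r"[.!?]+", text) if s.strip()]
    return [len(s.split()) for s in sentences]


def longest_common_run(a, b):
    """Length of the longest contiguous run shared by sequences a and b.

    Standard longest-common-substring dynamic programming, applied to
    sequences of integers instead of characters.
    """
    best = 0
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0] * (len(b) + 1)
        for j, y in enumerate(b, start=1):
            if x == y:
                cur[j] = prev[j - 1] + 1
                best = max(best, cur[j])
        prev = cur
    return best


def is_near_duplicate(doc_a, doc_b, min_run=3):
    """Flag partial overlap: a shared run of at least min_run sentence lengths.

    min_run is an illustrative threshold, not a value from the paper.
    """
    run = longest_common_run(sentence_lengths(doc_a), sentence_lengths(doc_b))
    return run >= min_run


if __name__ == "__main__":
    original = ("The cat sat. Dogs bark loudly at night. It rained. "
                "We stayed home all day long.")
    excerpted = ("Here is some breaking news for you today. "
                 "Dogs bark loudly at night. It rained. "
                 "We stayed home all day long. More later.")
    print(sentence_lengths(original))
    print(is_near_duplicate(original, excerpted))
```

Because each sentence is reduced to a single integer, the feature is far more compact than word n-grams, which fits the efficiency claim in the abstract. A run-based comparison is one plausible way to capture the partial, mutual-inclusive overlap the abstract targets; short runs of common lengths can also match by chance, which is why the sketch requires a minimum run length.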

Author information

Authors and Affiliations

National Taipei University of Technology, Taiwan
Jenq-Haur Wang
Academia Sinica, Taiwan
Hung-Chi Chang

Editor information

Editors and Affiliations

Department of Computer Science and Engineering, Pohang University of Science and Technology, San 31, Hyoja-dong, Nam-gu, 790-784, Pohang, Korea
Gary Geunbae Lee
School of Computing, The Robert Gordon University, St Andrew Street, AB25 1HG, Aberdeen, UK
Dawei Song
Microsoft Research Asia, 5F Beijing Sigma Center, 49 Zhichun Road, Haidian District, 100190, Beijing, P.R. China
Chin-Yew Lin
National Institute of Informatics, 2-1-2 Hitotsubashi, Chiyoda-ku, 101-8430, Tokyo, Japan
Akiko Aizawa
School of Literature, Shirayuri College, 1-25 Midorigaoka, Chofu-shi, 182-8525, Tokyo, Japan
Kazuko Kuriyama
Graduate School of Information Science and Technology, Hokkaido University, North 14 West 9, Kita-ku, Sapporo-shi, 060-0814, Hokkaido, Japan
Masaharu Yoshioka
Microsoft Research Asia, 5F Beijing Sigma Center, 49 Zhichun Road, Haidian District, 100190, Beijing, P.R. China
Tetsuya Sakai

About this paper

Cite this paper

Wang, JH., Chang, HC. (2009). Exploiting Sentence-Level Features for Near-Duplicate Document Detection. In: Lee, G.G., et al. Information Retrieval Technology. AIRS 2009. Lecture Notes in Computer Science, vol 5839. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-04769-5_18

DOI: https://doi.org/10.1007/978-3-642-04769-5_18
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-04768-8
Online ISBN: 978-3-642-04769-5
eBook Packages: Computer Science, Computer Science (R0)
