Abstract
Digital documents are easy to copy, so detecting possible near-duplicate copies effectively is critical in Web search. Conventional copy detection approaches such as document fingerprinting and bag-of-words similarity target different levels of granularity in document features, from word n-grams to whole documents. In this paper, we focus on the mutual-inclusive type of near-duplicates, in which documents are similar only through partial overlap. We propose using a simple and compact sentence-level feature, the sequence of sentence lengths, for near-duplicate copy detection. Various configurations of sentence-level and word-level algorithms are evaluated. The experimental results show that sentence-level algorithms achieve higher efficiency with comparable precision and recall rates.
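To make the sentence-length feature concrete, the following is a minimal Python sketch, not the algorithm evaluated in the paper. It assumes that sentences end at ".", "!" or "?", that sentence length is counted in whitespace-separated words, and that two documents are flagged when their length sequences share a contiguous run of at least `min_run` sentences. The function names and the `min_run` threshold are hypothetical and chosen only for illustration.

```python
import re
from difflib import SequenceMatcher


def sentence_lengths(text):
    """Return the word count of each non-empty sentence in text.

    Assumption: sentences are delimited by '.', '!' or '?'.
    """
    sentences = re.split(r"[.!?]+", text)
    return [len(s.split()) for s in sentences if s.strip()]


def longest_shared_run(lengths_a, lengths_b):
    """Size of the longest contiguous run common to both length sequences."""
    matcher = SequenceMatcher(None, lengths_a, lengths_b, autojunk=False)
    match = matcher.find_longest_match(0, len(lengths_a), 0, len(lengths_b))
    return match.size


def is_near_duplicate(doc_a, doc_b, min_run=3):
    """Flag partial overlap when enough consecutive sentence lengths agree.

    min_run is an illustrative threshold, not a value from the paper.
    """
    run = longest_shared_run(sentence_lengths(doc_a), sentence_lengths(doc_b))
    return run >= min_run
```

Because only one integer per sentence is stored, the feature is compact, but unrelated sentences of equal length also match. Requiring a run of several consecutive matches is one way to reduce such accidental matches.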
Copyright information
© 2009 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Wang, JH., Chang, HC. (2009). Exploiting Sentence-Level Features for Near-Duplicate Document Detection. In: Lee, G.G., et al. Information Retrieval Technology. AIRS 2009. Lecture Notes in Computer Science, vol 5839. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-04769-5_18
DOI: https://doi.org/10.1007/978-3-642-04769-5_18
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-04768-8
Online ISBN: 978-3-642-04769-5
eBook Packages: Computer Science (R0)