A Fusion of Algorithms in Near Duplicate Document Detection

Fan, Jun; Huang, Tiejun

doi:10.1007/978-3-642-28320-8_20

Jun Fan^23,24 &
Tiejun Huang^23,24

Part of the book series: Lecture Notes in Computer Science ((LNAI,volume 7104))

Included in the following conference series:

Pacific-Asia Conference on Knowledge Discovery and Data Mining

1526 Accesses

Abstract

With the rapid development of the World Wide Web, there are a huge number of fully or fragmentally duplicated pages in the Internet. Return of these near duplicated results to the users greatly affects user experiences. In the process of deploying digital libraries, the protection of intellectual property and removal of duplicate contents needs to be considered. This paper fuses some “state of the art” algorithms to reach a better performance. We first introduce the three major algorithms (shingling, I-match, simhash) in duplicate document detection and their developments in the following days. We take sequences of words (shingles) as the feature of simhash algorithm. We then import the random lexicons based multi fingerprints generation method into shingling base simhash algorithm and named it shingling based multi fingerprints simhash algorithm. We did some preliminary experiments on the synthetic dataset based on the “China-US Million Book Digital Library Project”. The experiment result proves the efficiency of these algorithms.

The work is partially supported by National Natural Science Foundation of China under grant No. 90820003, the Important Scientific and Technological Engineering Projects of GAPP of China under grant No. GAPP-ZDKJ-BQ/15-6, and CADAL project.

This is a preview of subscription content, log in via an institution to check access.

Access this chapter

Log in via an institution

Chapter: USD 29.95; Price excludes VAT (USA)

eBook: USD 39.99; Price excludes VAT (USA)

Softcover Book: USD 54.99; Price excludes VAT (USA)

Tax calculation will be finalised at checkout

Purchases are for personal use only

Institutional subscriptions

Preview

Unable to display preview. Download preview PDF.

References

Brin, S., Davis, J., Garcia-Molina, H.: Copy Detection Mechanisms for Digital Documents. In: Proceedings of the ACM SIGMOD Annual Conference (1995)
Google Scholar
Shivakumar, N., Garcia-Molina, H.: SCAM: A copy detection mechanism for digital documents. In: Proceedings of the 2nd International Conference in Theory and Practice of Digital Libraries, DL 1995 (1995)
Google Scholar
Shivakumar, N., Garcia-Molina, H.: Building a scalable and accurate copy detection mechanism. In: Proceedings of the 1st ACM Conference on Digital Libraries, DL 1996 (1996)
Google Scholar
Chowdhury, A., Frieder, O., Grossman, D., Mccabe, M.C.: Collection statistics for fast duplicate document detection. ACM Transactions on Information Systems 20(2) (2002)
Google Scholar
Kołcz, A., Chowdhury, A., Alspector, J.: Improved Robustness of Signature-Based Near-Replica Detection via Lexicon Randomization. In: Proceedings of the tenth ACM SIGKDD, Seattle, WA, USA (2004)
Google Scholar
Conrad, J.G., Guo, X.S., Schriber, C.P.: Online Duplicate Document Detection: Signature Reliability in a Dynamic Retrieval Environment. In: Proceedings of the Twelfth International Conference on Information and Knowledge Management (2003)
Google Scholar
Broder, A.Z., Glassman, S.C., Manasse, M.S.: Syntactic clustering of the Web. In: Proceedings of the 6th International Web Conference (1997)
Google Scholar
Broder, A.Z., Charikar, M., Frieze, A., Mitzenmacher, M.: Min-Wise Independent Permutations. Journal of Computer and System Sciences, 630–659 (2000)
Google Scholar
Fetterly, D., Manasse, M., Najork, M.: On the evolution of clusters of near-duplicate web pages. In: Proceedings of First Latin American Web Congress, pp. 37–45 (2003)
Google Scholar
Fetterly, D., Manasse, M., Najork, M.: Detecting Phrase-level Duplication on the World Wide Web. In: The 28th ACM SIGIR, pp. 170–177 (2005)
Google Scholar
Charikar, M.S.: Similarity estimation techniques from rounding algorithms. In: Proceedings of 34th Annual Symposium on Theory of Computing (2002)
Google Scholar
Henzinger, M.: Finding near-duplicate web pages: a large-scale evaluation of algorithms. In: Proceedings of the 29th ACM SIGIR, pp. 284–291 (2006)
Google Scholar
Manku, G.S., Jain, A., Sarma, A.D.: Detecting near-duplicates for web crawling. In: Proceedings of the 16th International Conference on World Wide Web, pp. 141–150 (2007)
Google Scholar
Theobald, M., Siddharth, J., Paepcke, A.: SpotSigs: robust and efficient near duplicate detection in large web collections. In: Proceedings of ACM SIGIR (2008)
Google Scholar

Download references

Author information

Authors and Affiliations

National Engineering Laboratory for Video Technology, School of EE & CS, Peking University, Beijing, 100871, China
Jun Fan & Tiejun Huang
Shenzhen Graduate School, Peking University, Shenzhen, 518055, China
Jun Fan & Tiejun Huang

Authors

Jun Fan
View author publications
You can also search for this author in PubMed Google Scholar
Tiejun Huang
View author publications
You can also search for this author in PubMed Google Scholar

Editor information

Editors and Affiliations

Faculty of Engineering and Information Technology, University of Technology Sydney, Broadway, PO Box 123, NSW 2007, Sydney, Australia
Longbing Cao
Shenzhen Institute of Advanced Technology (SIAT), Chinese Academy of Sciences, 518055, Shenzhen, China
Joshua Zhexue Huang & Jun Luo &
The University of Melbourne, VIC 3010, Melbourne, Australia
James Bailey
The University of Auckland, Auckland, New Zealand
Yun Sing Koh

Rights and permissions

Reprints and permissions

Copyright information

About this paper

Cite this paper

Fan, J., Huang, T. (2012). A Fusion of Algorithms in Near Duplicate Document Detection. In: Cao, L., Huang, J.Z., Bailey, J., Koh, Y.S., Luo, J. (eds) New Frontiers in Applied Data Mining. PAKDD 2011. Lecture Notes in Computer Science(), vol 7104. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-28320-8_20

Download citation

DOI: https://doi.org/10.1007/978-3-642-28320-8_20
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-28319-2
Online ISBN: 978-3-642-28320-8
eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics