Detecting Near-Duplicate Documents Using Sentence Level Features

Feng, Jinbo; Wu, Shengli

doi:10.1007/978-3-319-22852-5_17

Conference paper
First Online: 01 January 2015

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 9262)

Abstract

In Web search engines, digital libraries, and other online information services, duplicate and near-duplicate documents can cause serious problems if left unaddressed: wasted storage space, longer indexing time, and redundant results presented to users. In this paper, we propose a method for detecting near-duplicate documents. The method uses two sentence-level features: the number of terms in a sentence and the terms at particular positions in it. A suffix tree is used to match sentence blocks efficiently. Experiments comparing our method with two other representative methods show that it is both effective and efficient. It has the potential to be used in practice.
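The abstract does not say which term positions are used or how matched sentence blocks are scored, so the following is only a minimal sketch under stated assumptions, not the authors' implementation. It assumes the "particular positions" are the first and last term of each sentence, and it replaces the suffix tree with a simple dynamic-programming search for the longest run of consecutive matching sentence signatures. The names `sentence_signatures`, `longest_common_block`, and `block_overlap` are hypothetical.

```python
import re
from typing import List, Tuple

# Assumed signature: (number of terms, first term, last term).
Signature = Tuple[int, str, str]


def sentence_signatures(text: str) -> List[Signature]:
    """Reduce each sentence to a small signature of sentence-level features."""
    signatures = []
    for sentence in re.split(r"[.!?]+", text):
        terms = re.findall(r"\w+", sentence.lower())
        if terms:
            signatures.append((len(terms), terms[0], terms[-1]))
    return signatures


def longest_common_block(a: List[Signature], b: List[Signature]) -> int:
    """Length of the longest run of consecutive signatures shared by a and b.

    This is a quadratic-time stand-in for suffix-tree matching.
    """
    best = 0
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0] * (len(b) + 1)
        for j, y in enumerate(b, start=1):
            if x == y:
                cur[j] = prev[j - 1] + 1
                best = max(best, cur[j])
        prev = cur
    return best


def block_overlap(doc_a: str, doc_b: str) -> float:
    """Fraction of the shorter document covered by the longest shared block."""
    sig_a = sentence_signatures(doc_a)
    sig_b = sentence_signatures(doc_b)
    if not sig_a or not sig_b:
        return 0.0
    return longest_common_block(sig_a, sig_b) / min(len(sig_a), len(sig_b))
```

Two documents could then be flagged as near-duplicates when `block_overlap` exceeds a chosen threshold. Because a signature ignores the middle of a sentence, different sentences can share one; that is the trade-off of comparing compact features instead of full text.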

Author information

Authors and Affiliations

School of Computer Science, Jiangsu University, Zhenjiang, 212013, China
Jinbo Feng & Shengli Wu
School of Computing and Mathematics, Ulster University, Newtownabbey, BT37 0QB, UK
Shengli Wu

Corresponding author

Correspondence to Shengli Wu.

Editor information

Editors and Affiliations

Hewlett-Packard Enterprise, Sunnyvale, California, USA
Qiming Chen
Paul Sabatier University, Toulouse, France
Abdelkader Hameurlain
Blaise Pascal University, Aubiere, France
Farouk Toumani
University of Linz, Linz, Austria
Roland Wagner
Universidad Politécnica de Valencia, Valencia, Spain
Hendrik Decker

About this paper

Cite this paper

Feng, J., Wu, S. (2015). Detecting Near-Duplicate Documents Using Sentence Level Features. In: Chen, Q., Hameurlain, A., Toumani, F., Wagner, R., Decker, H. (eds) Database and Expert Systems Applications. Globe DEXA 2015. Lecture Notes in Computer Science, vol 9262. Springer, Cham. https://doi.org/10.1007/978-3-319-22852-5_17

DOI: https://doi.org/10.1007/978-3-319-22852-5_17
Published: 11 August 2015
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-22851-8
Online ISBN: 978-3-319-22852-5
eBook Packages: Computer Science, Computer Science (R0)
