Identifying File Similarity in Large Data Sets by Modulo File Length

Zhou, Yongtao; Deng, Yuhui; Chen, Xiaoguang; Xie, Junjie

doi:10.1007/978-3-319-11194-0_11

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 8631)

Included in the following conference series:

International Conference on Algorithms and Architectures for Parallel Processing

Abstract

Identifying file similarity is important for data management. Sampling files is a simple and effective way to identify similar files. However, the traditional sampling algorithm (TSA) is very sensitive to file modification: for example, a shift of a single bit can cause similarity detection to fail. Many research efforts have been invested in solving or alleviating this problem. This paper proposes a Position-Aware Sampling (PAS) algorithm that identifies file similarity in large data sets by modulo file length. The method is very effective in dealing with file modification when performing similarity detection. Comprehensive experimental results demonstrate that PAS significantly outperforms simhash, a well-known similarity detection algorithm, in terms of precision and recall. Furthermore, the time overhead and the CPU and memory usage of PAS are much lower than those of simhash.
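
The sensitivity of sampling to file modification can be illustrated with a minimal Python sketch. It is not the PAS algorithm from the paper, whose details are not given in this abstract. The function names, the sample count `n`, the sample width `k`, and the `stride` are all hypothetical, and the head-and-tail anchoring scheme is a simplified stand-in chosen only to show why sample positions that depend on where content sits in the file matter.

```python
import hashlib


def _digest(chunk: bytes) -> str:
    # Any stable hash works; SHA-256 is used here only as a fingerprint.
    return hashlib.sha256(chunk).hexdigest()


def fixed_offset_samples(data: bytes, n: int = 4, k: int = 8) -> set:
    """Fingerprint n samples of k bytes at offsets spread evenly over the file."""
    step = max((len(data) - k) // max(n - 1, 1), 1)
    return {_digest(data[i * step:i * step + k]) for i in range(n)}


def head_tail_samples(data: bytes, n: int = 4, k: int = 8, stride: int = 64) -> set:
    """Hypothetical variant: half of the samples are anchored to the start of
    the file and half to the end, so an edit between them moves neither."""
    size = len(data)
    chunks = []
    for i in range(n // 2):
        # Sample measured from the head of the file.
        chunks.append(data[i * stride:i * stride + k])
        # Sample measured from the tail of the file (clamped for short files).
        end = max(size - i * stride, 0)
        chunks.append(data[max(end - k, 0):end])
    return {_digest(c) for c in chunks}


def jaccard(a: set, b: set) -> float:
    """Similarity of two fingerprint sets: |a & b| / |a | b|."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


if __name__ == "__main__":
    original = bytes(range(256)) * 4                    # 1024 bytes of test data
    modified = original[:512] + b"X" + original[512:]   # one byte inserted mid-file
    print(jaccard(fixed_offset_samples(original), fixed_offset_samples(modified)))
    print(jaccard(head_tail_samples(original), head_tail_samples(modified)))
```

For this input, the one-byte insertion changes the file length and therefore every evenly spaced offset except the first, so the fixed-offset score drops below 1, while the head-and-tail samples all lie outside the edited region and still match. An edit that falls inside a sampled region would change that sample under either scheme.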

Author information

Authors and Affiliations

Department of Computer Science, Jinan University, Guangzhou, 510632, P.R. China
Yongtao Zhou, Yuhui Deng, Xiaoguang Chen & Junjie Xie

Editor information

Editors and Affiliations

Department of Computer Science, Illinois Institute of Technology, 60616-3793, Chicago, IL, USA
Xian-he Sun
School of Computer Science and Technology, Dalian Maritime University, 1 Linghai Road, 116026, Dalian, China
Wenyu Qu
SEECS, University of Ottawa, 8, King Edward Ave, K1N 6N5, Ottawa, ON, Canada
Ivan Stojmenovic
Deakin University, 221 Burwood Highway, 3125, Burwood, VIC, Australia
Wanlei Zhou
Dalian Maritime University, No. 1 Linhai Road, Dalian, 116026, China
Zhiyang Li
BeiHang University, XueYuan Road No.37, HaiDian District, Beijing, China
Hua Guo
University of Bradford, BD7 1DP, Bradford, West Yorkshire, United Kingdom
Geyong Min
Dalian Maritime University, No. 1 Linhai Road, Dalian, 116026, China
Tingting Yang
Computer Network Information Center, Chinese Academy of Sciences, 100190, Beijing, China
Yulei Wu
Shandong University, 27 Shanda Nanlu, 250100, Jinan City, Shandong Province, China
Lei Liu

About this paper

Cite this paper

Zhou, Y., Deng, Y., Chen, X., Xie, J. (2014). Identifying File Similarity in Large Data Sets by Modulo File Length. In: Sun, Xh., et al. Algorithms and Architectures for Parallel Processing. ICA3PP 2014. Lecture Notes in Computer Science, vol 8631. Springer, Cham. https://doi.org/10.1007/978-3-319-11194-0_11

DOI: https://doi.org/10.1007/978-3-319-11194-0_11
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-11193-3
Online ISBN: 978-3-319-11194-0
eBook Packages: Computer Science, Computer Science (R0)
