
Identifying File Similarity in Large Data Sets by Modulo File Length

  • Conference paper
Algorithms and Architectures for Parallel Processing (ICA3PP 2014)

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 8631)

Abstract

Identifying file similarity is very important for data management. Sampling files is a simple and effective approach to identifying file similarity. However, the traditional sampling algorithm (TSA) is very sensitive to file modification: for example, a single bit shift can cause similarity detection to fail. Considerable research effort has been invested in solving or alleviating this problem. This paper proposes a Position-Aware Sampling (PAS) algorithm that identifies file similarity in large data sets by modulo file length. The method is very effective at handling file modification during similarity detection. Comprehensive experimental results demonstrate that PAS significantly outperforms simhash, a well-known similarity detection algorithm, in terms of precision and recall. Furthermore, the time overhead and the CPU and memory usage of PAS are much lower than those of simhash.
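
The sensitivity to modification described in the abstract can be illustrated with a minimal sketch. The abstract does not specify how TSA or PAS choose sampling positions, so the code below is an assumption rather than the authors' method: it samples short blocks at fixed absolute offsets, hashes them, and compares the hashes position by position. The function names, the stride, and the block length are hypothetical, and a one-byte insertion stands in for the single bit shift mentioned above.

```python
import hashlib


def sample_fingerprints(data: bytes, stride: int = 256, sample_len: int = 8) -> list:
    """Hash a short block at every fixed offset 0, stride, 2*stride, ...

    Assumed illustrative scheme, not the TSA or PAS of the paper.
    """
    last_start = len(data) - sample_len
    return [
        hashlib.md5(data[off:off + sample_len]).hexdigest()
        for off in range(0, last_start + 1, stride)
    ]


def similarity(a: bytes, b: bytes) -> float:
    """Fraction of sampled blocks whose hashes agree at the same index."""
    fa, fb = sample_fingerprints(a), sample_fingerprints(b)
    if not fa or not fb:
        # Files too short to sample: fall back to an exact comparison.
        return 1.0 if a == b else 0.0
    matches = sum(1 for x, y in zip(fa, fb) if x == y)
    return matches / max(len(fa), len(fb))


# Deterministic, non-repeating test data (1024 bytes).
original = bytes((i * i) % 251 for i in range(1024))
# Insert one byte at the front, so every later byte moves by one position.
shifted = b"\x00" + original

same_score = similarity(original, original)
shifted_score = similarity(original, shifted)
print(same_score, shifted_score)
```

Because every sampled block in the shifted file starts one byte earlier relative to the original content, the hashes no longer line up even though the two files are almost identical. This is the kind of failure that a position-aware scheme is meant to avoid.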




Copyright information

© 2014 Springer International Publishing Switzerland

About this paper

Cite this paper

Zhou, Y., Deng, Y., Chen, X., Xie, J. (2014). Identifying File Similarity in Large Data Sets by Modulo File Length. In: Sun, Xh., et al. Algorithms and Architectures for Parallel Processing. ICA3PP 2014. Lecture Notes in Computer Science, vol 8631. Springer, Cham. https://doi.org/10.1007/978-3-319-11194-0_11


  • DOI: https://doi.org/10.1007/978-3-319-11194-0_11

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-11193-3

  • Online ISBN: 978-3-319-11194-0

  • eBook Packages: Computer Science, Computer Science (R0)
