Abstract
Identifying file similarity is important for data management. Sampling files is a simple and effective approach to identifying file similarity. However, the traditional sampling algorithm (TSA) is very sensitive to file modification: a single-bit shift, for example, can cause similarity detection to fail. Many research efforts have been invested in solving or alleviating this problem. This paper proposes a Position-Aware Sampling (PAS) algorithm that identifies file similarity in large data sets based on a modulo operation over the file length. This method is very effective in dealing with file modification when performing similarity detection. Comprehensive experimental results demonstrate that PAS significantly outperforms simhash, a well-known similarity detection algorithm, in terms of precision and recall. Furthermore, the time overhead and the CPU and memory usage of PAS are much lower than those of simhash.
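The shift sensitivity described in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation of TSA or PAS: the function names, the number of samples, the chunk length, and the choice of SHA-256 are all assumptions made for illustration. The sketch hashes small chunks read at fixed, evenly spaced offsets and compares the resulting fingerprints position by position.

```python
import hashlib


def sample_fingerprints(data: bytes, num_samples: int = 8, chunk_len: int = 4) -> list[str]:
    """Hash chunk_len-byte chunks read at fixed, evenly spaced offsets.

    Hypothetical helper: the parameters are illustrative, not taken from the paper.
    """
    if len(data) < chunk_len:
        return [hashlib.sha256(data).hexdigest()]
    # Spread the sample offsets evenly from the start to the last full chunk.
    step = max(1, (len(data) - chunk_len) // max(1, num_samples - 1))
    offsets = [min(i * step, len(data) - chunk_len) for i in range(num_samples)]
    return [hashlib.sha256(data[o:o + chunk_len]).hexdigest() for o in offsets]


def similarity(a: bytes, b: bytes) -> float:
    """Fraction of sampled chunks whose hashes match at the same sample index."""
    fa = sample_fingerprints(a)
    fb = sample_fingerprints(b)
    matches = sum(1 for x, y in zip(fa, fb) if x == y)
    return matches / max(len(fa), len(fb))


# A 1024-byte file and a copy with one byte inserted at the front.
original = bytes(range(256)) * 4
shifted = b"\x00" + original

score_same = similarity(original, original)
score_shifted = similarity(original, shifted)
```

In this toy setup the two files differ by a single inserted byte, yet every sampled chunk is read one byte off in the shifted copy, so none of the sample hashes match and the score falls from 1.0 to 0.0. This is the kind of failure that position-aware sampling is meant to avoid.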
Copyright information
© 2014 Springer International Publishing Switzerland
About this paper
Cite this paper
Zhou, Y., Deng, Y., Chen, X., Xie, J. (2014). Identifying File Similarity in Large Data Sets by Modulo File Length. In: Sun, Xh., et al. Algorithms and Architectures for Parallel Processing. ICA3PP 2014. Lecture Notes in Computer Science, vol 8631. Springer, Cham. https://doi.org/10.1007/978-3-319-11194-0_11
DOI: https://doi.org/10.1007/978-3-319-11194-0_11
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-11193-3
Online ISBN: 978-3-319-11194-0
eBook Packages: Computer Science, Computer Science (R0)