Abstract
File similarity is a numerical indicator of how much duplicated data exists in target files. With this information, the storage capacity required can be reduced by applying a data deduplication scheme; it can also be exploited in digital forensics to find malicious software. However, measuring the similarity between files can incur high overhead in terms of processing time and disk storage capacity. For this reason, in this paper we propose a novel file similarity evaluation algorithm called PHISA (Partial Hash Information String Algorithm). To evaluate the performance of the proposed system, we compare PHISA with well-known file similarity tools. The evaluation results show that PHISA reduces processing time and increases the accuracy of similarity evaluation.
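To make the notion of file similarity as a measure of duplicated data concrete, the following minimal sketch scores two byte strings by the fraction of fixed-size chunks they share. It is not PHISA, whose details are not given in this abstract; it is a generic chunk-hashing illustration. The function names, the 4096-byte chunk size, the choice of SHA-1, and the Jaccard-style score are all assumptions made for this example.

```python
import hashlib


def chunk_hashes(data: bytes, chunk_size: int = 4096) -> set:
    """Split data into fixed-size chunks and return the set of chunk digests."""
    return {
        hashlib.sha1(data[i:i + chunk_size]).hexdigest()
        for i in range(0, len(data), chunk_size)
    }


def similarity(a: bytes, b: bytes, chunk_size: int = 4096) -> float:
    """Illustrative score: shared distinct chunks / all distinct chunks (Jaccard)."""
    ha = chunk_hashes(a, chunk_size)
    hb = chunk_hashes(b, chunk_size)
    if not ha and not hb:
        # Two empty inputs are treated as identical (an assumption of this sketch).
        return 1.0
    return len(ha & hb) / len(ha | hb)


if __name__ == "__main__":
    block = 4096
    file_a = b"A" * block + b"B" * block + b"C" * block + b"D" * block
    file_b = b"A" * block + b"B" * block + b"C" * block + b"E" * block
    # Three of the five distinct chunks are common to both inputs.
    print(similarity(file_a, file_b))
```

Even this simple scheme has to hash every byte of both files and keep one digest per chunk, which hints at the processing-time and storage overhead that the abstract identifies as the motivation for a lighter-weight approach.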
Acknowledgments
This research was supported by the Basic Science Research Program through the National Research Foundation of Korea (NRF), funded by the Ministry of Science, ICT and Future Planning (2014R1A2A1A11054160).
Cite this article
Kim, BK., Oh, SJ., Jang, SB. et al. File similarity evaluation scheme for multimedia data using partial hash information. Multimed Tools Appl 76, 19649–19663 (2017). https://doi.org/10.1007/s11042-016-3373-7
DOI: https://doi.org/10.1007/s11042-016-3373-7