
File similarity evaluation scheme for multimedia data using partial hash information

Multimedia Tools and Applications

Abstract

File similarity is a numerical indicator of how much duplicated data exists in the target files. With this information, the storage capacity required can be reduced through data deduplication, and the same measure can be exploited in digital forensics to find malicious software. However, measuring similarity between files can incur high overhead in both processing time and disk storage capacity. For this reason, in this paper we propose a novel file similarity evaluation algorithm called PHISA (Partial Hash Information String Algorithm). To evaluate the performance of the proposed system, we compare PHISA with well-known file similarity tools. The evaluation results show that PHISA reduces processing time and increases the accuracy of similarity evaluation.

Acknowledgments

This research was supported by the Basic Science Research Program through the National Research Foundation of Korea (NRF), funded by the Ministry of Science, ICT and Future Planning (2014R1A2A1A11054160).

Author information

Corresponding author

Correspondence to Young-Woong Ko.

About this article

Cite this article

Kim, BK., Oh, SJ., Jang, SB. et al. File similarity evaluation scheme for multimedia data using partial hash information. Multimed Tools Appl 76, 19649–19663 (2017). https://doi.org/10.1007/s11042-016-3373-7

  • DOI: https://doi.org/10.1007/s11042-016-3373-7
