Taxonomy of Data Fragment Classification Techniques

  • Conference paper
Digital Forensics and Cyber Crime (ICDF2C 2013)

Abstract

Several fields of digital forensics (e.g. file carving, memory forensics, network forensics) require reliable classification of the data type of digital fragments. To date, a multitude of research papers proposing new classification approaches have been published. In this paper we comprehensively review existing classification approaches and sort them into categories. Within each category, approaches are grouped by the characteristics they share. The major contribution of this paper is a novel taxonomy of existing data fragment classification approaches. We highlight the progress made by previous work, which facilitates the identification of future research directions. Furthermore, the taxonomy can provide the foundation for future knowledge-based classification approaches.

Author information

Corresponding author

Correspondence to Rainer Poisel.

Copyright information

© 2014 Institute for Computer Sciences, Social Informatics and Telecommunications Engineering

About this paper

Cite this paper

Poisel, R., Rybnicek, M., Tjoa, S. (2014). Taxonomy of Data Fragment Classification Techniques. In: Gladyshev, P., Marrington, A., Baggili, I. (eds) Digital Forensics and Cyber Crime. ICDF2C 2013. Lecture Notes of the Institute for Computer Sciences, Social Informatics and Telecommunications Engineering, vol 132. Springer, Cham. https://doi.org/10.1007/978-3-319-14289-0_6

  • DOI: https://doi.org/10.1007/978-3-319-14289-0_6

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-14288-3

  • Online ISBN: 978-3-319-14289-0

  • eBook Packages: Computer Science (R0)
