Abstract
Several fields of digital forensics (e.g. file carving, memory forensics, network forensics) require reliable data type classification of digital fragments. Numerous research papers proposing new classification approaches have been published to date. In this paper we comprehensively review existing classification approaches and group them into categories. Within each category, approaches are grouped by their shared characteristics. The major contribution of this paper is a novel taxonomy of existing data fragment classification approaches. We highlight the progress made by previous work, which facilitates the identification of future research directions. Furthermore, the taxonomy can provide the foundation for future knowledge-based classification approaches.
Copyright information
© 2014 Institute for Computer Sciences, Social Informatics and Telecommunications Engineering
About this paper
Cite this paper
Poisel, R., Rybnicek, M., Tjoa, S. (2014). Taxonomy of Data Fragment Classification Techniques. In: Gladyshev, P., Marrington, A., Baggili, I. (eds) Digital Forensics and Cyber Crime. ICDF2C 2013. Lecture Notes of the Institute for Computer Sciences, Social Informatics and Telecommunications Engineering, vol 132. Springer, Cham. https://doi.org/10.1007/978-3-319-14289-0_6
DOI: https://doi.org/10.1007/978-3-319-14289-0_6
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-14288-3
Online ISBN: 978-3-319-14289-0
eBook Packages: Computer Science (R0)