Elsevier

Digital Investigation

Volume 9, Supplement, August 2012, Pages S44-S49

Using NLP techniques for file fragment classification

https://doi.org/10.1016/j.diin.2012.05.008
Under a Creative Commons license
open access

Abstract

The classification of file fragments is an important problem in digital forensics. The literature lacks comprehensive work on applying machine learning techniques to this problem. In this work, we explore the use of techniques from natural language processing to classify file fragments. We take a supervised learning approach based on support vector machines combined with the bag-of-words model, in which text documents are represented as unordered bags of words. This technique has repeatedly been shown to be effective and robust in classifying text documents (e.g., in distinguishing positive movie reviews from negative ones).

In our approach, we represent file fragments as “bags of bytes”, with feature vectors consisting of unigram and bigram counts together with other statistical measurements, such as entropy. We used the publicly available Garfinkel data corpus to generate file fragments for training and testing. We ran a series of experiments and found that this approach is effective in this domain as well.
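The “bag of bytes” representation can be illustrated with a minimal sketch. The function name `fragment_features` and the vector layout below are assumptions made for illustration, not the paper's actual implementation; only the ingredients (unigram counts, bigram counts, and entropy) come from the abstract.

```python
import math
from collections import Counter


def fragment_features(fragment: bytes) -> list:
    """Map a file fragment to a fixed-length "bag of bytes" feature vector.

    Layout (an assumption for this sketch, not the paper's exact layout):
      [0, 256)            unigram counts, one per byte value
      [256, 256 + 65536)  bigram counts, indexed by first_byte * 256 + second_byte
      last element        Shannon entropy of the byte distribution, in bits per byte
    """
    # Iterating over a bytes object yields integers in the range 0..255.
    unigrams = Counter(fragment)
    # Adjacent byte pairs; their position within the fragment is discarded.
    bigrams = Counter(zip(fragment, fragment[1:]))

    vector = [0.0] * (256 + 256 * 256 + 1)
    for byte, count in unigrams.items():
        vector[byte] = float(count)
    for (first, second), count in bigrams.items():
        vector[256 + first * 256 + second] = float(count)

    # Shannon entropy H = -sum(p * log2(p)) over the observed byte values.
    # An empty fragment has no observed values, so its entropy stays 0.0.
    total = len(fragment)
    entropy = 0.0
    for count in unigrams.values():
        p = count / total
        entropy -= p * math.log2(p)
    vector[-1] = entropy
    return vector
```

Vectors of this kind could then be passed, with their file-type labels, to any support vector machine implementation (for example, scikit-learn's `LinearSVC`). The abstract does not say which implementation or feature scaling was used, so that choice is left open here.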

Keywords

File fragment classification
File carving
Natural language processing
Bigrams
Machine learning
Support vector machine
Digital forensics
