On Improving the Accuracy and Performance of Content-Based File Type Identification

Ahmed, Irfan; Lhee, Kyung-suk; Shin, Hyunjung; Hong, ManPyo

doi:10.1007/978-3-642-02620-1_4

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 5594)

Included in the following conference series:

Australasian Conference on Information Security and Privacy

Abstract

File types (text, executables, JPEG images, etc.) can be identified through the file extension, magic number, or other header information in the file. However, these are easily tampered with or corrupted, so they cannot be trusted as secure ways to identify file types. In the presence of adversaries, analyzing the file content may be a more reliable way to identify file types, but existing approaches to file type analysis still need improvement in accuracy and speed. Most of them use the byte-frequency distribution as a feature for building a representative model of a file type, and apply a distance metric to compare the model with the byte-frequency distribution of the file in question. Mahalanobis distance is the most popular distance metric. In this paper, we propose 1) the cosine similarity as a better metric than Mahalanobis distance in terms of classification accuracy, model size, and detection speed, and 2) a new type-identification scheme that applies recursive steps to identify the types of files. We compare the cosine similarity with Mahalanobis distance using Wei-Jen Li et al.’s single- and multi-centroid modeling techniques; the cosine similarity improved classification accuracy by 4.8% and 13.10% (single- and multi-centroid, respectively). It also reduced the model size by about 90% and improved the detection speed by 11%. Our proposed type-identification scheme showed 37.78% and 31.47% improvement over Wei-Jen Li’s single- and multi-centroid modeling techniques, respectively.
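
To make the core idea concrete, the following is a minimal sketch of byte-frequency modeling with a cosine-similarity comparison. It is not the authors' implementation: the function names (`byte_frequency`, `centroid`, `cosine_similarity`, `classify`), the toy training samples, and the choice of a plain averaged centroid per type are all assumptions made for illustration.

```python
import math
from collections import Counter
from typing import Dict, List


def byte_frequency(data: bytes) -> List[float]:
    """Return the 256-bin relative byte-frequency distribution of data."""
    counts = Counter(data)
    total = len(data) or 1  # avoid division by zero on empty input
    return [counts.get(b, 0) / total for b in range(256)]


def centroid(samples: List[bytes]) -> List[float]:
    """Average the byte-frequency vectors of the samples (one centroid per type)."""
    vectors = [byte_frequency(s) for s in samples]
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def cosine_similarity(x: List[float], y: List[float]) -> float:
    """Cosine of the angle between x and y; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    if norm_x == 0 or norm_y == 0:
        return 0.0
    return dot / (norm_x * norm_y)


def classify(data: bytes, models: Dict[str, List[float]]) -> str:
    """Return the type whose centroid is most similar to the file's distribution."""
    vector = byte_frequency(data)
    return max(models, key=lambda name: cosine_similarity(vector, models[name]))


# Hypothetical toy training data; real models would be built from many files.
text_samples = [
    b"the quick brown fox jumps over the lazy dog",
    b"content based file type identification",
]
binary_samples = [bytes(range(256)), bytes(range(0, 256, 2)) * 2]

models = {"text": centroid(text_samples), "binary": centroid(binary_samples)}
label = classify(b"hello world, this is plain text", models)
print(label)
```

In this sketch each model is a single 256-element mean vector, and classification picks the type with the highest similarity score. Variance terms, multiple centroids per type, and the recursive identification scheme mentioned in the abstract are deliberately left out.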

References

Exclusion option to skip the files for the scanning in Norton antivirus, http://service1.symantec.com/SUPPORT/nav.nsf/0/c829006aa01d540b852565a6007770d8?OpenDocument
Stegdetect, http://packages.debian.org/unstable/utils/stegdetect
Libmagic1 package, http://packages.debian.org/unstable/libs/libmagic1
Wang, K., Stolfo, S.J.: Anomalous Payload-based Network Intrusion Detection. In: Jonsson, E., Valdes, A., Almgren, M. (eds.) RAID 2004. LNCS, vol. 3224, pp. 203–222. Springer, Heidelberg (2004)
Ahmed, I., Lhee, K.-s.: Detection of malcodes by packet classification. In: Workshop on Privacy and Security by means of Artificial Intelligence, ARES 2008, pp. 1028–1035 (2008)
Li, W.J., Wang, K., Stolfo, S., Herzog, B.: Fileprints: Identifying File Types by n-gram Analysis. In: Workshop on Information Assurance and security (IAW 2005), United States Military Academy, West Point, NY, pp. 64–71 (2005)
Srinivasan, N., Vaidehi, V.: Reduction of False Alarm Rate in Detecting Network Anomaly using Mahalanobis Distance and Similarity Measure. In: Proceedings of ICSCN, pp. 366–371 (2007)
Tan, P.-N., Steinbach, M., Kumar, V.: Introduction to data mining. Addison-Wesley, Reading (2005)
Karresand, M., Shahmehri, N.: Oscar - file type identification of binary data in disk clusters and RAM pages. In: IFIP security and privacy in dynamic environments, pp. 413–424 (2006)
Karresand, M., Shahmehri, N.: File type identification of data fragments by their binary structure. In: Proceedings of the IEEE workshop on information assurance, pp. 140–147 (2006)
Veenman, C.J.: Statistical disk cluster classification for file carving. In: IEEE third international symposium on information assurance and security, pp. 393–398 (2007)
Rencher, A.C.: Methods of Multivariate Analysis. Wiley Interscience, Hoboken (2002)
File extensions, http://www.file-extension.com/
Magic numbers, http://qdn.qnx.com/support/docs/qnx4/utils/m/magic.html
Nachenberg, C.: Polymorphic virus detection module, United States Patent # 5,826,013 (1998)
Szor, P., Ferrie, P.: Hunting for metamorphic. In: Proceedings of Virus Bulletin Conference, pp. 123–144 (2001)
RIX, Writing IA32 Alphanumeric Shell codes, http://www.phrack.org/issues.html?issue=57&id=15#article
Eller, R.: Bypassing MSB Data Filters for Buffer Overflow Exploits on Intel platforms (2003), http://community.core-di.com/~juliano/bypassmsb.txt
McDaniel, M., Hossain Heydari, M.: Content Based File Type Detection Algorithms. In: Proceedings of the 36th Annual Hawaii International Conference on System Sciences (2003)
Kolmogorov, A.N.: Three approaches to the quantitative definition of information. Problems of Information Transmission, 1–11 (1965)
Calhoun, W.C., Coles, D.: Predicting the types of file fragments. Digital Investigation 5(1), 14–20 (2008)
Wang, K., Parekh, J.J., Stolfo, S.J.: Anagram: A Content Anomaly Detector Resistant to Mimicry Attack. In: Zamboni, D., Krügel, C. (eds.) RAID 2006. LNCS, vol. 4219, pp. 226–248. Springer, Heidelberg (2006)
Gu, G., Porras, P., Yegneswaran, V., Fong, M., Lee, W.: BotHunter: Detecting Malware Infection Through IDS-Driven Dialog Correlation. In: 16th USENIX Security Symposium (2007)
Ward, J.H.: Hierarchical grouping to optimize an objective function. Journal of the American Statistical Association, 235–244 (1963)

Author information

Authors and Affiliations

Digital Vaccine and Internet Immune System Lab, Graduate School of Information and Communication, Ajou University, South Korea
Irfan Ahmed, Kyung-suk Lhee & ManPyo Hong
Department of Industrial and Information Systems Engineering, Ajou University, South Korea
Hyunjung Shin

Editor information

Editors and Affiliations

Information Security Institute, Queensland University of Technology, GPO Box 2434, Qld 4001, Brisbane, Australia
Colin Boyd
Information Security Institute, Queensland Univ. of Technology, GPO Box 2434, QLD, 4001, Brisbane, Australia
Juan González Nieto

About this paper

Cite this paper

Ahmed, I., Lhee, Ks., Shin, H., Hong, M. (2009). On Improving the Accuracy and Performance of Content-Based File Type Identification. In: Boyd, C., González Nieto, J. (eds) Information Security and Privacy. ACISP 2009. Lecture Notes in Computer Science, vol 5594. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-02620-1_4

DOI: https://doi.org/10.1007/978-3-642-02620-1_4
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-02619-5
Online ISBN: 978-3-642-02620-1
eBook Packages: Computer Science, Computer Science (R0)
